Ceph Day Melbourne - Ceph on All-Flash Storage - Breaking Performance Barriers
1. Ceph on All-Flash Storage – Breaking Performance Barriers
Venkat Kolli
Dir. of Product Management
November 2015
2. Forward-Looking Statements
During our meeting today we will make forward-looking statements.
Any statement that refers to expectations, projections or other characterizations of future events or
circumstances is a forward-looking statement, including those relating to market growth, industry
trends, future products, product performance and product capabilities. This presentation also
contains forward-looking statements attributed to third parties, which reflect their projections as of the
date of issuance.
Actual results may differ materially from those expressed in these forward-looking statements due
to a number of risks and uncertainties, including the factors detailed under the caption “Risk Factors”
and elsewhere in the documents we file from time to time with the SEC, including our annual and
quarterly reports.
We undertake no obligation to update these forward-looking statements, which speak only as
of the date hereof or as of the date of issuance by a third party, as the case may be.
4. Designed for Big Data Workloads @ PB Scale
CONTENT REPOSITORIES
• Mixed media containers, active archiving, backup, locality of data
• Large containers with application SLAs
BIG DATA ANALYTICS
• Internet of Things, sensor analytics
• Time-to-value and time-to-insight
• Hadoop, NoSQL, Cassandra, MongoDB
MEDIA SERVICES
• High read-intensive access from billions of edge devices
• Hi-def video driving even greater demand for capacity and performance
• Surveillance systems, analytics
5. InfiniFlash System
All-Flash 3U Storage System
• Capacity: 512TB* raw
• 64 x 8TB flash cards with Pfail (power-fail protection)
• 8 SAS ports total
Flash Card Performance**
• Read throughput > 400MB/s
• Read IOPS > 20K
• Random read/write @4K (90/10) > 15K IOPS
Flash Card Integration
• Alerts and monitoring
• Latching integrated and monitored
• Integrated air-temperature sampling
Operational Efficiency and Resiliency
• Hot-swappable components, easy FRU
• Low power: 450W (avg), 750W (active)
• MTBF 1.5+ million hours
Scalable Performance**
• 1M IOPS
• 6-9GB/s throughput
• Upgrade to 12-15GB/s and 1.5M+ IOPS in 11/15
* 1TB = 1,000,000,000,000 bytes. Actual user capacity less.
** Based on internal testing of InfiniFlash 100. Test report available.
6. InfiniFlash IF500 All-Flash Storage System
Block and Object Storage Powered by Ceph
Ultra-dense High Capacity Flash storage
• 512TB in 3U, Scale-out software for PB scale capacity
Highly scalable performance
• Industry leading IOPS/TB
Cinder, Glance and Swift storage
• Add/remove server & capacity on-demand
Enterprise-Class storage features
• Automatic rebalancing
• Hot Software upgrade
• Snapshots, replication, thin provisioning
• Fully hot swappable, redundant
Ceph Optimized for SanDisk flash
• Tuned & Hardened for InfiniFlash
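Since the IF500 exposes Ceph's standard block and object interfaces, applications reach it through the usual Ceph client libraries. As a minimal illustration only (not an IF500-specific API), the librados C++ sketch below writes one object to a pool; the client id, config path, pool name, and object name are placeholder assumptions.

#include <rados/librados.hpp>
#include <iostream>

int main() {
  librados::Rados cluster;
  // Connect as client.admin using the local ceph.conf; both the client
  // id and the config path are assumptions for this sketch.
  cluster.init("admin");
  cluster.conf_read_file("/etc/ceph/ceph.conf");
  if (cluster.connect() < 0) {
    std::cerr << "failed to connect to cluster\n";
    return 1;
  }

  // "rbd" is just an example pool name; any pool on the cluster works.
  librados::IoCtx io;
  cluster.ioctx_create("rbd", io);

  // Write a small object; RBD, RGW, and Swift/Cinder/Glance integrations
  // ultimately issue object operations like this one.
  librados::bufferlist bl;
  bl.append("hello from librados");
  io.write_full("demo-object", bl);

  cluster.shutdown();
  return 0;
}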
7. Optimizing Ceph for the All-Flash Future
Work began in the summer of 2013 with the Ceph Dumpling release.
Ceph was optimized for HDD
• Both tuning AND algorithm changes were needed for flash optimization
• Defaults were left in place for HDD
Quickly determined that the OSD was the major bottleneck
• An OSD maxed out at about 1,000 IOPS on the fastest CPUs (using ~4.5 cores)
Examined and rejected running multiple OSDs per SSD
• The failure domains / CRUSH rules would be a nightmare
8. SanDisk: OSD Read Path Optimization
Context switches matter at flash rates
• Too much “put it in a queue for another thread”
• Too much lock contention
Socket handling matters too!
• Too many “get 1 byte” calls to the kernel for sockets
• Disable Nagle’s algorithm to shorten operation latency
Lots of other simple things
• Eliminate repeated look-ups in maps, caches, etc.
• Eliminate redundant string copies (especially returning strings by value)
• Fix large variables passed by value instead of by const reference
Contributed improvements to the Emperor, Firefly and Giant releases
Now obtain >80K IOPS per OSD using around 9 CPU cores per OSD (Hammer) *
* Internal testing normalized from 3 OSDs / 132GB DRAM / 8 Clients / 2.2 GHz XEON 2x8 Cores / Optimus Max SSDs
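Two of these fixes are easy to illustrate outside of Ceph. The sketch below is illustrative rather than Ceph's actual code: it disables Nagle's algorithm on a TCP socket (Ceph exposes the same behavior through its `ms tcp nodelay` option) and replaces a pass-by-value string parameter with a const reference to avoid a copy on every call.

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <string>

// Disable Nagle's algorithm so small messages are sent immediately
// instead of being coalesced; this trades a little bandwidth for
// latency, which is the right trade at flash speeds.
void disable_nagle(int connected_fd) {
  int one = 1;
  setsockopt(connected_fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

// Anti-pattern: the argument is copied on every call.
size_t payload_size_slow(std::string payload) { return payload.size(); }

// Fix: a const reference avoids the copy entirely.
size_t payload_size(const std::string& payload) { return payload.size(); }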
13. InfiniFlash Performance Advantage
900K random-read IOPS with 384TB of storage
Flash performance unleashed
• Out-of-the-box configurations tuned for performance with flash
• Read & write data-path changes for flash
• 3-12x block performance improvement, depending on workload
• Almost linear performance scaling with the addition of InfiniFlash nodes
• Write performance work in progress with NV-RAM journals
Measured with 3 InfiniFlash nodes of 128TB each:
• Average latency with 4K blocks is ~2ms; 99.9th-percentile latency is under 10ms
• For lower block sizes, performance is CPU-bound at the storage node
• Maximum bandwidth of 12.2GB/s measured toward 64KB blocks
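As a rough density figure derived from the headline numbers above, 900K random-read IOPS across 384TB of raw flash works out to:

\[
\frac{900{,}000\ \text{IOPS}}{384\ \text{TB}} \approx 2{,}344\ \text{IOPS per raw TB}
\]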
14. SanDisk: OSD Write Path Optimization
The write path strategy was classic HDD
• Journal writes for minimum foreground latency
• Process the journal in batches in the background
The batch-oriented processing was very inefficient on flash
Modified the buffering/writing strategy for flash
• Recently committed to the Jewel release
• Yields a 2.5x write throughput improvement over Hammer
• Average latency is half that of Hammer
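The contrast is roughly the following, in a deliberately simplified sketch that is not Ceph's actual code: the HDD-era path parks each journaled write on a queue for a background thread to apply in batches, while the flash-friendly path applies the write as soon as it is journaled, skipping the queueing delay.

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// HDD-era strategy: journaled writes are queued and a background thread
// applies them in large batches, which keeps a disk head busy but adds
// latency that flash does not need.
class BatchedApplier {
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool stop_ = false;
  std::thread worker_{[this] { run(); }};

  void run() {
    while (true) {
      std::vector<std::function<void()>> batch;
      {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [this] { return stop_ || !queue_.empty(); });
        if (stop_ && queue_.empty()) return;
        // Drain everything that has accumulated and apply it as one batch.
        while (!queue_.empty()) {
          batch.push_back(std::move(queue_.front()));
          queue_.pop_front();
        }
      }
      for (auto& op : batch) op();
    }
  }

 public:
  void submit(std::function<void()> op) {
    { std::lock_guard<std::mutex> l(m_); queue_.push_back(std::move(op)); }
    cv_.notify_one();
  }
  ~BatchedApplier() {
    { std::lock_guard<std::mutex> l(m_); stop_ = true; }
    cv_.notify_one();
    worker_.join();
  }
};

// Flash-friendly strategy: apply the write as soon as it is journaled,
// avoiding the queueing delay entirely.
inline void submit_direct(std::function<void()> op) { op(); }

int main() {
  BatchedApplier hdd_path;
  hdd_path.submit([] { /* apply a journaled write to the data store */ });
  submit_direct([] { /* flash path: apply the write immediately */ });
}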
15. SanDisk: Potential Future Improvements
RDMA intra-cluster communication
• Significant reduction in CPU per IOP
NewStore
• Significant reduction in write amplification -> even higher write performance
Memory allocation
• tcmalloc/jemalloc/AsyncMessenger tuning shows up to 3x IOPS vs. default *
* https://drive.google.com/file/d/0B2gTBZrkrnpZY3U3TUU3RkJVeVk/view
16. Enhancing Ceph for Enterprise Consumption
IFOS = community Ceph distribution + utilities: open source Ceph + SanDisk performance patches
• Out-of-the-box configurations tuned for performance with flash
• Sizing & planning tool
• Higher node resiliency with multi-path support
• Persistent reservations of drives to nodes
• A Ceph installer built specifically for InfiniFlash
• High-performance iSCSI storage
• Better diagnostics with a log-collection tool
• Enterprise-hardened QA @scale
• InfiniFlash drive management integrated into Ceph management (coming soon)
SanDisk adds usability & performance utilities without sacrificing open-source principles; all Ceph performance improvements developed by SanDisk are contributed back to the community.
17. Open Source with SanDisk Advantage
InfiniFlash OS – Enterprise-Level Hardened Ceph
• The innovation and speed of open source, with the trustworthiness of enterprise-grade and web-scale testing and hardware optimization
• Performance optimization for flash and hardware tuning
• Hardened and tested for hyperscale deployments and workloads
• Enterprise-class support and services from SanDisk
• Risk mitigation through long-term support and a reliable long-term roadmap
• Continual contribution back to the community
Enterprise-Level Hardening
• 9,000 hours of cumulative IO tests
• 1,100+ unique test cases
Testing at Hyperscale
• 1,000 hours of cluster rebalancing tests
• 1,000 hours of IO on iSCSI
• Over 100 server-node clusters
• Over 4PB of flash storage
Failure Testing
• 2,000-cycle node reboots
• 1,000 abrupt node power cycles
• 1,000 storage failures
• 1,000 network failures
• IO for 250 hours at a stretch
18. InfiniFlash for OpenStack with Disaggregation
Compute & storage disaggregation enables optimal resource utilization
• Allows more CPU to be provisioned, as required for OSDs with small-block workloads
• Allows higher bandwidth provisioning, as required for large-object workloads
Independent scaling of compute and storage
• Higher storage capacity needs don't force you to add more compute, and vice versa
• Leads to optimal ROI for PB-scale OpenStack deployments
[Diagram: a Compute Farm (QEMU/KVM with LibRBD for Nova with Cinder & Glance; WebServer with RGW for the Swift object store; iSCSI Target with KRBD for iSCSI LUNs) connected over SAS to a Storage Farm of InfiniFlash enclosures (HSEB A / HSEB B) running the OSDs]
19. Flash + HDD with Data Tiering
Flash performance with the TCO of HDD
• InfiniFlash OS performs automatic data placement and data movement between tiers, transparent to applications
• User-defined policies govern data placement on tiers
• Can be used with erasure coding to further reduce TCO
Benefits
• Flash-based performance with HDD-like TCO
• Lower performance requirements on the HDD tier enable the use of denser and cheaper SMR drives
• Denser and lower power compared to an HDD-only solution
[Diagram: a Compute Farm in front of InfiniFlash for high-activity data and SMR-drive servers (60+ HDDs per server) for low-activity data]
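Ceph's cache tiering (configured with commands such as `ceph osd tier add`) is one way to realize this. Purely as a conceptual sketch, and with a made-up access-count policy rather than anything from InfiniFlash OS, tier placement reduces to something like:

#include <cstdint>

// Conceptual sketch only: place hot data on the flash tier and cold
// data on the HDD/SMR tier. The threshold is a hypothetical
// user-defined policy parameter.
enum class Tier { Flash, HddSmr };

Tier place_object(uint64_t recent_accesses, uint64_t hot_threshold) {
  return recent_accesses >= hot_threshold ? Tier::Flash : Tier::HddSmr;
}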
20. Flash Primary + HDD Replicas
Flash performance with the TCO of HDD
• Primary replica on InfiniFlash
• HDD-based data node for the 2nd, local replica
• HDD-based data node for the 3rd, DR replica
Higher affinity of the primary replica ensures much of the compute runs against InfiniFlash data
• The 2nd and 3rd replicas on HDDs are primarily for data protection
• The high throughput of InfiniFlash handles data protection and movement for all replicas without impacting application IO
• Eliminates the cascade data-propagation requirement for HDD replicas
• Flash-accelerated object performance for replica 1 allows denser and cheaper SMR HDDs for replicas 2 and 3
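In Ceph this placement is steered with the per-OSD primary-affinity weight (`ceph osd primary-affinity osd.N <weight>`). Conceptually, in a simplified sketch that is not Ceph's CRUSH code and uses made-up OSD names, the acting primary is the highest-affinity replica:

#include <algorithm>
#include <string>
#include <vector>

// Flash-backed OSDs get a high primary affinity so they serve reads as
// the primary, while HDD OSDs hold protection copies.
struct Osd {
  std::string name;
  double primary_affinity;  // 0.0 .. 1.0
};

// Pick the acting primary: the replica with the highest affinity.
// Precondition: replicas is non-empty.
const Osd& pick_primary(const std::vector<Osd>& replicas) {
  return *std::max_element(replicas.begin(), replicas.end(),
                           [](const Osd& a, const Osd& b) {
                             return a.primary_affinity < b.primary_affinity;
                           });
}

int main() {
  std::vector<Osd> pg_replicas = {
      {"osd.3 (InfiniFlash)", 1.0},  // flash: preferred as primary
      {"osd.7 (HDD, local)", 0.0},   // protection copy only
      {"osd.9 (HDD, DR)", 0.0},
  };
  const Osd& primary = pick_primary(pg_replicas);  // -> osd.3
  (void)primary;
}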
21. InfiniFlash TCO Advantage
[Chart: 3-year TCO comparison * (TCA plus 3-year OpEx, $0-$80M) across four configurations: Traditional ObjStore on HDD; IF500 ObjStore with 3 full replicas on flash; IF500 with EC, all flash; IF500 with flash primary & HDD copies]
[Chart: total racks (0-100) for the same four configurations]
Reduce the replica count with the higher reliability of flash
• 2 copies on InfiniFlash vs. 3 copies on HDD
InfiniFlash's disaggregated architecture reduces compute usage, thereby reducing HW & SW costs
• Flash allows the use of an erasure-coded storage pool without performance limitations
• Protection equivalent to 2x storage with only 1.2x storage
Power, real estate, and maintenance cost savings over a 5-year TCO
* TCO analysis based on a US customer’s OPEX & Cost data for a 100PB deployment
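The 1.2x figure follows from the erasure-code geometry. For example (the exact parameters are not given on the slide), a hypothetical 10+2 code stores k = 10 data chunks plus m = 2 coding chunks:

\[
\text{raw-to-usable overhead} = \frac{k+m}{k} = \frac{10+2}{10} = 1.2\times
\]

compared with \(3\times\) for 3-way HDD replication and \(2\times\) for 2 copies on flash.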