Ceph on All-Flash Storage –
Breaking Performance Barriers
Venkat Kolli
Dir. of Product Management
November, 2015
Forward-Looking Statements
During our meeting today we will make forward-looking statements.
Any statement that refers to expectations, projections or other characterizations of future events or
circumstances is a forward-looking statement, including those relating to market growth, industry
trends, future products, product performance and product capabilities. This presentation also
contains forward-looking statements attributed to third parties, which reflect their projections as of the
date of issuance.
Actual results may differ materially from those expressed in these forward-looking statements due
to a number of risks and uncertainties, including the factors detailed under the caption “Risk Factors”
and elsewhere in the documents we file from time to time with the SEC, including our annual and
quarterly reports.
We undertake no obligation to update these forward-looking statements, which speak only as
of the date hereof or as of the date of issuance by a third party, as the case may be.
InfiniFlash™ System
The First All-Flash Storage System Built for High Performance Ceph
Designed for Big Data Workloads @ PB Scale
CONTENT REPOSITORIES
 Mixed media container, active-archiving, backup, locality of data
 Large containers with application SLAs
BIG DATA ANALYTICS
 Internet of Things, Sensor Analytics
 Time-to-Value and Time-to-Insight
 Hadoop, NoSQL, Cassandra, MongoDB
MEDIA SERVICES
 High read intensive access from billions of edge devices
 Hi-Def video driving even greater demand for capacity and performance
 Surveillance systems, analytics
Flash Card Performance**
 Read Throughput > 400MB/s
 Read IOPS > 20K
 Random Read/Write @4K (90/10) > 15K IOPS
Flash Card Integration
 Alerts and monitoring
 Latching integrated
and monitored
 Integrated air temperature
sampling
InfiniFlash System
Capacity 512TB* raw
 All-Flash 3U Storage System
 64 x 8TB Flash Cards with Pfail
 8 SAS ports total
Operational Efficiency and Resilience
 Hot Swappable components, Easy
FRU
 Low power 450W(avg), 750W(active)
 MTBF 1.5+ million hours
Scalable Performance**
 1M IOPS
 6-9GB/s Throughput
 Upgrade to 12-15GB/s 1.5+M IOPS in
11/15
* 1TB = 1,000,000,000,000 bytes. Actual user capacity less.
** Based on internal testing of InfiniFlash 100. Test report available.
InfiniFlash IF500 All-Flash Storage System
Block and Object Storage Powered by Ceph
 Ultra-dense High Capacity Flash storage
• 512TB in 3U, Scale-out software for PB scale capacity
 Highly scalable performance
• Industry leading IOPS/TB
 Cinder, Glance and Swift storage
• Add/remove server & capacity on-demand
 Enterprise-Class storage features
• Automatic rebalancing
• Hot Software upgrade
• Snapshots, replication, thin provisioning
• Fully hot swappable, redundant
 Ceph Optimized for SanDisk flash
• Tuned & Hardened for InfiniFlash
 Began in summer of ‘13 with the Ceph Dumpling release
 Ceph optimized for HDD
• Tuning AND algorithm changes needed for Flash optimization
• Leave defaults for HDD
 Quickly determined that the OSD was the major bottleneck
• OSD maxed out at about 1000 IOPS on fastest CPUs (using ~4.5 cores)
 Examined and rejected multiple OSDs per SSD
• Failure Domain / Crush rules would be a nightmare
Optimizing Ceph for the All-Flash Future
 Context switches matter at flash rates
• Too much “put it in a queue for another thread”
• Too much lock contention
 Socket handling matters too!
• Too many “get 1 byte” calls to the kernel for sockets
• Disable Nagle’s algorithm to shorten operation latency (see the sketch after this slide)
 Lots of other simple things
• Eliminate repeated look-ups in maps, caches, etc.
• Eliminate redundant string copies (especially string return values)
• Pass large variables by const reference instead of by value
 Contributed improvements to Emperor, Firefly and Giant releases
 Now obtain >80K IOPS per OSD using around 9 CPU cores per OSD (Hammer) *
SanDisk: OSD Read path Optimization
* Internal testing normalized from 3 OSDs / 132GB DRAM / 8 Clients / 2.2 GHz XEON 2x8 Cores / Optimus Max SSDs
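A minimal sketch of the socket-level points above, in Python rather than Ceph's actual C++ messenger code; the address and port are placeholders for an OSD endpoint, not values from the deck.

```python
import socket


def make_low_latency_socket(addr: str, port: int) -> socket.socket:
    """Open a TCP connection with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Disable Nagle's algorithm: small messages are sent immediately instead of
    # being coalesced, which shortens per-operation latency at flash speeds.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.connect((addr, port))
    return sock


def read_frame(sock: socket.socket, length: int) -> bytes:
    """Read a whole frame with few kernel calls rather than byte-at-a-time recv()."""
    buf = bytearray()
    while len(buf) < length:
        chunk = sock.recv(length - len(buf))  # one syscall per large chunk
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)
```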
Test Configuration – Single InfiniFlash System
Performance Config – InfiniFlash with a 2-storage-controller configuration
2-node cluster (32 drives shared to each OSD node)

OSD Node: 2 servers (Dell R720), 2x E5-2680 8C 2.8GHz 25MB cache, 4x 16GB RDIMM dual rank x4 (64GB), 1x Mellanox X3 dual 40GbE, 1x LSI 9207 HBA card
RBD Client: 4 servers (Dell R620), 2x E5-2680 10C 2.8GHz 25MB cache, 2x 16GB RDIMM dual rank x4 (32GB), 1x Mellanox X3 dual 40GbE

Storage – InfiniFlash with 512TB and 2 OSD servers
InfiniFlash: one InfiniFlash connected to 64 x 1YX2 Icechips in A2 topology; total storage 64 x 8TB = 512TB (effective 430TB); firmware FFU 1.0.0.31.1

Network Details
40G Switch: NA

OS Details
OS: Ubuntu 14.04 LTS 64-bit, kernel 3.13.0-32
LSI card/driver: SAS2308 (9207), mpt2sas
Mellanox 40Gbps NIC: MT27500 [ConnectX-3], mlx4_en 2.2-1 (Feb 2014)

Cluster Configuration
CEPH Version: sndk-ifos-1.0.0.04 (0.86.rc.eap2)
Replication (default): 2 [Host]; host-level replication
Number of Pools, PGs & RBDs: 4 pools, 2048 PGs per pool, 2 RBDs from each pool
RBD size: 2TB
Number of Monitors: 1
Number of OSD Nodes: 2
Number of OSDs per Node: 32 (total OSDs = 32 x 2 = 64)
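For readers who want to recreate a similar layout, the sketch below uses the standard python-rados and python-rbd bindings to build the pool/RBD arrangement in the table (4 pools with 2048 PGs each, 2 x 2TB RBD images per pool). The pool and image names are made up; the deck does not give the exact commands used in the test.

```python
import json

import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    for p in range(4):                        # 4 pools, 2048 PGs per pool
        pool = "rbdpool{}".format(p)          # hypothetical pool name
        cluster.mon_command(json.dumps({
            "prefix": "osd pool create",
            "pool": pool,
            "pg_num": 2048,
            "pgp_num": 2048,
        }), b"")
        ioctx = cluster.open_ioctx(pool)
        try:
            for i in range(2):                # 2 RBDs from each pool
                # 2TB image, matching the "RBD size 2TB" row above
                rbd.RBD().create(ioctx, "rbd{}".format(i), 2 * 1024 ** 4)
        finally:
            ioctx.close()
finally:
    cluster.shutdown()
```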
Performance Improvement: Stock Ceph vs IF OS
8K Random Blocks
Read Performance improves 3x to 12x depending on the Block size
[Charts: IOPS and average latency (ms) for Stock Ceph (Giant) vs. IFOS 1.0, 8K random blocks; top row: queue depth (1, 4, 16), bottom row: % read IOs (0, 25, 50, 75, 100)]
• 2 RBD/Client x Total 4 Clients
• 1 InfiniFlash node with 512TB
Test Configuration – 3 InfiniFlash Systems (128TB each)
InfiniFlash Performance Advantage
900K random read IOPS with 384TB of storage
Flash Performance unleashed
• Out-of-the-box configurations tuned for performance with Flash
• Read & Write data-path changes for Flash
• 3x-12x block performance improvement, depending on workload
• Almost linear performance scaling with the addition of InfiniFlash nodes
• Write performance WIP with NV-RAM journals
• Measured with 3 InfiniFlash nodes with 128TB each
• Avg latency with 4K blocks is ~2ms; 99.9th percentile latency is under 10ms
• For lower block sizes, performance is CPU-bound at the storage node
• Maximum bandwidth of 12.2GB/s measured at 64KB block size
 Write path strategy was classic HDD
• Journal writes for minimum foreground latency
• Process journal in batches in the background
 The batch oriented processing was very inefficient on flash
 Modified buffering/writing strategy for Flash
• Recently committed to Jewel release
• Yields 2.5x write throughput improvement over Hammer
• Average latency is ½ of Hammer
SanDisk: OSD Write path Optimization
 RDMA intra-cluster communication
• Significant reduction in CPU / IOP
 NewStore
• Significant reduction in write amplification -> even higher write
performance
 Memory allocation
• tcmalloc/jemalloc/AsyncMessenger tuning shows up to 3x IOPS vs.
default *
* https://drive.google.com/file/d/0B2gTBZrkrnpZY3U3TUU3RkJVeVk/view
SanDisk: Potential Future Improvements
Enhancing CEPH for Enterprise Consumption
Open Source Ceph
+ SanDisk
Performance
patches
 Out-of-the-Box configurations tuned for performance with Flash
 Sizing & Planning Tool
 Higher node resiliency with
Multi-Path support
 Persistent reservations of
drives to nodes
 Ceph Installer that is specifically built for InfiniFlash
 High Performance iSCSI Storage
 Better Diagnostics with Log Collection Tool
 Enterprise hardened QA @scale
 InfiniFlash Drive Management integrated into Ceph Management (Coming Soon)
SanDisk adds Usability & Performance utilities without sacrificing Open Source Principles
IFOS = Community Ceph Distribution + Utilities
All Ceph Performance improvements developed by SanDisk are contributed back to community
Open Source with SanDisk Advantage
InfiniFlash OS – Enterprise Level Hardened Ceph
 Innovation and speed of Open Source with the trustworthiness of enterprise-grade and web-scale testing and hardware optimization
 Performance optimization for flash and
hardware tuning
 Hardened and tested for Hyperscale
deployments and workloads
 Enterprise class support and services
from SanDisk
 Risk mitigation through long term support
and a reliable long term roadmap
 Continual contribution back to the community
Enterprise Level Hardening
 9,000 hours of cumulative IO tests
 1,100+ unique test cases
 1,000 hours of Cluster Rebalancing tests
 1,000 hours of IO on iSCSI
Testing at Hyperscale
 Over 100 server node clusters
 Over 4PB of Flash Storage
Failure Testing
 2,000 Cycle Node Reboot
 1,000 times Node Abrupt Power Cycle
 1,000 times Storage Failure
 1,000 times Network Failure
 IO for 250 hours at a stretch
InfiniFlash for OpenStack with Disaggregation
 Compute & Storage Disaggregation enables
Optimal Resource utilization
 Allows provisioning more CPU for OSDs, as required by small-block workloads
 Allows provisioning more bandwidth, as required by large-object workloads
 Independent Scaling of Compute and
Storage
 Higher storage capacity needs don't force you to add more compute, and vice versa
 Leads to optimal ROI for PB-scale OpenStack deployments
[Diagram: a compute farm running QEMU/KVM with LibRBD (Nova with Cinder & Glance), RGW web servers (Swift object store), and KRBD-based iSCSI targets (iSCSI LUNs), backed by a storage farm of OSD nodes attached over SAS to InfiniFlash HSEB A / HSEB B enclosures]
Flash + HDD with Data Tiering
Flash Performance with TCO of HDD
 InfiniFlash OS performs automatic data placement and data movement between tiers, transparent to applications (see the sketch below)
 User-defined policies for data placement on tiers
 Can be used with Erasure coding to further
reduce the TCO
Benefits
 Flash based performance with HDD like TCO
 Lower performance requirements on the HDD tier enable use of denser and cheaper SMR drives
 Denser and lower power compared to an HDD-only solution
 InfiniFlash for high-activity data and SMR drives for low-activity data
 60+ HDDs per server
Compute Farm
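The sketch referenced above illustrates the underlying mechanism with stock Ceph's cache-tiering commands, issued through the python-rados mon_command interface. The pool names ("hot-flash", "cold-hdd") are hypothetical, and the IFOS-specific policy tooling is not reproduced here; this is only a minimal sketch of placing a flash pool in front of an HDD pool.

```python
import json

import rados


def run(cluster, cmd):
    """Send one mon command (as JSON) and raise on a non-zero return code."""
    ret, outbuf, outs = cluster.mon_command(json.dumps(cmd), b"")
    if ret != 0:
        raise RuntimeError(outs)


cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    # Put the flash pool in front of the HDD pool as a writeback cache tier.
    run(cluster, {"prefix": "osd tier add", "pool": "cold-hdd", "tierpool": "hot-flash"})
    run(cluster, {"prefix": "osd tier cache-mode", "pool": "hot-flash", "mode": "writeback"})
    run(cluster, {"prefix": "osd tier set-overlay", "pool": "cold-hdd", "overlaypool": "hot-flash"})
    # Track object hotness so cold objects can later be flushed/evicted to HDD;
    # a real deployment would also set hit_set_count/period and target_max_bytes.
    run(cluster, {"prefix": "osd pool set", "pool": "hot-flash", "var": "hit_set_type", "val": "bloom"})
finally:
    cluster.shutdown()
```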
Flash Primary + HDD Replicas
Flash Performance with TCO of HDD
Primary replica on
InfiniFlash
HDD based data node
for 2nd local replica
HDD based data node
for 3rd DR replica
 Higher affinity of the primary replica ensures much of the compute is on InfiniFlash data (see the sketch below)
 2nd and 3rd replicas on HDDs are primarily for data
protection
 High throughput of InfiniFlash provides data protection and movement for all replicas without impacting application IO
 Eliminates cascade data propagation requirement
for HDD replicas
 Flash-based accelerated Object performance for
Replica 1 allows for denser and cheaper SMR HDDs
for Replica 2 and 3
Compute Farm
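As a rough illustration of the primary-on-flash idea in the bullet above, the sketch below uses Ceph's primary affinity setting to keep HDD-backed OSDs from being selected as primaries, so client IO lands on the flash copy while the HDD OSDs still hold replicas. The OSD ids are placeholders, and releases of this era require "mon osd allow primary affinity = true" before the setting takes effect.

```python
import json

import rados

HDD_OSD_IDS = [12, 13, 14]  # hypothetical ids of HDD-backed OSDs

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    for osd_id in HDD_OSD_IDS:
        ret, outbuf, outs = cluster.mon_command(json.dumps({
            "prefix": "osd primary-affinity",
            "id": "osd.{}".format(osd_id),
            "weight": 0.0,  # never chosen as primary; still serves replicas
        }), b"")
        if ret != 0:
            raise RuntimeError(outs)
finally:
    cluster.shutdown()
```

In practice this would be paired with CRUSH rules that place one replica on flash hosts and the remaining replicas on HDD hosts.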
InfiniFlash TCO Advantage
[Charts: 3-year TCO comparison* (TCA plus 3-year Opex) and Total Rack count across four configurations: Traditional ObjStore on HDD, IF500 ObjStore w/ 3 Full Replicas on Flash, IF500 w/ EC - All Flash, IF500 - Flash Primary & HDD Copies]
 Reduce the replica count with higher
reliability of flash
- 2 copies on InfiniFlash vs. 3 copies on
HDD
 InfiniFlash disaggregated architecture
reduces compute usage, thereby
reducing HW & SW costs
- Flash allows the use of erasure coded
storage pool without performance
limitations
- Protection equivalent of 2x storage with only 1.2x storage (worked example below)
 Power, real estate, maintenance cost
savings over 5 year TCO
* TCO analysis based on a US customer’s OPEX & Cost data for a 100PB deployment
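As a worked example of the 1.2x figure above: the deck does not state the erasure-code profile, so a k = 5 (data chunks), m = 1 (coding chunk) code is assumed here purely for illustration.

```latex
% Assumed EC profile: k = 5 data chunks, m = 1 coding chunk (not stated in the deck)
\text{EC raw overhead} = \frac{k+m}{k} = \frac{5+1}{5} = 1.2\times
\qquad \text{vs.} \qquad
\text{2 full copies} = 2.0\times
```

Both layouts survive the loss of any single chunk or copy, so the protection is comparable while the raw capacity required drops from 2.0x to 1.2x.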
Thank You! @BigDataFlash
#bigdataflash
©2015 SanDisk Corporation. All rights reserved. SanDisk is a trademark of SanDisk Corporation, registered in the United States and other countries. InfiniFlash is a trademark of SanDisk Enterprise IP LLC. All other product and company names are used for identification purposes and may be trademarks of their respective holder(s).
Speaker Notes
  1. Video continues to drive the need for storage, and Point-Of-View cameras like GoPro are producing compelling high resolution videos on our performance cards. People using smartphones to make high resolution videos choose our performance mobile cards also, driving the need for higher capacities. There is a growing customer base for us around the world, with one billion additional people joining the Global Middle Class between 2013 and 2020. These people will use smart mobile devices as their first choice to spend discretionary income on, and will expand their storage using removable cards and USB drives. We are not standing still, but creating new product categories to allow people to expand and share their most cherished memories.
  2. X2 is 2 bits per cell, X3 is 3 bits per cell
  3. Performance: Shorter jobs by 4x (per study, flash enablement). Share Compute with other infrastructure (a win for any company with seasonality). Flexible & Elastic Storage Platform to handle MapReduce load spikes