Stabilizing Ceph
Or “How to Get your Executives to Forget Who You Are”
Yuming Ma and Seth Mason
Cisco
March 30, 2016
Highlights
1.  What are we doing with Ceph?
2.  What did we start with?
3.  We’re Gonna Need a Bigger Boat
4.  Getting Better and sleeping through the night
5.  Lessons learned
Cisco Cloud Services provides an OpenStack platform to Cisco SaaS applications and tenants through a worldwide deployment of datacenters.
Background
SaaS Cases
•  Collaboration
•  IoT
•  Security
•  Analytics
•  “Unknown Projects”
Swift
•  Database (Trove) backups
•  Static content
•  Cold/offline data for Hadoop
Cinder
•  Generic/magnetic volumes
•  Low performance
•  Boot volumes for all VM flavors except those with ephemeral (local) storage
•  Glance image store
•  Generic Cinder volumes
•  Swift object store
•  In production since March 2014
•  13 clusters in production in two years
•  Each cluster is 1800TB raw over 45 nodes and 450 OSDs
How Do We Use Ceph?
[Diagram: Ceph on Cisco UCS, serving generic and provisioned-IOPS volumes through the Cinder API and object storage through the Swift API, alongside a separate high-performance platform]
•  Nice consistent growth…
•  Your users will not warn you before:
   •  “going live”
   •  Migrating out of S3
   •  Backing up a Hadoop cluster
Growth: It will happen, just not sure when
CCS Ceph 1.0
[Diagram: three racks of 15 UCS C240 nodes, with monitors spread across the racks. OpenStack (Keystone, Swift, Cinder, Glance, and Nova APIs) reaches Ceph through the RADOS Gateway and the Ceph block device (RBD) via libvirt/kvm, all on the Ceph libRADOS API. Networking is 2x10Gb public and 2x10Gb private. On each OSD node the OS lives on a RAID1 mirror, and each of the 10 HDDs behind the LSI 9271 HBA carries an XFS data partition plus a journal partition for one of OSD1-OSD10.]
OSD: 45 x UCS C240 M3
•  2x E5-2690 v2, 40 HT cores
•  64GB RAM
•  2x10Gb for public
•  2x10Gb for cluster
•  3x replication
•  LSI 9271 HBA
•  10 x 4TB HDD, 7200 RPM
•  10GB journal partition from HDD
•  RHEL 7, kernel 3.10.0-229.1.2.el7.x86_64
NOVA: UCS C220
•  Ceph 0.94.1
•  RHEL 7, kernel 3.10.0-229.4.2.el7.x86_64
MON/RGW: UCS C220 M3
•  2x E5-2680 v2, 40 HT cores
•  64GB RAM
•  2x10Gb for public
•  4 x 3TB HDD, 7200 RPM
•  RHEL 7, kernel 3.10.0-229.4.2.el7.x86_64
Started with Cuttlefish/Dumpling
•  Get to MVP and keep costs down
•  High capacity, hence C240 M3 LFF for 4TB HDDs
•  Tradeoff was that the C240 M3 LFF could not also accommodate SSDs
•  So journals were colocated on the OSD data disks
•  Monitors were on HDD-based systems as well
Initial Design Considerations
Major Stability Problems: Monitors
Problem: MON election storm impacting client IO
Impact: Monmap changes due to a flaky NIC or chatty messaging between MON and client caused an unstable quorum and an election storm between MON hosts.
Result: blocked and slowed client IO requests.

Problem: LevelDB inflation
Impact: LevelDB size grows to XXGB over time, which prevents the MON daemon from serving OSD requests.
Result: blocked IO and slow requests.

Problem: DDoS due to chatty client message attack
Impact: Slow responses from MON to clients, caused by LevelDB or an election storm, trigger a message flood from clients.
Result: failed client operations, e.g. volume creation, RBD connection.
Major Stability Problems: Cluster
Problem: Backfill & recovery impacting client IO
Impact: Osdmap changes due to the loss of a disk result in PG peering and backfilling.
Result: clients receive blocked and slow IO.

Problem: Unbalanced data distribution
Impact: Data on OSDs isn’t evenly distributed; the cluster may be 50% full, but some OSDs are at 90%.
Result: backfill isn’t always able to complete.

Problem: Slow disk impacting client IO
Impact: A single slow (sick, not dead) OSD can severely impact many clients until it’s ejected from the cluster.
Result: clients have slow or blocked IO.
Improvement Strategy
Strategy: Client IO throttling*
Improvement: Rate-limit IOPS at the Nova host to 250 IOPS per volume.

Strategy: Backfill and recovery throttling
Improvement: Reduced IO consumption by backfill and recovery processes so they yield to client IO.

Strategy: Retrofit with NVMe (PCIe) journals
Improvement: Increased overall IOPS of the cluster.

Strategy: Upgrade to 1.2.3/1.3.2
Improvement: Overall stability, and hardened MONs that prevent election storms.

Strategy: LevelDB on SSD (replaced entire MON node)
Improvement: Faster cluster map queries.

Strategy: Re-weight by utilization
Improvement: Balanced data distribution.

*Client is the RBD client, not the tenant.
•  Limit/cap IO consumption at the qemu layer:
   •  iops (IOPS, read and write): 250
   •  bps (bytes per second, read and write): 100 MB/s
•  Predictable and controlled IOPS capacity
•  No min/guaranteed IOPS -> future Ceph feature
•  No burst cap -> qemu feature:
   •  iops_max: 500
   •  bps_max: 120 MB/s
Client IO throttling
[Charts: per-volume IOPS swing of ~100% without throttling vs ~12% with throttling]
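In practice these caps can be pushed down to qemu by Nova; a minimal sketch using flavor extra specs (the flavor name ceph-general is hypothetical; quota:disk_total_iops_sec and quota:disk_total_bytes_sec are standard Nova extra-spec keys that libvirt applies as <iotune> throttles):

   # Cap every disk of instances built from this flavor at 250 IOPS and 100 MB/s.
   nova flavor-key ceph-general set \
     quota:disk_total_iops_sec=250 \
     quota:disk_total_bytes_sec=104857600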
•  Problem
•  Blocked IO during peering
•  Slow requests during backfill
•  Both could cause client IO stall
and soft lockup
•  Solution
•  Throttling backfill and recovery
osd recovery max active = 3 (default: 15)
osd recovery op priority = 3 (default: 10)
osd max backfills = 1 (default: 10)
Backfill and Recovery Throttling
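The same throttles can also be applied to a running cluster without restarting daemons; a sketch using injectargs:

   # Push the backfill/recovery throttles to all OSDs at runtime.
   ceph tell osd.* injectargs '--osd-recovery-max-active 3 --osd-recovery-op-priority 3 --osd-max-backfills 1'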
NVMe Journaling: Performance Testing Setup
[Diagram: test topology of three racks (6, 5, and 6 OSD nodes, 10 OSDs each) serving ten Nova hosts running ~20 VMs each. On each OSD node the OS lives on a RAID1 mirror, the 10 data HDDs sit behind the LSI 9271 HBA as single-disk RAID0 volumes with XFS, and the 10 journals move to NVMe partitions, leaving ~300GB of the NVMe free. Journal partitions start at 4MB (s1024), are 10GB each, with a 4MB offset between partitions.]

OSD: C240 M3
•  2x E5-2690 v2, 40 HT cores
•  64GB RAM
•  2x10Gb for public
•  2x10Gb for cluster
•  3x replication
•  Intel P3700 400GB NVMe
•  LSI 9271 HBA
•  10 x 4TB HDD, 7200 RPM
Nova: C220
•  2x E5-2680 v2, 40 HT cores
•  380GB RAM
•  2x10Gb for public
•  3.10.0-229.4.2.el7.x86_64
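The journal partition layout above could be scripted roughly as follows; a sketch, assuming the P3700 shows up as /dev/nvme0n1 and that sgdisk is used:

   # Ten 10GB journal partitions, starting at 4MB, with a 4MB gap between
   # partitions; the remaining ~300GB of the 400GB NVMe stays unpartitioned.
   start=4
   for i in $(seq 1 10); do
     sgdisk --new=${i}:${start}M:+10G /dev/nvme0n1
     start=$((start + 10240 + 4))   # next start: previous + 10GB partition + 4MB gap, in MB
   done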
NVMe Journaling: Performance Tuning
OSD host iostat:
•  Both NVMe and HDD %util are low most of the time, with spikes every ~45s.
•  Both NVMe and HDD have very low queue size (iodepth) while the frontend VM pushes 16 qdepth to FIO.
•  CPU %used is reasonable, converging at <30%, but iowait is low, which corresponds to the low disk activity.
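The guest-side load referenced here was generated with FIO; a representative sketch (device path, block size, and runtime are assumptions):

   # Random writes at queue depth 16 against the attached RBD-backed volume.
   fio --name=vm-load --filename=/dev/vdb --rw=randwrite --bs=4k \
       --iodepth=16 --ioengine=libaio --direct=1 --time_based --runtime=300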
NVMe Journaling: Performance Tuning
Tuning directions: increase disk %util:
•  Disk threads: 4, 16, 32
•  Filestore max sync interval: 0.1, 0.2, 0.5, 1, 5, 10, 20
•  These two tunings showed no impact:
filestore_wbthrottle_xfs_ios_start_flusher: default 500 vs 10
filestore_wbthrottle_xfs_inodes_start_flusher: default 500 vs 10
•  Final Config:
osd_journal_size = 10240 (default: …)
journal_max_write_entries = 1000 (default: 100)
journal_max_write_bytes = 1048576000 (default: 10485760)
journal_queue_max_bytes = 1048576000 (default: 10485760)
filestore_queue_max_bytes = 1048576000 (default: 10485760)
filestore_queue_committing_max_bytes = 1048576000 (default: 10485760)
filestore_wbthrottle_xfs_bytes_start_flusher = 4194304 (default: 10485760)
NVMe Performance Tuning
Linear tuning of filestore_wbthrottle_xfs_bytes_start_flusher, filestore_wbthrottle_xfs_inodes_start_flusher, and filestore_wbthrottle_xfs_ios_start_flusher.
NVMe Stability Improvement Analysis
Journal               | One Disk (70% of 3TB) failure MTTR | One Host (70% of 30TB) failure MTTR
Colo (colocated)      | 11 hrs, 7 mins, 6 secs             | 19 hrs, 2 mins, 2 secs
NVMe                  | 1 hr, 35 mins, 20 secs             | 16 hrs, 46 mins, 3 secs

Journals on           | Disk failure (70% of 3TB) impact to client IOPS | Host failure (70% of 30TB) impact to client IOPS
Colo                  | 232.991 vs 210.08 (drop: 9.83%)                 | —
NVMe                  | 231.66 vs 194.13 (drop: 16.20%)                 | 231.66 vs 211.36 (drop: 8.76%)
Backfill and recovery config:
osd recovery max active = 3 (default: 15)
osd max backfills = 1 (default: 10)
osd recovery op priority = 3 (default: 10)

Server impact:
•  Shorter recovery time
Client impact:
•  <10% impact (tested without IO throttling; should be less with throttling)
LevelDB:
•  Key-value store for cluster metadata, e.g. osdmap, pgmap, monmap, clientID, authID, etc.
•  Not in the data path
•  Still impactful to IO operations: IO can be blocked by a DB query
•  Larger size means longer query times, hence longer IO waits -> slow requests
New BOM:
•  UCS C220 M4 with 120GB SSD
MON LevelDB on SSD
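Two quick checks help here; a sketch, assuming a monitor id of mon01 and the default data path:

   # Size of the monitor's LevelDB store on disk.
   du -sh /var/lib/ceph/mon/ceph-mon01/store.db
   # Trigger an on-demand compaction (compaction at startup can also be
   # enabled with 'mon compact on start = true' in ceph.conf).
   ceph tell mon.mon01 compact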
•  Problem
   •  Election storm & LevelDB inflation
•  Solutions
   •  Upgrade to 1.2.3 to fix the election storm
   •  Upgrade to 1.3.2 to fix LevelDB inflation
   •  Configuration change
MON Cluster Hardening
[mon]
mon_lease = 20 (default: 5)
mon_lease_renew_interval = 12 (default: 3)
mon_lease_ack_timeout = 40 (default: 10)
mon_accept_timeout = 40 (default: 10)
[client]
mon_client_hunt_interval = 40 (default: 3)
•  Problem
   •  High skew of %used across disks, which can prevent data intake even when overall cluster capacity allows it
•  Impact:
   •  Unbalanced PG distribution impacts performance
   •  Rebalancing is impactful as well
•  Solution: TBD
   •  Upgrade to Hammer 1.3.2+patch
   •  Re-weight by utilization: >10% delta
Data Distribution and Balance
[Chart: us-internal-1 disk % used across OSDs]
Cluster: 67.9% full
OSDs:
•  Min: 47.2%
•  Max: 83.5%
•  Mean: 69.6%
•  Stddev: 6.5
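Re-weighting is driven from the CLI; a sketch, where the threshold argument means only OSDs above 110% of average utilization (a >10% delta) are touched:

   # Lower the CRUSH weights of OSDs running more than 10% above average utilization.
   ceph osd reweight-by-utilization 110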
•  Problem
   •  RBD image data is distributed across all disks, so a single slow disk can impact critical data IO
•  Solution: proactively detect slow disks and mark the OSD out
   •  Seagate Cloud-Gazer: in progress
   •  Other research projects/publications
Proactive Detection of Slow Disks
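Until automated detection lands, outliers can be spotted and ejected by hand; a sketch (osd.123 is a hypothetical outlier):

   # Per-OSD commit/apply latency; a persistent outlier indicates a sick disk.
   ceph osd perf
   # Mark the slow OSD out so its PGs remap and clients stop queuing on it.
   ceph osd out 123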
•  Set Clear Stability Goals
•  You can plan for everything except how tenants will use it
•  Monitor Everything, but not “everything”
•  Turn down logging… It is possible to send 900k logs in 30 minutes
•  Look for issues in services that consume storage
•  Had 50TB of “deleted volumes” that weren’t
Lessons Learned
•  DevOps
•  It’s not just technology, it’s how your team operates as a team
•  Share knowledge
•  Manage your backlog and manage your management
•  Consistent performance and stability modeling
•  Rigorous testing
•  Determine requirements and architect to them
•  Balance performance, cost and time
•  Automate builds and rebuilds
•  Shortcuts create Technical Debt
Last Lesson…
Thank You
Yuming Ma: yumima@cisco.com
Seth Mason: setmason@cisco.com
