1. Designing for
High Performance Ceph at Scale
April 26, 2016
James Saint-Rossy - Principal Storage Engineer, Comcast
John Benton - Consulting Systems Engineer, WWT
2. Today’s Agenda
• Our Lab/Production Environment
• Holistic Architecture
• Strategies for Benchmarking
• Performance Bottlenecks/Lessons Learned
• Tuning Tips and Tricks
Designing for High Performance Ceph at Scale
3. Our Typical Node Configuration
Storage Node
• 72 × 6 TB SATA 7.2K RPM HDDs
• 3 × 1.6 TB PCIe NVMe drives (journals)
• 2 × Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (12 cores)
• 256 GB of RAM
• Dual-port 40GbE NIC
Mon/RGW Node
• 2 × Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
• 32 GB of RAM
• Dual-port 10GbE NIC
• ...Nothing Special
7. Strategies for Benchmarking
Tools
- fio for block
- COSBench for object
IOPS Isn’t Everything
-1000 workers may give you 30% more IOPS, but at the
cost of 600% higher latency
Verify Published Stats With Benchmarks
-… Always
Verify Scale-Out
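A block-level run with fio might look like the following; every flag, size, and path here is an illustrative assumption, not the presenters' actual job file. Sweep the queue depth and worker count while watching latency, not just IOPS:

```shell
# Hypothetical 4k random-write job against an RBD-backed file. All values
# (iodepth, numjobs, size, filename) are examples -- tune for your cluster.
FIO_OPTS="--name=randwrite-4k --ioengine=libaio --direct=1 --rw=randwrite \
--bs=4k --iodepth=32 --numjobs=8 --size=10G --runtime=60 --time_based \
--filename=/mnt/rbd/testfile --group_reporting"
# Print the command; run it for real only against a test cluster.
echo "fio $FIO_OPTS"
```

Repeating the same job at increasing --iodepth/--numjobs values is what exposes the kind of trade-off described above, where more workers buy a little throughput at a large latency cost.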
8. Performance - TCMalloc
• As cluster size increased, %SYS was increasingly taxed
• System profiling revealed up to 50% of CPU resources used by TCMalloc
• TCMalloc can be tuned to use a larger thread cache; this was good for nearly a
50% performance increase
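One common way to apply this (a sketch: the 128 MB value is a widely used setting for dense OSD nodes, not necessarily the presenters' exact number) is via the environment file Ceph's service scripts read:

```shell
# /etc/sysconfig/ceph (Debian/Ubuntu: /etc/default/ceph).
# Enlarge TCMalloc's aggregate thread cache; the stock default (32 MB)
# is far too small for a node running dozens of OSDs.
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
```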
11. OSD Data Workflow
"complicated situation" by bandinisonfire is licensed under CC BY-NC-SA 2.0
12. Performance - NUMA
• The bigger and faster the data node, the bigger the
bottleneck potential
• We tuned several areas to avoid unnecessary trips
across the QPI bus
• To map everything you must:
• Map CPU cores to sockets
• Map PCIE devices to sockets
• Map storage disks (and journals) to the associated
HBA
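On Linux, all three mappings can be read out of sysfs and lscpu; the PCI address and disk name below are placeholders, not real devices:

```shell
# Map CPU cores to sockets/NUMA nodes:
lscpu | grep -i "numa node"
# Map a PCIe device (HBA, NVMe journal, NIC) to its NUMA node.
# 0000:03:00.0 is an example address -- list yours with lspci:
cat /sys/bus/pci/devices/0000:03:00.0/numa_node 2>/dev/null \
  || echo "no such PCI device on this machine"
# Map a disk back to the PCI path (and hence the HBA) it hangs off:
readlink -f /sys/block/sda/device 2>/dev/null \
  || echo "no such disk on this machine"
```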
13. NUMA - IRQs
Pin the soft IRQs for each IO device to its associated NUMA
node
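A minimal sketch of the pinning, assuming a Mellanox NIC whose IRQs show up as "mlx4" in /proc/interrupts and a node-0 CPU list taken from lscpu (both are examples). Disable irqbalance first, or it will undo the pinning:

```shell
# CPUs on NUMA node 0 (check yours with "lscpu | grep -i numa"):
NODE0_CPUS="0-11,24-35"
# Steer every IRQ of the device to those CPUs (requires root):
for irq in $(grep mlx4 /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
    echo "$NODE0_CPUS" > "/proc/irq/$irq/smp_affinity_list"
done
```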
14. NUMA - Mount Points
Align mount points so that the OSD and journal are on the
same NUMA node
15. NUMA - OSD Processes
Pin OSD processes to the NUMA node associated with the
storage it controls
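A sketch of the pinning commands; the OSD id and CPU list are examples. The commands are echoed here so the sketch is safe to run anywhere; drop the echo to apply them as root on the OSD node:

```shell
# Start OSD 12 bound to NUMA node 0's CPUs and memory
# (pick the node its disks and journal NVMe map to):
echo numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -i 12
# Or re-pin every thread of an already-running OSD:
echo taskset -apc 0-11,24-35 '$(pidof -s ceph-osd)'
```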
16. Performance - General Tips
• Use the latest vendor drivers
-We have seen 30% improvements over stock drivers
• OS tuning focused on increasing threads, file handles,
etc.
• Jumbo frames help, particularly on the cluster network
• Watch for flow control issues with 40GbE network adapters
• Scan for failing (but perhaps not completely failed) disks
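The OS tuning above might look like the following fragment; every value is illustrative, not the presenters' exact setting:

```shell
# /etc/sysctl.d/90-ceph.conf -- apply with "sysctl --system".

# Headroom for the thread counts of ~72 OSDs per node:
kernel.pid_max = 4194303
# More file handles:
fs.file-max = 6553600
# Deeper async-IO queues for libaio-based OSD journals:
fs.aio-max-nr = 1048576

# Jumbo frames on the cluster network (per NIC, not a sysctl):
#   ip link set dev eth1 mtu 9000
# Scan for failing-but-not-dead disks:
#   smartctl -a /dev/sdX | grep -iE 'reallocated|pending|uncorrect'
```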
17. Designing for High Performance Ceph at Scale
"Question" by alphageek is licensed under CC BY-NC-SA 2.0
19. Performance - Mons
• Mons are generally a glorified TFTP server; you can
get away with 1+2 for redundancy
• That is, until they aren't...
• In certain situations, such as a large cluster rebalance or
deleting a pool with a lot of PGs, a single CPU core on *ALL*
mons can become jammed up. The mons then start evicting
each other and mayhem ensues.
• How to fix this: