That's Ceph, I use Ceph now, Ceph is Cool.
Who's the crazy guy speaking?
What about Ceph?
[Diagram: the Ceph stack. RBD (kernel and QEMU/KVM), RGW, and CephFS (kernel and FUSE) clients sit on top of librbd and libcephfs, which speak the Ceph Storage Cluster Protocol (librados) to the OSDs, Monitors, and MDSs.]
DISTRIBUTED EVERYTHING
CRUSH:
Hash Based
Deterministic Data Placement
Pseudo-Random, Weighted Distribution
Hierarchically Defined Failure Domains
ADVANTAGES:
Avoids Centralized Data Lookups
Even Data Distribution
Healing is Distributed
Abstracted Storage Backends
CHALLENGES:
Ceph Loves Homogeneity (Per Pool)
Ceph Loves Concurrency
Data Integrity is Expensive
Data Movement is Unavoidable
Distributed Storage is Hard!
BORING!
How fast can we go?
Let's test something Fun!
Supermicro SC847A 36-drive Chassis
2x Intel Xeon E5-2630L
4x LSI SAS9207-8i Controllers
24x 1TB 7200rpm spinning disks
8x Intel 520 SSDs
Bonded 10GbE Network
Total Cost: ~$12k
[Chart: Cuttlefish RADOS Bench 4M Object Throughput, 4 processes, 128 concurrent operations. Write and read throughput (MB/s) for BTRFS, EXT4, and XFS.]
Yeah, yeah, the bonded 10GbE network is maxed out. Good for you, Mark.
Who cares about RADOS Bench though?
I've moved to the cloud and do lots of small writes
on block storage.
OK, if Ceph is so awesome, why are you only testing one server? How does it scale?
Oak Ridge National Laboratory
4 Storage Servers, 8 Client Nodes
DDN SFA10K Storage Chassis
QDR Infiniband Everywhere
A Boatload of Drives!
[Chart: ORNL Multi-Server RADOS Bench Throughput, 4MB IOs, 8 client nodes. Write, read, and write-including-journals throughput (MB/s) versus number of server nodes (11 OSDs each), with the disk fabric and client network maximums marked.]
So RADOS is scaling nicely.
How much does data replication hurt us?
[Chart: ORNL 4MB RADOS Bench Throughput versus replication level (1 to 3). Write, read, and total write (including journals) throughput (MB/s).]
This is an HPC site. What about CephFS?
NOTE: CephFS is not production ready!
(Marketing and sales can now sleep again)
[Chart: ORNL 4M CephFS (IOR) Throughput. Max and average write and read throughput (MiB/s) versus number of client nodes (8 processes each), from 1 to 8 nodes.]
Hundreds of Cluster Configurations
Hundreds of Tunable Settings
Hundreds of Potential IO Patterns
Too Many Permutations to Test Everything!
When performance is bad, how do you diagnose?
Ceph Admin Socket
Collectl
Blktrace & Seekwatcher
perf
Where are we going from here?
More Testing and Bug Fixes!
Erasure Coding
Cloning from Journal Writes (BTRFS)
RSOCKETS/RDMA
Tiering
THANK YOU

Ceph Day NYC: Ceph Performance & Benchmarking

Editor's Notes

  1. Yes, that is a Cephalopod attacking a police box. Likeness to any existing objects, characters, or ideas is purely coincidental.
  2. My name is Mark and I work for a company called Inktank making an open source distributed storage system called Ceph. Before I started working for Inktank, I worked for the Minnesota Supercomputing Institute. My job was to figure out how to make our clusters run as efficiently as possible. A lot of researchers ran code on the clusters that didn't make good use of the expensive network fabrics those systems have. We worked to find better alternatives for these folks and ended up prototyping a high performance OpenStack cloud for research computing. The one piece that was missing was the storage solution. That's how I discovered Ceph.
  3. I had heard of Ceph through my work for the Institute. The original development was funded by a high performance computing research grant at Lawrence Livermore National Laboratory. It had been in development since 2004, but it was only around 2010 that I really started hearing about people starting to deploy storage with it. Ceph itself is an amazing piece of software. It lets you take commodity servers and turn them into a high performance, fault tolerant, distributed storage solution. It was designed to scale from the beginning and is made up of many distinct components.
  4. The primary building blocks of Ceph are the daemons that run on the nodes in the cluster. Ceph is composed of OSDs that store data and monitors that keep track of cluster health and state. When using Ceph as a distributed POSIX filesystem (CephFS), metadata servers may also be used. On top of these daemons are various APIs. Librados is the lowest-level API and can be used to interact with RADOS directly at the object level. Librbd and libcephfs provide block- and file-level API access to RBD and CephFS respectively. Finally, we have the high level block, object, and filesystem interfaces that make Ceph such a versatile storage solution.
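To make the librados layer a little more concrete, here is a minimal sketch using the python-rados bindings. It assumes python-rados is installed, that /etc/ceph/ceph.conf points at a reachable cluster, and that a pool named 'data' exists; none of that is specific to the clusters discussed in these notes.

# Minimal librados sketch: write one object and read it back.
# Assumes python-rados, a reachable cluster, and an existing pool named 'data'.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('data')          # open an IO context on a pool
    try:
        ioctx.write_full('hello_object', b'hello from librados')
        print(ioctx.read('hello_object'))       # read the object back
    finally:
        ioctx.close()
finally:
    cluster.shutdown()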
  5. If you take one thing away from this talk, it should be that Ceph is designed to be distributed. Any number of services can run on any number of nodes. You can have as many storage servers as you want. Cluster monitors are distributed and use the Paxos consensus algorithm to avoid split-brain scenarios when servers fail. For S3- or Swift-compatible object storage, you can distribute requests across multiple gateways. When CephFS (still beta!) is used, the metadata servers can be distributed across multiple nodes and store metadata by distributing it across all of the OSD servers. When talking about data distribution specifically though, the crowning achievement in Ceph is CRUSH.
  6. In many distributed storage systems there is some kind of centralized server that maintains an allocation table of where data is stored in the cluster. Not only is this a single point of failure, but it also can become a bottleneck as clients need to query this server to find out where data should be written or read. CRUSH does away with this. It is a hash based algorithm that allows any client to algorithmically determine where in the cluster data should be placed based on its name. Better yet, data is distributed across OSDs in the cluster pseudo-randomly. CRUSH also provides other benefits, like the ability to hierarchically define failure domains to ensure that replicated data ends up on different hardware.
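The core idea is easier to see in code. The sketch below is not CRUSH (it ignores hierarchy, failure domains, and stable rebalancing); it is just a toy weighted rendezvous hash showing how every client can independently compute the same placement from an object's name, with no lookup table. The OSD ids and weights are made up.

# Toy deterministic, weighted placement: an illustration of the idea behind
# CRUSH, not the real algorithm. Every client computes the same answer from
# the object name alone, so no central allocation table is needed.
import hashlib, math

OSDS = {0: 1.0, 1: 1.0, 2: 2.0, 3: 1.0}   # osd id -> weight (hypothetical)

def _score(name, osd_id, weight):
    digest = hashlib.md5(f"{name}:{osd_id}".encode()).hexdigest()
    u = (int(digest, 16) + 0.5) / float(16 ** 32)   # deterministic value in (0, 1)
    return -weight / math.log(u)                    # weighted rendezvous score

def place(name, replicas=3):
    """Return the OSD ids that should hold `name`, in preference order."""
    ranked = sorted(OSDS, key=lambda o: _score(name, o, OSDS[o]), reverse=True)
    return ranked[:replicas]

# Same output on every client, for any object name.
print(place("rbd_data.1234.0000000000000000"))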
  7. From a performance perspective, Ceph has a lot of benefits over many other distributed storage systems. There is no centralized server that can become a bottleneck for data allocation lookups. Data is well distributed across OSDs due to the pseudo-random nature of CRUSH. On traditional storage solutions a RAID array is used on each server. When a disk fails, the RAID array needs to be rebuilt, which causes a hotspot in the cluster that drags performance down, and RAID rebuilds can last a long time. Because data in Ceph is pseudo-randomly distributed, healing happens cluster-wide, which dramatically speeds up the recovery process.
  8. One of the challenges in any distributed storage system is what happens when you have hotspots in the cluster. If any one server is slow to fulfill requests, they will start backing up. Eventually a limit is reached where all outstanding requests will be concentrated on the slower server(s), starving the other, faster servers and potentially degrading overall cluster performance significantly. Likewise, distributed storage systems in general need a lot of concurrency to keep all of the servers and disks constantly working. From a performance perspective, another challenge regarding Ceph specifically is that Ceph works really hard to ensure data integrity. It does a full write of the data for every journal commit, does crc32 checksums for every data transfer, and regularly does background scrubs.
  9. You guys have been patient so far but if you are anything like me then your attention span is starting to wear thin about now. Email or slashdot could be looking appealing. So let's switch things up a little bit. We've talked about why Ceph is conceptually so amazing, but what can it really deliver as far as performance goes? That was the question we asked about a year ago after seeing some rather lacklustre results on some of our existing internal test hardware. Our director of engineering told me to go forth and build a system that would give us some insight into how Ceph could perform on (what in my opinion would be) an ideal setup.
  10. One of the things that we noticed from our previous testing is that some systems are harder to get working well than others. One potential culprit appeared to be that some expander backplanes may not behave entirely properly. For this system, we decided to skip expanders entirely and directly connect each drive in the system to its own controller SAS lane. That means that with 24 spinning disks and 8 SSDs we needed 4 dual-port controllers to connect all of the drives. With so many disks in this system, we'd need a lot of external network connectivity and opted for a bonded 10GbE setup. We only have a single client, which could be a bottleneck, but at that point we were just hoping to break 1GB/s to start out with, which seemed feasible. So were we able to do it?
  11. What you are looking at is a chart showing the write and read throughput on our test platform using our RADOS bench tool. This tool directly utilizes librados to write objects out as fast as possible and after writes complete, read them back. We're doing syncs and flushes between the tests on both the client and server and have measured the underlying disk throughput to make sure the results are accurate. What you are seeing here is that for writes, we not only hit 1GB/s, but are in fact hitting 2GB/s and maxing out the bonded 10GbE link. Reads aren't quite saturating the network but are coming pretty close. This is really good news because it means that with the right hardware, you can build Ceph nodes that can perform extremely well.
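For anyone who wants to run a similar test, the numbers above came from the stock rados bench tool. The sketch below shows one plausible way to reproduce the "4 processes, 128 concurrent operations" setup: four concurrent write runs with 32 in-flight 4MB operations each. The pool name and duration are placeholders, and the flags should be checked against your own rados build.

# Launch 4 concurrent `rados bench` write runs (32 in-flight ops each, 128
# total), roughly matching the "4 processes, 128 concurrent operations" setup.
# Pool name and duration are placeholders.
import subprocess

POOL, SECONDS = "rados-bench-test", 60

procs = [
    subprocess.Popen([
        "rados", "bench", "-p", POOL, str(SECONDS), "write",
        "-b", str(4 * 1024 * 1024),   # 4MB objects
        "-t", "32",                   # 32 concurrent ops per process
        "--no-cleanup",               # keep objects so a read pass can follow
    ])
    for _ in range(4)
]
for p in procs:
    p.wait()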
  12. I like to show the previous slide because it makes a big impact and I get to feel vindicated regarding spending a bunch of our startup money on new toys. And who doesn't like seeing a single server able to write out 2GB/s+ of data? The problem though is that reading and writing out 4MB objects directly via librados isn't necessarily a great representation of how Ceph will really perform once you layer block or S3 storage on top of it.
  13. Say for instance that you are using Ceph to provide block storage via RBD for OpenStack and have an application that does lots of 4K writes. Testing RADOS Bench with 4K objects might give you some rough idea of how RBD performs with small IOs, but it's also misleading. One of the things that you might not know is that RBD stores each block in a 4MB object behind the scenes. Doing 4K writes to 4MB objects results in different behavior than writing out distinct 4K objects. Throw in the writeback cache implementation in QEMU RBD and things get complicated very fast. Ultimately you really do have to do the tests directly on RBD to know what's going on.
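A quick worked example of that mapping: since an RBD image is striped across 4MB objects, the object a 4K write lands in is just integer division on the byte offset. The sketch below only illustrates the mapping; the 4MB figure is the default object size mentioned above.

# Which 4MB RADOS object does a 4K block-device write land in?
OBJECT_SIZE = 4 * 1024 * 1024          # default RBD object size (4MB)
BLOCK = 4 * 1024                       # the 4K write size discussed above

def object_index(offset):
    """RADOS object number backing the byte at `offset` in the RBD image."""
    return offset // OBJECT_SIZE

# 1024 sequential 4K writes all land in object 0 ...
print({object_index(i * BLOCK) for i in range(1024)})            # {0}
# ... while 1024 writes spread 1MB apart across a 1GB image touch 256 objects.
print(len({object_index(o) for o in range(0, 2**30, 2**20)}))    # 256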
  14. ...And we've done that. In alarming detail. What you are seeing here is a comparison of sequential and random 4K writes using a really useful benchmarking tool called fio. We are testing Kernel RBD and QEMU RBD at differing concurrency and IO depth values. What you may notice is that the scaling behavior and throughput looks very different in each case. QEMU RBD with the writeback cache enabled is performing much better. Interestingly RBD cache not only helps sequential writes, but seems to help with random writes too (especially at low levels of concurrency). This is one example of the kind of testing we do, but we have thousands of graphs like this exploring different workloads and system configurations. It's a lot of work!
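For reference, a job file in the spirit of the kernel RBD side of those tests might look like the sketch below. It assumes an RBD image is already mapped at /dev/rbd0; that path, the queue depth, and the runtime are placeholders rather than the exact jobs behind our charts, and the QEMU RBD cases obviously run inside a guest instead.

; Hedged fio sketch for 4K random writes against a mapped kernel RBD device.
; /dev/rbd0, iodepth, and runtime are placeholders, not the exact jobs we ran.
[krbd-4k-randwrite]
filename=/dev/rbd0
ioengine=libaio
direct=1
rw=randwrite
bs=4k
iodepth=16
runtime=60
time_based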
  15. We've gotten a lot of interesting results from our high performance test node and published quite a bit of those results in different articles on the web. Unfortunately we only have one of those nodes, and our readers have started asking for tests showing how performance scales across nodes. Do you remember what I said earlier about Ceph loving homogeneity and lots of concurrency? The truth is that the more consistently your hardware behaves, especially over time, the better your cluster is going to scale. Since I like to show good results instead of bad ones, let's take a look at an example of a cluster that's scaling really well.
  16. About a year and a half ago Oak Ridge National Laboratory (ORNL) reached out to Inktank to investigate how Ceph performs on a high performance storage system they have in their lab. This platform is a bit different from the typical platform that we deploy Ceph on. It has a ton of drives (over 400!) configured into RAID LUNs, housed in chassis connected to the servers via QDR InfiniBand links. As a result, the back-end storage maxes out at about 12GB/s but has pretty consistent performance characteristics since there are so many drives behind the scenes. Initially the results ORNL was seeing were quite bad. We worked together with them to find optimal hardware settings and did a lot of tuning and testing. By the time we were done, performance had improved quite a bit.
  17. This chart shows performance on the ORNL cluster in its final configuration using RADOS bench. Notice that throughput is scaling fairly linearly as we add storage nodes to the cluster. The blue and red lines represent write and read performance respectively. If you just look at the write performance, the results might seem disappointing. Ceph, however, is doing full data writes to the journals to guarantee the atomicity of its write operations. Normally in high performance situations we get around this by putting journals on SSDs, but this solution unfortunately doesn't have any. Another limitation is that the client network is using IPoIB, and on this hardware that means the clients will never see more than about 10GB/s aggregate throughput. Despite these limitations, we are scaling well and throughput to the storage chassis is pretty good!
  18. All of the results I've shown so far have been designed to showcase how much throughput we can push from the client and are not using any kind of replication. On the ORNL hardware this is probably justifiable because they are using RAID5 arrays behind the scenes and for some solutions like HPC scratch storage, running without replication may be acceptable. For a lot of folks Ceph's seamless support for replication is what makes it so compelling. So how much does replication hurt?
  19. As you might expect, replication has a pretty profound impact on write performance. Between doing journal writes and 3x replication, we see that client write performance is over 6 times slower than the actual write speed to the DDN chassis. What is probably more interesting is how the total write throughput changes. Going from 1x to 2x replication lowers the overall write performance by about 15-20%. When Ceph writes data to an OSD, the data must be written to the journal (BTRFS is a special case), and to the replica OSD's journal before the acknowledgement can be sent to the client. This not only results in extra data being written, but extra latency for every write. Read performance remains high regardless of replication.
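The "over 6 times slower" figure is mostly simple arithmetic: with journals on the same devices, each client byte is written twice per replica, so the back end absorbs two times the replication level in bytes for every client byte. A back-of-the-envelope sketch, using the roughly 12GB/s chassis limit mentioned earlier and ignoring latency and other overheads:

# Back-of-the-envelope write amplification: journal write + data write per replica.
# Ignores latency, metadata, and network overheads, so real numbers come in lower.
BACKEND_MAX = 12.0          # GB/s, approximate DDN chassis write limit

def best_case_client_writes(replication, backend=BACKEND_MAX, journal_copies=2):
    """Upper bound on aggregate client write throughput (GB/s)."""
    return backend / (replication * journal_copies)

for r in (1, 2, 3):
    print(f"replication {r}x: <= {best_case_client_writes(r):.1f} GB/s to clients")
# replication 3x: <= 2.0 GB/s, i.e. roughly 6x below the raw back-end speed.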
  20. So again I've shown a bunch of RADOS bench results, but that's not what people ultimately care about. For high performance computing, customers really want CephFS: Our distributed POSIX filesystem. Before we go on, let me say that our block and object layers are production ready, but CephFS is still in beta. It's probably the most complex interface we have on top of Ceph and there are still a number of known bugs. We've also done very little performance tuning, so when we started this testing we were pretty unsure about how it would perform.
  21. To test CephFS, we settled on a tool that is commonly used in the HPC space called IOR. It coordinates IO on multiple client nodes using MPI and has many options that are useful for testing high performance storage systems. When we first started testing CephFS, the performance was lower than we hoped. Through a series of tests, profiling, and investigation, we were able to tweak the configuration to produce the results you see here. With 8 client nodes, writes are nearly as high as what we saw with RADOS bench, but reads have topped out lower. We are still seeing some variability in the results and have more tuning to do, but are happy with the performance we've been able to get so far given CephFS's level of maturity.
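For context, an IOR run in the spirit of these tests might be launched like the sketch below: 8 client nodes times 8 processes, 4MB transfers, file per process against a CephFS mount. The hostfile, mount point, and per-process size are placeholders, not the exact job we ran.

# Hedged sketch of an IOR run similar in shape to the CephFS tests above:
# 64 MPI ranks (8 nodes x 8 processes), 4MB transfers, file per process.
# The hostfile, mount point, and sizes are placeholders.
import subprocess

subprocess.run([
    "mpirun", "-np", "64", "--hostfile", "clients.txt",
    "ior",
    "-a", "POSIX",          # POSIX API against the CephFS mount
    "-w", "-r",             # write phase, then read phase
    "-t", "4m",             # 4MB transfer size
    "-b", "1g",             # amount written per process
    "-F",                   # file per process
    "-o", "/mnt/cephfs/ior-testfile",
], check=True)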
  22. The results that we've shown thus far are only a small sample of the barrage of tests that we run. We have hundreds, if not thousands of graphs and charts showcasing performance of Ceph on different hardware, and serving different kinds of IO. Given how open Ceph is, this is ultimately going to be a losing battle. There are too many platforms, too many applications, and too many ways performance can be impacted to capture them all.
  23. So given that you can't catch everything ahead of time, what do you do when cluster performance is lower than you expected? First, make a pot of coffee because you may be in for a long night. It's going to take some blood, sweat and tears, but luckily some other folks have paved the way and developed some very useful tools that can make the job a lot easier.
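Of those tools, the admin socket is the most Ceph-specific, so here is a hedged sketch of pulling a daemon's performance counters from a script. It simply shells out to 'ceph --admin-daemon <socket> perf dump'; the socket path shown is the usual default and the counter names vary between releases.

# Dump an OSD's internal performance counters via its admin socket.
# The socket path is a typical default; adjust for your deployment.
import json, subprocess

SOCKET = "/var/run/ceph/ceph-osd.0.asok"

raw = subprocess.check_output(["ceph", "--admin-daemon", SOCKET, "perf", "dump"])
counters = json.loads(raw)

# Top-level counter groups (names vary by daemon type and Ceph release).
print(sorted(counters))
# Drill into one group, e.g. the OSD op counters if present.
print(json.dumps(counters.get("osd", {}), indent=2))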
  24. Ha, ran out of time. Slide notes end here. :)