optimizing_ceph_flash

  1. Investor Day 2014 | May 7, 2014. Optimizing Ceph for All-Flash Architectures. Vijayendra Shamanna (Viju), Senior Manager, Systems and Software Solutions
  2. Forward-Looking Statements: During our meeting today we may make forward-looking statements. Any statement that refers to expectations, projections or other characterizations of future events or circumstances is a forward-looking statement, including those relating to future technologies, products and applications. Actual results may differ materially from those expressed in these forward-looking statements due to factors detailed under the caption "Risk Factors" and elsewhere in the documents we file from time to time with the SEC, including our annual and quarterly reports. We undertake no obligation to update these forward-looking statements, which speak only as of the date hereof.
  3. What is Ceph?
     • A distributed/clustered storage system
     • File, object, and block interfaces to a common storage substrate
     • Ceph is a mostly LGPL open source project
       – Part of the Linux kernel and most distros since 2.6.x
  4. Ceph Architecture [diagram: multiple clients connect over an IP network to a set of storage nodes]
  5. Ceph Deployment Modes [diagram: a web server uses RGW on top of librados; a KVM guest OS uses the block interface via /dev/rbd, also backed by librados]
  6. Ceph Operation [diagram: a client/application uses an interface (block, object, file) over librados; each object is hashed to a placement group (PG), and placement rules map PGs onto OSDs; see the placement sketch below]
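The flow on this slide (object, hash, PG, placement rules, OSDs) can be sketched in a few lines of C++. This is a simplified illustration only: Ceph actually hashes object names with rjenkins, uses a stable modulo so pg_num can be changed gracefully, and maps PGs to OSDs with the CRUSH algorithm against the cluster map and placement rules. The hash, the stride used to pick OSDs, and the pool parameters below are stand-ins.

```cpp
// Simplified sketch of Ceph-style placement (illustration only; Ceph itself
// uses the rjenkins hash, a "stable mod", and the CRUSH algorithm rather than
// the plain std::hash, modulo, and stride used here).
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Map an object name to a placement group (PG) id.
uint32_t object_to_pg(const std::string& object_name, uint32_t pg_num) {
    uint32_t h = static_cast<uint32_t>(std::hash<std::string>{}(object_name));
    return h % pg_num;  // Ceph uses a stable modulo so pg_num can grow smoothly
}

// Map a PG to an ordered list of OSDs (primary first). Real Ceph walks the
// CRUSH map and placement rules; this stand-in just strides across OSD ids.
std::vector<int> pg_to_osds(uint32_t pg, int num_osds, int replicas) {
    std::vector<int> acting;
    for (int r = 0; r < replicas; ++r)
        acting.push_back((pg + r * 7) % num_osds);  // hypothetical spread
    return acting;
}

int main() {
    uint32_t pg = object_to_pg("rbd_data.1234.0000000000000001", 128);
    std::vector<int> osds = pg_to_osds(pg, 40, 3);
    std::cout << "pg " << pg << " -> osds";
    for (int id : osds) std::cout << " " << id;
    std::cout << "\n";
    return 0;
}
```

The useful property is that any client holding the cluster map can compute the same mapping locally, which is why there is no lookup service in the data path.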
  7. Ceph Placement Rules and Data Protection [diagram: replicas of PG[a], PG[b], and PG[c] are placed on different servers and racks]
  8. Ceph's Built-in Data Integrity & Recovery
     • Checksum-based "patrol reads"
       – Local read of the object, exchange of checksums over the network
     • Journal-based recovery
       – Short downtime (tunable)
       – Re-syncs only changed objects
     • Object-based recovery
       – Long downtime or partial storage loss
       – Object checksums exchanged over the network
  9. Enterprise Storage Features [three-column feature matrix covering Object Storage, Block Storage, and File System (marked "* Future"): Keystone authentication, geo-replication, erasure coding, striped objects, incremental backups, OpenStack integration, configurable striping, iSCSI, CIFS/NFS, Linux kernel support, S3 & Swift, multi-tenant RESTful interface, thin provisioning, copy-on-write clones, snapshots, dynamic rebalancing, distributed metadata, POSIX compliance]
  10. Software Architecture: Key Components of a System Built for Hyperscale
      • Storage clients
        – Standard interfaces to use the data (POSIX, device, Swift/S3, ...)
        – Transparent for applications
        – Intelligent, coordinates with peers
      • Object storage cluster (OSDs)
        – Stores all data and metadata in flexible-sized containers: objects
      • Cluster monitors
        – Lightweight processes for authentication, cluster membership, and critical cluster state
      • Clients authenticate with the monitors, then send IO directly to the OSDs for scalable performance (see the librados sketch below)
      • No gateways, brokers, lookup tables, indirection, ...
      [diagram: compute-side clients exchange block/object IO with the OSDs and cluster maps with the monitors]
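The "clients authenticate with monitors, direct IO to OSDs" point maps directly onto librados. The sketch below is a minimal client, assuming a reachable cluster, a default /etc/ceph/ceph.conf, a client.admin keyring, and an existing pool named rbd; error handling is reduced to early returns, and the object name and payload are arbitrary.

```cpp
// Minimal librados client sketch (C++ API). Assumes /etc/ceph/ceph.conf,
// a client.admin keyring, and an existing pool named "rbd".
#include <rados/librados.hpp>
#include <iostream>
#include <string>

int main() {
    librados::Rados cluster;
    if (cluster.init("admin") < 0) return 1;            // cephx user client.admin
    cluster.conf_read_file(nullptr);                     // default config search path
    if (cluster.connect() < 0) return 1;                 // authenticate via monitors,
                                                         // fetch cluster maps
    librados::IoCtx io;
    if (cluster.ioctx_create("rbd", io) < 0) return 1;   // bind to a pool

    librados::bufferlist bl;
    bl.append("hello ceph");
    // The write goes straight to the object's primary OSD; no gateway or broker.
    if (io.write_full("demo-object", bl) < 0) return 1;

    librados::bufferlist out;
    io.read("demo-object", out, bl.length(), 0);
    std::cout << std::string(out.c_str(), out.length()) << std::endl;

    io.close();
    cluster.shutdown();
    return 0;
}
```

Build and link against librados (for example, g++ example.cc -lrados).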
  11. Ceph Interface Ecosystem [diagram: native object access (librados), the block interface (librbd: kernel client, iSCSI), the RADOS Gateway (RGW: S3, Swift), and the file system (libcephfs: kernel device + FUSE, HDFS, Samba, NFS) all layered on RADOS (Reliable, Autonomous, Distributed Object Store); see the librbd sketch below]
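To show how the block interface layers on the same substrate, here is a hedged librbd sketch. It assumes an already connected librados::IoCtx (as in the previous example) bound to a pool suitable for RBD images; the image name, size, and payload are arbitrary.

```cpp
// Sketch: creating and writing an RBD image through librbd, which stripes the
// image over RADOS objects via librados. Assumes "io" is a connected
// librados::IoCtx; names and sizes are arbitrary.
#include <rados/librados.hpp>
#include <rbd/librbd.hpp>
#include <string>

int write_to_image(librados::IoCtx& io) {
    librbd::RBD rbd;
    int order = 0;                        // 0 lets librbd pick the default object size
    uint64_t size = 1ULL << 30;           // 1 GiB image
    if (rbd.create(io, "demo-image", size, &order) < 0) return -1;

    librbd::Image image;
    if (rbd.open(io, image, "demo-image") < 0) return -1;

    ceph::bufferlist bl;
    bl.append(std::string(4096, 'x'));    // one 4 KiB block
    if (image.write(0, bl.length(), bl) < 0) return -1;   // write at offset 0

    image.close();
    return 0;
}
```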
  12. OSD Architecture [diagram: network messages arrive at the Messenger and go through the Dispatcher to the OSD work queue and thread pools; the PG backend handles replication and erasure coding; the ObjectStore layer (JournallingObjectStore, FileStore, KeyValueStore, MemStore) applies transactions]
  13. Messenger Layer
      • Removed the Dispatcher and introduced a "fast path" mechanism for read/write requests (see the sketch below)
        – The same mechanism is now present on the client side (librados) as well
      • Fine-grained locking in the message transmit path
      • Introduced an efficient buffering mechanism for improved throughput
      • Configuration options to disable message signing, CRC checks, etc.
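The "fast path" bullet is about avoiding a queue hop: latency-sensitive read/write messages are handled directly on the connection's reader thread instead of being queued for a dispatcher thread. The sketch below is a schematic of that idea only, not Ceph's Messenger or Dispatcher code; the message type check and class names are stand-ins.

```cpp
// Schematic sketch of the "fast dispatch" idea: read/write ops skip the
// dispatcher queue and its thread wakeup; everything else still goes through
// the slow path.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>

struct Message {
    std::string type;     // e.g. "osd_op" for a client read/write
    std::string payload;
};

class MiniMessenger {
public:
    using Handler = std::function<void(const Message&)>;

    MiniMessenger(Handler fast, Handler slow) : fast_(fast), slow_(slow) {}

    // Called from the per-connection reader thread for every decoded message.
    void deliver(const Message& m) {
        if (m.type == "osd_op") {
            fast_(m);                     // fast path: no queueing, no context switch
        } else {
            std::lock_guard<std::mutex> l(lock_);
            queue_.push(m);               // slow path: dispatcher thread picks it up
            cond_.notify_one();
        }
    }

    // Body of the dispatcher thread for non-fast messages (maps, pings, ...).
    void dispatcher_loop() {
        std::unique_lock<std::mutex> l(lock_);
        for (;;) {
            cond_.wait(l, [&] { return !queue_.empty(); });
            Message m = queue_.front();
            queue_.pop();
            l.unlock();
            slow_(m);
            l.lock();
        }
    }

private:
    Handler fast_, slow_;
    std::mutex lock_;
    std::condition_variable cond_;
    std::queue<Message> queue_;
};
```

The trade-off is that fast-path handlers must stay short and non-blocking, since they run on the network reader thread.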
  14. OSD Request Processing
      • Running with the MemStore backend revealed a bottleneck in the OSD thread pool code
      • The OSD worker thread pool mutex was heavily contended
      • Implemented a sharded worker thread pool; requests are sharded by their PG (placement group) identifier (see the sketch below)
      • Configuration options to set the number of shards and the number of worker threads per shard
      • Optimized the OpTracking path (sharded queue, removed redundant locks)
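A sketch of the sharded worker thread pool idea follows. It is an illustration under simplifying assumptions (no shutdown path, no ordering guarantees beyond the per-shard FIFO), not Ceph's actual sharded thread pool code. The point is that hashing by PG id gives each shard its own lock, queue, and workers, so the single contended pool mutex disappears while requests for the same PG still execute in order.

```cpp
// PG-sharded work queue sketch: each shard has its own mutex, queue, and
// worker threads, so requests for different PGs never contend on one global
// lock. Shutdown handling is omitted for brevity.
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ShardedWorkQueue {
public:
    ShardedWorkQueue(unsigned num_shards, unsigned threads_per_shard)
        : shards_(num_shards) {
        for (auto& s : shards_)
            for (unsigned i = 0; i < threads_per_shard; ++i)
                s.workers.emplace_back([&s] { worker(s); });
    }

    // Enqueue an op for a PG; the PG id picks the shard.
    void queue(uint64_t pg_id, std::function<void()> op) {
        Shard& s = shards_[pg_id % shards_.size()];
        {
            std::lock_guard<std::mutex> l(s.lock);   // per-shard lock only
            s.ops.push(std::move(op));
        }
        s.cond.notify_one();
    }

private:
    struct Shard {
        std::mutex lock;
        std::condition_variable cond;
        std::queue<std::function<void()>> ops;
        std::vector<std::thread> workers;
    };

    static void worker(Shard& s) {
        for (;;) {
            std::unique_lock<std::mutex> l(s.lock);
            s.cond.wait(l, [&s] { return !s.ops.empty(); });
            auto op = std::move(s.ops.front());
            s.ops.pop();
            l.unlock();
            op();                                    // process the request
        }
    }

    std::vector<Shard> shards_;
};
```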
  15. FileStore Improvements
      • Eliminated backend storage from the picture by using a small workload (FileStore served data from the page cache)
      • Severe lock contention in the LRU FD (file descriptor) cache; implemented a sharded version of the LRU cache (see the sketch below)
      • The CollectionIndex (per-PG) object was being created on every IO request; implemented a cache for it, since PG info does not change often
      • Optimized the "object name to XFS file name" mapping function
      • Removed redundant snapshot-related checks in the parent read processing path
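The sharded FD cache follows the same pattern: split one LRU protected by one mutex into N independent LRUs selected by a hash of the object name. The sketch below is an illustration only; FileStore's real FD cache manages open file handles and closes evicted descriptors, which is only hinted at in a comment here, and the shard count and capacity are arbitrary parameters.

```cpp
// Sharded LRU file-descriptor cache sketch: threads working on different
// objects take different per-shard locks instead of serializing on one
// global cache mutex.
#include <functional>
#include <list>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

class ShardedFDCache {
public:
    ShardedFDCache(size_t num_shards, size_t per_shard_capacity)
        : shards_(num_shards), capacity_(per_shard_capacity) {}

    // Return the cached fd for an object, or nullopt on a miss.
    std::optional<int> lookup(const std::string& oid) {
        Shard& s = shard_for(oid);
        std::lock_guard<std::mutex> l(s.lock);
        auto it = s.index.find(oid);
        if (it == s.index.end()) return std::nullopt;
        s.lru.splice(s.lru.begin(), s.lru, it->second);   // move to MRU position
        return it->second->second;
    }

    // Insert (or refresh) an fd, evicting the least recently used entry if full.
    void insert(const std::string& oid, int fd) {
        Shard& s = shard_for(oid);
        std::lock_guard<std::mutex> l(s.lock);
        auto it = s.index.find(oid);
        if (it != s.index.end()) {
            it->second->second = fd;
            s.lru.splice(s.lru.begin(), s.lru, it->second);
            return;
        }
        if (s.lru.size() >= capacity_) {                  // evict the LRU entry
            s.index.erase(s.lru.back().first);            // (real code would close() the fd)
            s.lru.pop_back();
        }
        s.lru.emplace_front(oid, fd);
        s.index[oid] = s.lru.begin();
    }

private:
    struct Shard {
        std::mutex lock;
        std::list<std::pair<std::string, int>> lru;       // MRU at front
        std::unordered_map<std::string,
            std::list<std::pair<std::string, int>>::iterator> index;
    };

    Shard& shard_for(const std::string& oid) {
        return shards_[std::hash<std::string>{}(oid) % shards_.size()];
    }

    std::vector<Shard> shards_;
    size_t capacity_;
};
```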
  16. Inconsistent Performance Observation
      • Large performance variations on different pools across multiple clients
      • The first client after a cluster restart gets maximum performance, irrespective of the pool
      • Clients starting later see continued degraded performance
      • The issue is also observed on read I/O with unpopulated RBD images, which ruled out file system issues
      • Performance counters show up to a 3x increase in latency through the I/O path with no particular bottleneck
  17. Issue with TCMalloc
      • perf top shows a rapid increase in time spent in TCMalloc functions:
        – 14.75%  libtcmalloc.so.4.1.2  [.] tcmalloc::CentralFreeList::FetchFromSpans()
        –  7.46%  libtcmalloc.so.4.1.2  [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
      • I/O from a different client causes new threads in the sharded thread pool to process I/O
      • This causes memory movement between thread caches and increases alloc/free latency
      • JEMalloc and glibc malloc do not exhibit this behavior
      • A JEMalloc build option was added to Ceph Hammer
      • Setting the TCMalloc tunable TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to a larger value (64M) alleviates the issue
  18. Client Optimizations
      • Ceph, by default, turns Nagle's algorithm off
      • The RBD kernel driver ignored the TCP_NODELAY setting (see the sketch below)
      • Large latency variations at lower queue depths
      • Changes to the RBD driver were submitted upstream
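For context on the Nagle and TCP_NODELAY bullets, the snippet below shows the standard userspace way to disable Nagle on a TCP socket. The fix referenced on the slide was in the in-kernel RBD code, which sets the equivalent option on its kernel sockets, so this is an illustration of the mechanism rather than the actual patch.

```cpp
// Disabling Nagle's algorithm with TCP_NODELAY. With Nagle enabled, small
// writes can be delayed while waiting for ACKs, which shows up as latency
// spikes at low queue depths.
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int one = 1;
    // Send small messages immediately instead of coalescing them.
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0) {
        perror("setsockopt(TCP_NODELAY)");
        close(fd);
        return 1;
    }

    // ... connect() and exchange latency-sensitive messages here ...

    close(fd);
    return 0;
}
```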
  19. Emerging Storage Solutions (EMS): Results!
  20. Hardware Topology [diagram]
  21. 8K Random IOPS Performance, 1 RBD per Client [charts: IOPS and latency (ms), 1 LUN/client, 4 clients total, for queue depths 1/4/16 at read percentages 0/25/50/75/100, comparing Firefly and Giant]
  22. 64K Random IOPS Performance, 1 RBD per Client [charts: IOPS and latency (ms), 1 LUN/client, 4 clients total, for queue depths 1/4/16 at read percentages 0/25/50/75/100, comparing Firefly and Giant]
  23. 256K Random IOPS Performance, 1 RBD per Client [charts: IOPS and latency (ms), 1 LUN/client, 4 clients total, for queue depths 1/4/16 at read percentages 0/25/50/75/100, comparing Firefly and Giant]
  24. 8K Random IOPS Performance, 2 RBDs per Client [charts: IOPS and latency (ms), 2 LUNs/client, 4 clients total, for queue depths 1/4/16 at read percentages 0/25/50/75/100, comparing Firefly and Giant]
  25. SanDisk Emerging Storage Solutions (EMS): Our Goal for EMS
      • Expand the usage of flash for newer enterprise workloads (e.g. Ceph, Hadoop, content repositories, media streaming, clustered database applications)
      • ... through disruptive innovation: "disaggregation" of storage from compute
      • ... by building shared storage solutions
      • ... thereby allowing customers to lower TCO relative to HDD-based offerings by scaling storage and compute separately
      Mission: provide the best building blocks for "shared storage" flash solutions.
  26. SanDisk Systems & Software Solutions Building Blocks: Transforming Hyperscale
      • InfiniFlash™: capacity and high density for big data analytics, media services, content repositories; 512TB in 3U, 780K IOPS, SAS storage; compelling TCO through space and energy savings
      • ION Accelerator™: accelerates database, OLTP, and other high-performance workloads; 1.7M IOPS, 23 GB/s, 56 µs latency; lower costs with IT consolidation
      • FlashSoft®: server-side SSD-based caching to reduce I/O latency; improves application performance; maximizes storage infrastructure investment
      • ZetaScale™: lower TCO while retaining performance for in-memory databases; run multiple-TB data sets on SSDs vs. GB-scale data sets on DRAM; delivers up to 5:1 server consolidation, improving TCO
      [diagram: the portfolio spans from massive capacity to extreme performance for applications and servers/virtualization]
  27. Thank you! Questions? vijayendra.shamanna@sandisk.com
