3. Agenda
1 The Problem
2 XFS w/ Realtime Subvolumes
3 GlusterFS Application
4 Facebook Enhancements
5 Questions?
4. Quick Tangent
Some Perspective: Strengths
▪ The GlusterFS xlator architecture is elegant; don’t give this up…ever
▪ Tends to enforce good design on new developers
▪ Good isolation between components (with some cheating in some places)
▪ IO Latencies: We have yet to see anything in the open source realm come close to the metadata latencies we see on GlusterFS
▪ Simplicity: Very easy to set up, configure & manage.
▪ Open Source: Community is strong, growing with diverse use-cases
5. Quick Tangent
Some Perspective: Things we need to work on…
▪ Scaling: There continues to be a huge disparity between GlusterFS and its contemporaries; Ceph & HDFS both scale dramatically further.
▪ Code Quality: More commenting! This helps bring in more developers and eases code review.
▪ Lack of JBOD support: GlusterFS requires some form of RAID[5-6], which adds complexity and expense
▪ Drains/Rebuilds: Without use of some XFS tricks, this is still quite slow,
taking weeks vs. days.
6. The Problem
GlusterFS & Metadata
▪ Preface this…
▪ Leveraging underlying FS for metadata storage was a good decision
▪ Simplicity of design, reliability
▪ But this design choice isn’t without problems…
▪ Heavy reliance on metadata
▪ xattrs → AFR “journal”, xlator attributes for DHT, quota, etc.
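To make the xattr reliance concrete, here is a minimal sketch (Python; the brick path is hypothetical, and listing trusted.* attributes requires root) of inspecting the metadata GlusterFS stores on a brick-side file, typically trusted.gfid plus trusted.afr.* changelogs and trusted.glusterfs.dht layouts:

# Minimal sketch: dump the extended attributes GlusterFS stores on a brick
# backend path. Assumes Linux, Python 3, and root (trusted.* namespace);
# run against the brick directory directly, not the client mount.
import os
import sys

def dump_xattrs(path):
    # AFR changelogs and DHT layouts are packed binary, so print hex values.
    for name in os.listxattr(path):
        value = os.getxattr(path, name)
        print(f"{name} = {value.hex()}")

if __name__ == "__main__":
    # e.g. python3 dump_xattrs.py /data/brick0/dir/file   (hypothetical path)
    dump_xattrs(sys.argv[1])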
7. The Problem
GlusterFS & Metadata
▪ DHT directory operations “magnify” → a single “ls” turns into 1000’s of VFS system calls
(Diagram: “ls -l /foo” fans a readdir out to DHT subvols 0–2, each an AFR replica set across Bricks 1–9; per-brick latencies range from 2 ms to 50 ms, and the slowest sub-volume at 50 ms gates the entire listing. AFR/readdir optimizations are annotated.)
8. The Problem
Quantifying the Problem
▪ blktrace to examine physical block device I/Os
▪ Examine RWBS field (see sketch below)
▪ Example 1: CREATE heavy
▪ Blob-store-ish workload
▪ 51% metadata
(Chart: CREATE heavy workload)
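As a rough illustration of the methodology above, a minimal sketch (Python; the trace file and the choice to count only completion events are assumptions, not the exact analysis used here) of tallying the share of block I/Os whose RWBS field carries the metadata flag:

# Minimal sketch: estimate the metadata share of physical block I/O from
# blkparse text output, e.g. "blktrace -d /dev/sdb -o - | blkparse -i - > trace.txt".
# Counting only completion ('C') events is an assumption about methodology.
import sys

def metadata_share(trace_path):
    total = meta = 0
    with open(trace_path) as trace:
        for line in trace:
            fields = line.split()
            # Default blkparse format: dev cpu seq timestamp pid action RWBS ...
            if len(fields) < 7 or fields[5] != "C":
                continue
            total += 1
            if "M" in fields[6]:  # 'M' in the RWBS field marks a metadata I/O
                meta += 1
    return meta, total

if __name__ == "__main__":
    meta, total = metadata_share(sys.argv[1])
    pct = 100.0 * meta / max(total, 1)
    print(f"{meta}/{total} completed I/Os flagged metadata ({pct:.1f}%)")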
9. The Problem
Quantifying the Problem
▪ Example 2: MKDIR heavy workload
▪ Deep directory structures
▪ “FS as database” use-case
▪ 83% metadata
(Chart: MKDIR heavy workload)
10. The Problem
Traditional Solutions
▪ Page Cache
▪ Pro: Simplicity → just “works”; works with any storage system
▪ Cons: DRAM is expensive, limited space to cache objects; doesn’t help write-heavy use-cases.
▪ Dedicated Metadata Stores
▪ Pros: Good designs can scale well; works with write heavy workloads
▪ Cons: Added complexity → reliability, management overhead; maintaining consistency can be challenging; very specific to storage system
▪ Can we combine the best of both?
11. XFS w/ Realtime Subvol
Overview
(Diagram: a standard XFS filesystem keeps metadata, the intent log (journal), and data blocks on a single standard block device. An XFS filesystem with a realtime subvolume keeps metadata, the intent log, and ordinary data blocks on the standard block device, while realtime data blocks live on a separate realtime block device; the intent log is marked “Optional: can be moved.”)
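As a rough sketch of how such a layout might be assembled (device names, the mount point, and setting the rtinherit bit via xfs_io are illustrative assumptions, not the exact production recipe), the realtime subvolume is named at mkfs and mount time, and the inheritance bit steers new files onto it:

# Minimal sketch: build an XFS filesystem whose file data lands on a realtime
# subvolume (large HDD) while metadata + intent log stay on a fast SSD.
# /dev/sdX (SSD), /dev/sdY (HDD) and /mnt/brick0 are hypothetical.
import os
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Metadata, the intent log and "ordinary" data blocks go to the SSD;
# -r rtdev= attaches the HDD as the realtime subvolume.
run(["mkfs.xfs", "-f", "-r", "rtdev=/dev/sdY", "/dev/sdX"])

# The realtime device must also be named at mount time.
os.makedirs("/mnt/brick0", exist_ok=True)
run(["mount", "-o", "rtdev=/dev/sdY", "/dev/sdX", "/mnt/brick0"])

# Set realtime inheritance on the brick root so files created beneath it
# are allocated from the realtime (HDD) subvolume by default.
run(["xfs_io", "-c", "chattr +t", "/mnt/brick0"])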
14. GlusterFS Application
Benefits: Combines best of both worlds (mostly)
▪ Pros
▪ Simplicity → just “works”, no changes to GlusterFS core
▪ Works with any storage system
▪ Works with write- or read-heavy metadata workloads
▪ SSD-based file caching → trivial to implement → “.cache” directory (see the sketch after this list)
▪ Cons
▪ Does not scale independently of bricks
▪ Changes may require kernel patches
▪ Realtime allocator is single-thread optimized; this may present problems for some workloads → JBOD support?
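The “.cache” idea can be sketched as follows (paths are hypothetical, and clearing the rtinherit bit per-directory is an assumption about how such a cache area could be carved out): files created under a directory without the realtime-inheritance bit are allocated on the standard (SSD) device rather than the realtime subvolume, giving an SSD file cache with no GlusterFS changes.

# Minimal sketch: carve out an SSD-backed ".cache" directory on a brick whose
# root has the rtinherit bit set. Paths and flags are illustrative only.
import os
import subprocess

brick_root = "/mnt/brick0"                 # hypothetical brick mount
cache_dir = os.path.join(brick_root, ".cache")
os.makedirs(cache_dir, exist_ok=True)

# Clear realtime inheritance on .cache so files created inside it are
# allocated from the standard (SSD) device instead of the realtime HDD.
subprocess.run(["xfs_io", "-c", "chattr -t", cache_dir], check=True)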
15. Facebook Enhancements
Kernel Patches (Pending XFS maintainer review)
▪ statfs
▪ Return realtime device usage if the rtinherit flag is set → more intuitive (see the sketch after this list)
▪ rtallocsize
▪ Small (initial) allocations stored on realtime subvolume automatically
▪ rtfallbackpct
▪ Allocate data to realtime device if data block device (e.g. SSD) usage is
above rtfallbackpct
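To see why the statfs change matters, a quick check like the one below (Python; the mount point is hypothetical) reports what statfs-based tools such as df see. On an unpatched kernel the figures describe the standard (SSD) device; with the change and the rtinherit flag set, they would instead reflect the realtime (HDD) device that actually holds the file data.

# Minimal sketch: report capacity/available space as statfs sees it for an
# XFS mount backed by a realtime subvolume. /mnt/brick0 is hypothetical.
import os

st = os.statvfs("/mnt/brick0")
total_gib = st.f_blocks * st.f_frsize / 2**30
avail_gib = st.f_bavail * st.f_frsize / 2**30
print(f"capacity: {total_gib:.1f} GiB, available: {avail_gib:.1f} GiB")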