3. Agenda
1 The Problem
2 XFS w/ Realtime Subvolumes
3 GlusterFS Application
4 Facebook Enhancements
5 Questions?
4. Quick Tangent
Some Perspective: Strengths
▪ The GlusterFS xlator architecture is elegant; don’t give this up…ever
▪ Tends to enforce good design on new developers
▪ Good isolation between components (with some cheating in some places)
▪ IO Latencies: We have yet to see anything in the open source realm come close to the metadata latencies we see on GlusterFS
▪ Simplicity: Very easy to set up, configure & manage.
▪ Open Source: Community is strong, growing with diverse use-cases
5. Quick Tangent
Some Perspective: Things we need to work on…
▪ Scaling: There continues to be a huge disparity between GlusterFS and its contemporaries; Ceph & HDFS both scale dramatically further.
▪ Code Quality: More commenting! This helps bring in more developers and eases code review.
▪ Lack of JBOD support: GlusterFS requires some form of RAID[5-6], which adds complexity and expense
▪ Drains/Rebuilds: Without use of some XFS tricks, this is still quite slow,
taking weeks vs. days.
6. The Problem
GlusterFS & Metadata
▪ Preface this…
▪ Leveraging underlying FS for metadata storage was a good decision
▪ Simplicity of design, reliability
▪ But this design choice isn’t without problems…
▪ Heavy reliance on metadata
▪ xattrs → AFR “journal”, xlator attributes for DHT, quota, etc.
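To make the xattr reliance concrete, here is a minimal sketch (Python; the brick path is hypothetical, and listing trusted.* attributes requires root) of inspecting the metadata GlusterFS stores on a brick-side file, typically trusted.gfid plus trusted.afr.* changelogs and trusted.glusterfs.dht layouts:

# Minimal sketch: dump the extended attributes GlusterFS stores on a brick
# backend path. Assumes Linux, Python 3, and root (trusted.* namespace);
# run against the brick directory directly, not the client mount.
import os
import sys

def dump_xattrs(path):
    # AFR changelogs and DHT layouts are packed binary, so print hex values.
    for name in os.listxattr(path):
        value = os.getxattr(path, name)
        print(f"{name} = {value.hex()}")

if __name__ == "__main__":
    # e.g. python3 dump_xattrs.py /data/brick0/dir/file   (hypothetical path)
    dump_xattrs(sys.argv[1])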
7. The Problem
GlusterFS & Metadata
▪ DHT directory operations “magnify” → a single “ls” turns into 1000’s of VFS system calls
(Diagram: “ls -l /foo” fans a readdir out to DHT subvols 0–2, each an AFR replica set across Bricks 1–9; per-brick latencies range from 2 ms to 50 ms, and the slowest sub-volume at 50 ms gates the entire listing. AFR/readdir optimizations are annotated.)
8. The Problem
Quantifying the Problem
▪ blktrace to examine physical block device I/Os
▪ Examine RWBS field (see sketch below)
▪ Example 1: CREATE heavy
▪ Blob-store-ish workload
▪ 51% metadata
(Chart: CREATE heavy workload)
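As a rough illustration of the methodology above, a minimal sketch (Python; the trace file and the choice to count only completion events are assumptions, not the exact analysis used here) of tallying the share of block I/Os whose RWBS field carries the metadata flag:

# Minimal sketch: estimate the metadata share of physical block I/O from
# blkparse text output, e.g. "blktrace -d /dev/sdb -o - | blkparse -i - > trace.txt".
# Counting only completion ('C') events is an assumption about methodology.
import sys

def metadata_share(trace_path):
    total = meta = 0
    with open(trace_path) as trace:
        for line in trace:
            fields = line.split()
            # Default blkparse format: dev cpu seq timestamp pid action RWBS ...
            if len(fields) < 7 or fields[5] != "C":
                continue
            total += 1
            if "M" in fields[6]:  # 'M' in the RWBS field marks a metadata I/O
                meta += 1
    return meta, total

if __name__ == "__main__":
    meta, total = metadata_share(sys.argv[1])
    pct = 100.0 * meta / max(total, 1)
    print(f"{meta}/{total} completed I/Os flagged metadata ({pct:.1f}%)")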
9. The Problem
Quantifying the Problem
▪ Example 2: MKDIR heavy workload
▪ Deep directory structures
▪ “FS as database” use-case
▪ 83% metadata
(Chart: MKDIR heavy workload)
10. The Problem
Traditional Solutions
▪ Page Cache
▪ Pro: Simplicity → just “works”; works with any storage system
▪ Cons: DRAM is expensive, limited space to cache objects; doesn’t help write-heavy use-cases.
▪ Dedicated Metadata Stores
▪ Pros: Good designs can scale well; works with write heavy workloads
▪ Cons: Added complexity → reliability, management overhead; maintaining consistency can be challenging; very specific to storage system
▪ Can we combine the best of both?
11. XFS w/ Realtime Subvol
Overview
(Diagram: a standard XFS filesystem keeps metadata, the intent log (journal), and data blocks on a single standard block device. An XFS filesystem with a realtime subvolume keeps metadata, the intent log, and ordinary data blocks on the standard block device, while realtime data blocks live on a separate realtime block device; the intent log is marked “Optional: can be moved.”)
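As a rough sketch of how such a layout might be assembled (device names, the mount point, and setting the rtinherit bit via xfs_io are illustrative assumptions, not the exact production recipe), the realtime subvolume is named at mkfs and mount time, and the inheritance bit steers new files onto it:

# Minimal sketch: build an XFS filesystem whose file data lands on a realtime
# subvolume (large HDD) while metadata + intent log stay on a fast SSD.
# /dev/sdX (SSD), /dev/sdY (HDD) and /mnt/brick0 are hypothetical.
import os
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Metadata, the intent log and "ordinary" data blocks go to the SSD;
# -r rtdev= attaches the HDD as the realtime subvolume.
run(["mkfs.xfs", "-f", "-r", "rtdev=/dev/sdY", "/dev/sdX"])

# The realtime device must also be named at mount time.
os.makedirs("/mnt/brick0", exist_ok=True)
run(["mount", "-o", "rtdev=/dev/sdY", "/dev/sdX", "/mnt/brick0"])

# Set realtime inheritance on the brick root so files created beneath it
# are allocated from the realtime (HDD) subvolume by default.
run(["xfs_io", "-c", "chattr +t", "/mnt/brick0"])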
14. GlusterFS Application
Benefits: Combines best of both worlds (mostly)
▪ Pros
▪ Simplicity → just “works”, no changes to GlusterFS core
▪ Works with any storage system
▪ Works with write- or read-heavy metadata workloads
▪ SSD-based file caching → trivial to implement → “.cache” directory (see the sketch after this list)
▪ Cons
▪ Does not scale independently of bricks
▪ Changes may require kernel patches
▪ Realtime allocator is single-thread optimized; this may present problems for some workloads → JBOD support?
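The “.cache” idea can be sketched as follows (paths are hypothetical, and clearing the rtinherit bit per-directory is an assumption about how such a cache area could be carved out): files created under a directory without the realtime-inheritance bit are allocated on the standard (SSD) device rather than the realtime subvolume, giving an SSD file cache with no GlusterFS changes.

# Minimal sketch: carve out an SSD-backed ".cache" directory on a brick whose
# root has the rtinherit bit set. Paths and flags are illustrative only.
import os
import subprocess

brick_root = "/mnt/brick0"                 # hypothetical brick mount
cache_dir = os.path.join(brick_root, ".cache")
os.makedirs(cache_dir, exist_ok=True)

# Clear realtime inheritance on .cache so files created inside it are
# allocated from the standard (SSD) device instead of the realtime HDD.
subprocess.run(["xfs_io", "-c", "chattr -t", cache_dir], check=True)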
15. Facebook Enhancements
Kernel Patches (Pending XFS maintainer review)
▪ statfs
▪ Return realtime device usage if the rtinherit flag is set → more intuitive (see the sketch after this list)
▪ rtallocsize
▪ Small (initial) allocations stored on realtime subvolume automatically
▪ rtfallbackpct
▪ Allocate data to realtime device if data block device (e.g. SSD) usage is
above rtfallbackpct
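To see why the statfs change matters, a quick check like the one below (Python; the mount point is hypothetical) reports what statfs-based tools such as df see. On an unpatched kernel the figures describe the standard (SSD) device; with the change and the rtinherit flag set, they would instead reflect the realtime (HDD) device that actually holds the file data.

# Minimal sketch: report capacity/available space as statfs sees it for an
# XFS mount backed by a realtime subvolume. /mnt/brick0 is hypothetical.
import os

st = os.statvfs("/mnt/brick0")
total_gib = st.f_blocks * st.f_frsize / 2**30
avail_gib = st.f_bavail * st.f_frsize / 2**30
print(f"capacity: {total_gib:.1f} GiB, available: {avail_gib:.1f} GiB")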