Future of CephFS
Sage Weil
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
[Diagram labels above the components: APP, APP, HOST/VM, CLIENT]
[Diagram: a CLIENT with separate data and metadata paths into the cluster; nodes labeled M]
Metadata Server
• Manages metadata for a POSIX-compliant shared filesystem
  • Directory hierarchy
  • File metadata (owner, timestamps, mode, etc.)
• Stores metadata in RADOS
• Does not serve file data to clients
• Only required for shared filesystem
legacy metadata storage
● a scaling disaster
● name → inode → block list → data
● no inode table locality
● fragmentation
  – inode table
  – directory
● many seeks
● difficult to partition
[Diagram: directory tree with entries usr, etc, var, home, vmlinuz, passwd, mtab, hosts, lib, include, bin, …]
ceph fs metadata storage
● block lists unnecessary
● inode table mostly useless
  ● APIs are path-based, not inode-based
  ● no random table access, sloppy caching
● embed inodes inside directories
  ● good locality, prefetching
  ● leverage key/value object
[Diagram: the same directory tree (usr, etc, var, home, vmlinuz, passwd, mtab, hosts, lib, include, bin, …) with labels 1, 100, and 102]
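To make the "embed inodes inside directories" idea concrete, here is a minimal Python sketch. It is not Ceph's on-disk format: the `Inode` and `DirObject` classes, the field names, and the inode numbers are all made up for illustration. It only shows why one key/value object per directory gives a combined readdir-plus-stat in a single read.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    """Hypothetical inode record; real CephFS inodes carry far more state."""
    ino: int
    mode: int
    size: int = 0


@dataclass
class DirObject:
    """One directory modeled as a single key/value object: name -> embedded inode."""
    ino: int
    entries: dict = field(default_factory=dict)

    def link(self, name, inode):
        # The inode lives inside the directory entry, not in a separate table.
        self.entries[name] = inode

    def readdir_plus(self):
        # A single object read yields names and inodes together.
        return sorted(self.entries.items())


etc = DirObject(ino=100)  # illustrative number only
etc.link("passwd", Inode(ino=201, mode=0o100644, size=1024))
etc.link("mtab", Inode(ino=202, mode=0o100644, size=512))
etc.link("hosts", Inode(ino=203, mode=0o100644, size=256))

for name, inode in etc.readdir_plus():
    print(name, inode.ino, oct(inode.mode), inode.size)
```

Because the inodes travel with the directory entries, listing a directory prefetches everything a following `stat` would need, which is the locality benefit the slide points at.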
controlling metadata io
● view ceph-mds as cache
● reduce reads
  – dir+inode prefetching
● reduce writes
  – consolidate multiple writes
● large journal or log
  ● stripe over objects
  ● two tiers
    – journal for short term
    – per-directory for long term
  ● fast failure recovery
[Diagram: journal, directories]
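The two-tier idea (a journal for the short term, per-directory objects for the long term) can be sketched as a toy in Python. Everything here is an assumption made for illustration, including the `MetadataCache` class and its methods; it is not how ceph-mds is implemented. It only shows how journaling first lets many updates collapse into one write per directory.

```python
class MetadataCache:
    """Toy two-tier store: an append-only journal plus per-directory objects."""

    def __init__(self):
        self.journal = []        # short-term tier: sequential log of updates
        self.dir_objects = {}    # long-term tier: directory -> {name: attrs}
        self.backing_writes = 0  # writes issued to the per-directory tier

    def update(self, directory, name, attrs):
        # Every change is appended to the journal first (cheap, sequential).
        self.journal.append((directory, name, attrs))

    def flush(self):
        # Consolidate: keep only the latest update per (directory, name),
        # then write each dirty directory object once.
        latest = {}
        for directory, name, attrs in self.journal:
            latest.setdefault(directory, {})[name] = attrs
        for directory, entries in latest.items():
            self.dir_objects.setdefault(directory, {}).update(entries)
            self.backing_writes += 1
        self.journal.clear()


cache = MetadataCache()
for size in range(1, 101):
    cache.update("/home/alice", "log.txt", {"size": size})
cache.update("/home/alice", "notes.txt", {"size": 7})
cache.flush()
print(cache.backing_writes, cache.dir_objects["/home/alice"]["log.txt"])
# prints: 1 {'size': 100}
```

In this toy, 101 journaled updates turn into a single directory write. After a crash, replaying the journal would rebuild the same state, which is one way to read the slide's "fast failure recovery".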
one tree
three metadata servers
??
load distribution
● coarse (static subtree)
  ● preserve locality
  ● high management overhead
● fine (hash)
  ● always balanced
  ● less vulnerable to hot spots
  ● destroy hierarchy, locality
● can a dynamic approach capture benefits of both extremes?
[Diagram: a spectrum from good locality to good balance: static subtree, hash directories, hash files]
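The trade-off between the two extremes can be seen in a small Python comparison. The path list, the three-server count, the manual `table`, and the choice of CRC32 as the hash are all assumptions for illustration; the slide does not name a hash function.

```python
import zlib
from collections import Counter

PATHS = [
    "/home/alice/a.txt", "/home/alice/b.txt", "/home/alice/c.txt",
    "/home/bob/x.txt", "/usr/lib/libc.so", "/usr/bin/ls",
    "/etc/passwd", "/etc/hosts",
]
NUM_MDS = 3


def static_subtree(path, table):
    # Coarse: an administrator assigns each top-level directory to a server.
    top = "/" + path.split("/")[1]
    return table[top]


def hash_file(path, n):
    # Fine: hash the full path, so siblings may scatter across servers.
    return zlib.crc32(path.encode()) % n


table = {"/home": 0, "/usr": 1, "/etc": 2}  # hypothetical manual assignment
static = {p: static_subtree(p, table) for p in PATHS}
hashed = {p: hash_file(p, NUM_MDS) for p in PATHS}

print("static load:", Counter(static.values()))
print("hashed load:", Counter(hashed.values()))
```

With the static table, everything under `/home` stays on one server (good locality) but that server carries half the paths. With hashing, the load evens out as the file count grows, but files in the same directory no longer share a server.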
DYNAMIC SUBTREE PARTITIONING
● scalable
  ● arbitrarily partition metadata
● adaptive
  ● move work from busy to idle servers
  ● replicate hot metadata
● efficient
  ● hierarchical partition preserves locality
● dynamic
  ● daemons can join/leave
  ● take over for failed nodes
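A rough Python sketch of the "move work from busy to idle servers" step follows. The `rebalance` function, its selection rule, and the load numbers are hypothetical; the actual ceph-mds balancer is more involved and is not described on the slide.

```python
def rebalance(assignment, load, servers):
    """Move one subtree from the busiest server to the idlest one.

    assignment: subtree path -> server id
    load:       subtree path -> request rate (arbitrary positive units)
    servers:    all server ids, including ones that own nothing yet
    """
    per_server = {s: 0 for s in servers}
    for subtree, server in assignment.items():
        per_server[server] += load[subtree]

    busiest = max(per_server, key=per_server.get)
    idlest = min(per_server, key=per_server.get)
    gap = per_server[busiest] - per_server[idlest]

    # Only move a subtree smaller than the gap, so the two servers end up
    # closer together than they started.
    movable = [s for s, srv in assignment.items()
               if srv == busiest and load[s] < gap]
    if not movable:
        return dict(assignment)

    pick = max(movable, key=lambda s: load[s])
    updated = dict(assignment)
    updated[pick] = idlest
    return updated


assignment = {"/home": 0, "/usr": 0, "/etc": 0, "/var": 1}
load = {"/home": 50, "/usr": 20, "/etc": 5, "/var": 10}
after = rebalance(assignment, load, servers=[0, 1, 2])
print(after)
```

Here server 0 starts with a load of 75 and server 2 with none, so `/home` migrates to server 2. The subtree moves as a unit, which keeps its contents together on one server.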
Dynamic partitioning
many directories same directory
Failure recovery
Metadata replication and availability
Metadata cluster scaling
client protocol
● highly stateful
  ● consistent, fine-grained caching
● seamless hand-off between ceph-mds daemons
  ● when client traverses hierarchy
  ● when metadata is migrated between servers
● direct access to OSDs for file I/O
an example (RT = round trip)
● mount -t ceph 1.2.3.4:/ /mnt
  ● 3 ceph-mon RT
  ● 2 ceph-mds RT (1 ceph-mds to ceph-osd RT)
● cd /mnt/foo/bar
  ● 2 ceph-mds RT (2 ceph-mds to ceph-osd RT)
● ls -al
  ● open
  ● readdir
    – 1 ceph-mds RT (1 ceph-mds to ceph-osd RT)
  ● stat each file
  ● close
● cp * /tmp
  ● N ceph-osd RT
[Diagram: ceph-mon, ceph-mds, ceph-osd]
recursive accounting
● ceph-mds tracks recursive directory stats
  ● file sizes
  ● file and directory counts
  ● modification time
● virtual xattrs present full stats
● efficient
$ ls -alSh | head
total 0
drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
drwxr-xr-x 1 mcg_test1  pg2419992  23G 2011-02-02 08:57 mcg_test1
drwx--x--- 1 luko       adm        19G 2011-01-21 12:17 luko
drwx--x--- 1 eest       adm        14G 2011-02-04 16:29 eest
drwxr-xr-x 1 mcg_test2  pg2419992 3.0G 2011-02-02 09:34 mcg_test2
drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
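The reason a directory can report its recursive size without a tree walk can be sketched in Python. The `Dir` class and the `rbytes` / `rfiles` counters are illustrative stand-ins (on a real mount, these figures are exposed through virtual xattrs such as `ceph.dir.rbytes`). The toy updates every ancestor eagerly, which is a simplifying assumption; the slide does not say how ceph-mds propagates the counters.

```python
class Dir:
    """Toy directory that keeps recursive totals for everything beneath it."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.rbytes = 0  # recursive size in bytes
        self.rfiles = 0  # recursive file count

    def add_file(self, size):
        # Apply the delta to every ancestor instead of rescanning the tree.
        node = self
        while node is not None:
            node.rbytes += size
            node.rfiles += 1
            node = node.parent


root = Dir("/")
home = Dir("home", root)
alice = Dir("alice", home)
bob = Dir("bob", home)

alice.add_file(4096)
alice.add_file(1024)
bob.add_file(512)

print(root.rbytes, home.rfiles, alice.rbytes)
# prints: 5632 3 5120
```

Reading a directory's total is then a constant-time lookup, which is what lets `ls` show sizes such as 9.7T for a directory in the listing above.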
snapshots
● volume or subvolume snapshots unusable at petabyte scale
  ● snapshot arbitrary subdirectories
● simple interface
  ● hidden '.snap' directory
  ● no special tools
$ mkdir foo/.snap/one # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776 # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls -F foo/.snap/one
myfile bar/
$ rmdir foo/.snap/one # remove snapshot
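The mangled name in the listing above can be taken apart with a few lines of Python. This is a sketch under an assumption: that the name has the shape `_<snapshot name>_<number>`, and that the trailing number identifies the directory where the snapshot was taken. The slide only says the name is mangled, so treat that reading as unconfirmed. `parse_inherited_snap` is a hypothetical helper, not part of any Ceph tool.

```python
def parse_inherited_snap(name):
    """Split a mangled name such as '_one_1099511627776' into (snap, number)."""
    if not name.startswith("_"):
        raise ValueError("not a mangled snapshot name: %r" % name)
    # Split on the last underscore so snapshot names containing '_' survive.
    snap, _, number = name[1:].rpartition("_")
    return snap, int(number)


snap, number = parse_inherited_snap("_one_1099511627776")
print(snap, hex(number))
# prints: one 0x10000000000
```

The number from the slide is exactly 2^40, which at least fits the idea that it is an identifier rather than a timestamp or a counter.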
multiple client implementations
● Linux kernel client
  ● mount -t ceph 1.2.3.4:/ /mnt
  ● export (NFS), Samba (CIFS)
● ceph-fuse
● libcephfs.so
  ● your app
  ● Samba (CIFS)
  ● Ganesha (NFS)
  ● Hadoop (map/reduce)
[Diagram: the kernel client; ceph-fuse; and libcephfs beneath your app, Samba (SMB/CIFS), Ganesha (NFS), and Hadoop]
[Architecture slide repeated with status labels: RADOS, LIBRADOS, RBD, and RADOSGW are each marked AWESOME; CEPH FS is marked NEARLY AWESOME]
Path forward
● Testing
  ● Various workloads
  ● Multiple active MDSs
● Test automation
  ● Simple workload generator scripts
  ● Bug reproducers
● Hacking
  ● Bug squashing
  ● Long-tail features
● Integrations
  ● Ganesha, Samba, *stacks
hard links?
● rare
● useful locality properties
  ● intra-directory
  ● parallel inter-directory
● on miss, file objects provide per-file backpointers
  ● degenerates to log(n) lookups
  ● optimistic read complexity
what is journaled
● lots of state
  ● journaling is expensive up-front, cheap to recover
  ● non-journaled state is cheap, but complex (and somewhat expensive) to recover
● yes
  ● client sessions
  ● actual fs metadata modifications
● no
  ● cache provenance
  ● open files
● lazy flush
  ● client modifications may not be durable until fsync() or visible by another client
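The "lazy flush" point has a practical consequence for applications: a write that must survive a crash needs an explicit `fsync()`. The Python sketch below shows the usual pattern with the standard library. The `durable_write` helper and the file name are made up for illustration, and the example runs against a temporary directory rather than a CephFS mount.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data and block until the file system reports it durable."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the OS
        os.fsync(f.fileno())  # ask the file system to make it durable


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "result.dat")
    durable_write(target, b"checkpoint")
    with open(target, "rb") as f:
        contents = f.read()

print(contents)
```

Without the `os.fsync` call, the data may still sit in a client-side cache when the function returns, which is the window the slide describes.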

Ceph Day NYC: The Future of CephFS


Editor's Notes

  1. Finally, let’s talk about Ceph FS. Ceph FS is a parallel filesystem that provides a massively scalable, single-hierarchy, shared disk. If you use a shared drive at work, this is the same thing except that the same drive could be shared by everyone you’ve ever met (and everyone they’ve ever met).
  2. Remember all that metadata we talked about in the beginning? Feels so long ago. It has to be stored somewhere! Something has to keep track of who created files, when they were created, and who has the right to access them. And something has to remember where they live within a tree. Enter MDS, the Ceph Metadata Server. Clients accessing Ceph FS data first make a request to an MDS, which provides what they need to get files from the right OSDs.
  3. There are multiple MDSs!
  4. If you aren’t running Ceph FS, you don’t need to deploy metadata servers.
  5. So how do you have one tree and multiple servers?
  6. If there’s just one MDS (which is a terrible idea), it manages metadata for the entire tree.
  7. When the second one comes along, it will intelligently partition the work by taking a subtree.
  8. When the third MDS arrives, it will attempt to split the tree again.
  9. Same with the fourth.
  10. An MDS can even take just a single directory or file, if the load is high enough. This all happens dynamically based on load and the structure of the data, and it’s called “dynamic subtree partitioning”.
  11. Ceph FS is feature-complete but still lacks the testing, quality assurance, and benchmarking work we feel it needs to recommend it for production use.