RIO Distribution: Reconstructing the onion!
Shyamsundar Ranganathan
Developer
What's with the onion anyway!
● View the xlators in the volume graph as layers of an onion
○ At the heart is the posix xlator, and the outermost is the access interface
○ Each layer has well-understood, standard membranes
○ But, unlike an onion, each layer functions differently!
● More metaphorically,
○ Peeling the layers apart can make you cry
○ But it is a cleansing ritual (at times!)
● This presentation is about:
"How is the volume graph reconstructed when changing distribution to RIO?"
RIO Distribution in a nutshell
● RIO stands for Relation Inherited Object distribution
○ or, as it was formerly known, DHT2
● Objects:
○ inode objects
■ further classified as directory inodes and file inodes
○ data objects
■ exist for file inodes only
The file system objects (example)
[Diagram: the user view of a small tree (root containing File1, Dir1, File2, Dir2), then the same tree decomposed into file system objects: a Dir Object per directory inode, a File Object per file inode, and a Data Object holding each file's data.]
RIO Distribution in a nutshell (contd.)
● Relations:
○ inode objects to parent inode objects
○ data objects to file inode objects
Object relations
[Diagram: the same example tree, with each inode object related to its parent inode object, and each data object related to its file inode object.]
RIO Distribution in a nutshell (contd.)
● Inheritance:
○ a file inode inherits its location from the parent inode
■ Directory inodes do not inherit this property, hence achieving distribution
○ a data object inherits the GFID/inode# of the file inode (see the placement sketch below)
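To make the inheritance rules concrete, here is a minimal placement sketch in plain Python (illustrative only, not RIO code; the subvolume names, `layout_hash`, and the SHA-1 + modulo hash are all stand-ins for the real per-ring layout):

```python
# Illustrative sketch of RIO-style placement, NOT actual GlusterFS code.
# Assumption: a SHA-1 + modulo hash stands in for the real per-ring layout.
import hashlib
import uuid

MDC_SUBVOLS = ["mdc-0", "mdc-1"]               # metadata ring: few bricks
DC_SUBVOLS = ["dc-0", "dc-1", "dc-2", "dc-3"]  # data ring: many bricks

def layout_hash(gfid: uuid.UUID, subvols: list) -> str:
    """Map a GFID onto one subvolume of a ring."""
    h = int(hashlib.sha1(gfid.bytes).hexdigest(), 16)
    return subvols[h % len(subvols)]

def place_dir_inode(dir_gfid: uuid.UUID) -> str:
    # Directory inodes do NOT inherit: each hashes by its own GFID,
    # which is what spreads the namespace across the metadata ring.
    return layout_hash(dir_gfid, MDC_SUBVOLS)

def place_file_inode(parent_gfid: uuid.UUID) -> str:
    # File inodes inherit their location from the parent directory,
    # so a directory and its entries land on one MDC subvolume.
    return layout_hash(parent_gfid, MDC_SUBVOLS)

def place_data_object(file_gfid: uuid.UUID) -> str:
    # Data objects inherit the file's GFID, resolved against the data ring.
    return layout_hash(file_gfid, DC_SUBVOLS)

dir1, file1 = uuid.uuid4(), uuid.uuid4()
print("Dir1 inode on:", place_dir_inode(dir1))
print("File1 entry/inode on:", place_file_inode(dir1))  # parent is Dir1
print("File1 data on:", place_data_object(file1))
```

The point is only that a file inode's placement is a pure function of its parent's GFID, while a data object's placement is a pure function of the file's own GFID.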
Object inheritance
[Diagram: directory and file inode objects placed on the Metadata Ring (few bricks); data objects placed on the Data Ring (many bricks); each ring is made up of its own bricks/subvolumes.]
RIO Distribution in a nutshell (contd.)
● Salient principles:
○ Each directory belongs to a single RIO subvolume
○ A file's inode and its data are separated into different subvolumes
■ Thus, there are 2 rings for distribution of objects
○ Layout is per ring and is common to all objects in the ring
● Rationale:
○ Improve scalability and consistency
○ Retain, and in some cases improve, metadata performance
○ Improve rebalance
Peeking into the RIO layer
● Consistency handling
○ Handle cross-client operation consistency from a single point
○ Funnel entry operations to the parent inode's location, and metadata operations to the inode's location
○ Hence, RIO is split into RIOc(lient) and RIOs(erver)
■ RIOc is a router for operations
■ RIOs manages FOP consistency and transaction needs
● Journals (see the sketch below)
○ Transactions need journals to avoid orphan inodes
○ Managing the on-disk cache of an inode's size and times, based on data operations, needs journals to track dirty inodes
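As a rough illustration of the orphan-inode problem: a create is a multi-step transaction (allocate the inode, then link the name), and a crash between the steps strands the inode. A hedged sketch of the intent-journal pattern, with all names invented and no claim about RIOs's actual journal format:

```python
# Sketch of an intent journal for a create transaction; illustrative only.
journal = {}    # txn_id -> pending operation record
inodes = {}     # gfid -> inode metadata
dentries = {}   # (parent_gfid, name) -> gfid

def create(txn_id, parent_gfid, name, gfid):
    journal[txn_id] = ("create", parent_gfid, name, gfid)   # 1. log intent
    inodes[gfid] = {"type": "file", "parent": parent_gfid}  # 2. allocate inode
    dentries[(parent_gfid, name)] = gfid                    # 3. link the name
    del journal[txn_id]                                     # 4. retire the entry

def recover():
    # Replayed after a crash: a journalled create whose name link is
    # missing left an orphan inode, which can now be reclaimed.
    for txn_id, (op, parent, name, gfid) in list(journal.items()):
        if op == "create" and (parent, name) not in dentries:
            inodes.pop(gfid, None)
        journal.pop(txn_id)

create("txn-1", "root-gfid", "File1", "file1-gfid")
recover()  # a no-op here, since the create completed
```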
Rebuilding the Onion (DHT -> RIO)
[Diagram: two volume graphs side by side. Before: <prior xlators> -> DHT -> Client/Server Protocol -> POSIX xlator. After: <prior xlators> -> RIO Client -> Client/Server Protocol -> RIO Server -> POSIX xlator. Legend: other intervening xlators / direct descendant in graph / changed layer.]
Changed on-disk format
● New on-disk backend, like the .glusterfs namespace on the bricks (a path sketch follows below)
● Reuse the existing posix xlator for inode and fd OPs
○ redefine entry FOPs (and lookup)
● Re-add dentry backpointers
○ needed for hard link and rename operations
● Extend the utime xlator work, for caching:
○ time information (a/mtime)
○ size information (size, blocks, IO block)
[Diagram: the RIO graph with a POSIX2 xlator in place of POSIX: <prior xlators> -> RIO Client -> Client/Server Protocol -> RIO Server -> POSIX2 xlator.]
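For flavor, here is a sketch of a GFID-addressed backend path in the style of the existing .glusterfs namespace, with the dentry backpointer modelled as an xattr-like record (the helper names and the trusted.rio.* keys are invented, not the actual POSIX2 on-disk format):

```python
import os
import uuid

def gfid_backend_path(brick_root: str, gfid: uuid.UUID) -> str:
    """<brick>/.glusterfs/<first 2 hex chars>/<next 2>/<full gfid>"""
    g = str(gfid)
    return os.path.join(brick_root, ".glusterfs", g[0:2], g[2:4], g)

def dentry_backpointer(parent_gfid: uuid.UUID, name: str) -> dict:
    # Hypothetical backpointer record: lets the server walk from an
    # inode back to (parent GFID, name); hard link and rename must
    # keep these in sync, which is why they are being re-added.
    return {"trusted.rio.pgfid": str(parent_gfid), "trusted.rio.name": name}

gfid = uuid.uuid4()
print(gfid_backend_path("/bricks/brick0", gfid))
print(dentry_backpointer(uuid.uuid4(), "File1"))
```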
Differentiating the 2 rings
● The server components on the metadata (MDC) and data (DC) rings have differing functionality (a toy dispatch sketch follows the diagram)
○ Entry operations are all on the MDC
○ Data operations are all on the DC
○ Some other operations can happen across both (e.g. fsync, xattrs)
● Orphan inodes are an MDC feature
● Dirty inodes are an MDC feature
● The MDC needs to talk to the DC for dirty inode cache updates
[Diagram: below the RIO Client the graph forks into a metadata path (RIO MDS Server -> POSIX2 MDS xlator) and a data path (RIO DS Server -> POSIX2 DS xlator), under <prior xlators>.]
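One way to picture the split is as a routing table from FOP class to ring. A toy sketch of the classification in the bullets above (the FOP sets are abbreviated and illustrative; e.g. unlink and rename are also entry FOPs):

```python
# Toy routing table from FOP class to ring; illustrative only.
ENTRY_FOPS = {"create", "mkdir", "link", "symlink"}   # all on the MDC
DATA_FOPS = {"read", "write", "truncate"}             # all on the DC
BOTH_FOPS = {"fsync", "setxattr", "getxattr"}         # may touch both rings

def rings_for(fop: str) -> list:
    if fop in ENTRY_FOPS:
        return ["MDC"]
    if fop in DATA_FOPS:
        return ["DC"]
    if fop in BOTH_FOPS:
        return ["MDC", "DC"]
    return ["MDC"]  # other metadata ops (stat, setattr, ...) go to the MDC

for fop in ("create", "write", "fsync"):
    print(fop, "->", rings_for(fop))
```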
Changing the abstractions
● iatt assumptions:
○ Operations on data and metadata would return incomplete iatt information
○ Looking at iatt2 or statx-like extensions to disambiguate the same (see the sketch below)
● Anonfd:
○ Changes to data need to be reflected in the metadata cache (dirty inodes)
○ Hence, active inodes need to be tracked on the metadata servers
○ Thus, ending the use of anonfds!?
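The statx-like fix hinted at above amounts to carrying a validity mask alongside the attributes, so a caller can tell which fields a metadata-ring or data-ring reply actually filled in. A hypothetical sketch (field and flag names invented, not a proposed iatt2 layout):

```python
from dataclasses import dataclass

# Hypothetical iatt2-style struct: as with statx(2), an explicit mask
# records which fields this particular reply actually filled in.
IATT_SIZE = 1 << 0
IATT_MTIME = 1 << 1
IATT_MODE = 1 << 2

@dataclass
class Iatt2:
    valid: int = 0   # bitmask of the IATT_* flags above
    size: int = 0
    mtime: float = 0.0
    mode: int = 0

# A data-ring reply can report size without pretending to know mode:
reply = Iatt2(valid=IATT_SIZE | IATT_MTIME, size=4096, mtime=1.5e9)
if reply.valid & IATT_MODE:
    print("mode:", oct(reply.mode))
else:
    print("mode not returned; fetch it from the metadata ring")
```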
Adding the availability layers
● Consistency handling by RIOs needs a leader; without one,
○ the <n> RIOs instances on the replicas need to resolve who creates the inode and who links the name
○ locking becomes a cross-replica-subvolume operation, reducing efficiency
● This brings in LEX (the Leader Election Xlator) and possibly a DRC-like (Duplicate Request Cache) xlator
● The MDC needs replication, and never disperse!
○ There is no data on the MDC
○ Only metadata needs to be replicated and made available
Adding the availability layers
[Diagram: on the metadata path, the RIO MDS Server sits over AFR, whose Client/Server (Leader) and Client/Server (Follower) legs reach POSIX2; on the data path, the RIO DS Server reaches AFR/Disperse through the Client/Server Protocol, all under <prior xlators>.]
As we scale out...
[Diagram: the volume graph as it scales out. NOTE: the legend in slide 12 does not hold good for this image.]
Thoughts on some other layers
● Unsupported
○ Quota, Tier (?)
● Sharding
○ Sharding in DHT is based on name distribution; RIO is based on GFID-based distribution
■ A shard still gets a new GFID, but the cost is higher with this methodology
○ Thinking of ways to distribute data based on offsets, and leave the rest of the shard handling to its own xlator (see the sketch after this list)
● GFProxy does not need to change much
○ Possibly the DC can have a proxy while the MDC goes without one, when #MDS << #DS
● Layers and functionality that need to adapt
○ Tier(?), geo-rep
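As a rough sketch of the offset-based idea above (purely speculative, mirroring the "thinking of ways" bullet; the shard size and the GFID derivation are invented): each fixed-size shard of a file maps to its own data object, derived deterministically from the file GFID and the shard index, so placement needs no per-shard metadata lookup:

```python
import uuid

SHARD_SIZE = 64 * 1024 * 1024  # an assumed shard size of 64 MiB

def shard_data_object(file_gfid: uuid.UUID, offset: int) -> uuid.UUID:
    """Derive a per-shard data-object GFID from the file GFID and offset.

    Speculative: uuid5 over (file gfid, shard index) gives every shard a
    deterministic GFID that the data ring can place independently.
    """
    index = offset // SHARD_SIZE
    if index == 0:
        return file_gfid  # first shard stays in the file's own data object
    return uuid.uuid5(uuid.NAMESPACE_OID, f"{file_gfid}/{index}")

f = uuid.uuid4()
print(shard_data_object(f, 0))                  # shard 0
print(shard_data_object(f, 200 * 1024 * 1024))  # shard 3
```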
Status
● Work is happening on the experimental branch
○ the RIOc/s (ds/mds) and POSIX2 xlators are being worked on
○ Current contributors: Kotresh, Susant, Shyam
● Abilities supported as of this writing:
○ directory/file creation and data operations
○ xattr, stat for all inodes
● Missing abilities:
○ directory listing, unlink, rename
● Target is an alpha release with Gluster 4.0
○ tracked using GitHub issue #243
Questions?
