“Write Caching” on GNR
Outline
Overview
Write path on GNR
Failure scenarios
Overview
GPFS Native RAID (GNR) implements a declustered RAID approach in order to provide better disk management and utilization than traditional storage systems.
However, users and customers who are familiar with traditional storage systems raise questions and concerns phrased in the terminology they know.
The purpose of this presentation is to explain how the write path works on GNR, and how GNR effectively has a huge write cache without really having a write cache.
Write path on GNR
There are several entities that may participate in a write operation on GNR (summarized in the sketch after the diagram):
 Pagepool: Volatile, pinned memory
 logTip: Non-shared, mirrored, NVRAM-based storage:
● Faster than SSD
● Replicated between the GNR nodes using a proprietary protocol
 logTipBackup: Shared, SSD-based
● Used when one node is down
 logHome: Shared, protected (replicated) storage on shared disks
● Accessible by both nodes
 Home Location: The final destination of a data block
● These are the data or metadata vdisks used by the filesystem
[Diagram: two GNR nodes connected over IB/ETH, each with a pagepool and an NVRAM-based logTip; a shared SSD holds the logTipBackup, and shared SAS magnetic disks hold the logHome and the home-location vdisks.]
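The entities above can also be summarized in a small, purely illustrative Python sketch; the class, field, and entry names are assumptions made for this summary and are not GNR data structures.

```python
from dataclasses import dataclass

@dataclass
class WritePathEntity:
    name: str        # entity name as used in this presentation
    medium: str      # backing medium
    shared: bool     # accessible from both GNR nodes?
    volatile: bool   # content lost on power failure?

# One entry per entity described above (illustrative only).
WRITE_PATH_ENTITIES = [
    WritePathEntity("pagepool",      "pinned RAM",                shared=False, volatile=True),
    WritePathEntity("logTip",        "NVRAM (mirrored)",          shared=False, volatile=False),
    WritePathEntity("logTipBackup",  "SSD",                       shared=True,  volatile=False),
    WritePathEntity("logHome",       "replicated magnetic disks", shared=True,  volatile=False),
    WritePathEntity("home location", "data/metadata vdisks",      shared=True,  volatile=False),
]
```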
Write path on GNR - “full track writes”
In GNR, as in many traditional storage systems, full track (full stripe) writes bypass the write cache (a.k.a. write-through). Using the write cache for these types of writes, including the mirroring overhead, actually degrades performance in most cases.
The write operation is only acknowledged once the block is safe in its home location. Hence, there is no risk of losing data if a failure occurs during a write-through.
The data is still kept in the pagepool as a “read cache”.
[Diagram: a full track write flows from the NSD layer into the pagepool and is written directly to the magnetic disk; the ack is returned only after the disk write completes.]
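A minimal Python sketch of this write-through ordering follows; the function and object names (write_full_track, _StubVdisk, and so on) are hypothetical and only illustrate the sequence: home-location write first, ack last, pagepool kept as a read cache.

```python
class _StubVdisk:
    """Toy stand-in for a data/metadata vdisk (the home location)."""
    def __init__(self):
        self.blocks = {}

    def write(self, track_id, data):
        self.blocks[track_id] = data


def write_full_track(pagepool, home_location, track_id, data):
    """Illustrative write-through path for a full track write (not GNR code)."""
    pagepool[track_id] = data            # kept only as a read cache
    home_location.write(track_id, data)  # the block goes straight to its home location
    return "ack"                         # acknowledged only after the block is safe on disk


print(write_full_track({}, _StubVdisk(), track_id=7, data=b"full track"))  # -> ack
```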
Write path on GNR - “small writes”
● Writes that are smaller than a full track are treated differently. In such cases, GNR uses a journal (log) mechanism built from a combination of media types (NVRAM, SSD, and magnetic disks). This improves small-write performance without introducing a data-loss risk.
● GNR uses a log-based caching and recovery approach – the “fast write log”. The log is divided into the following two tiers in order to better exploit the characteristics of the different media (see the sketch after this list):
 “ultra fast” logTip: Uses internal NVRAM on each node. The content is replicated to the other node's NVRAM using a dedicated protocol (NSPD). The logTip absorbs bursts of small writes.
 logHome: Represents the next tier of the GNR logging mechanism. It uses magnetic disks to store batches of changes.
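The two log tiers can be pictured with the following illustrative Python sketch; the class names, capacity, and "full" threshold are invented for the example and are not GNR parameters.

```python
from dataclasses import dataclass, field

@dataclass
class LogTip:
    """NVRAM tier: absorbs bursts of small writes; mirrored to the peer node in GNR."""
    capacity_records: int = 1024                  # invented limit for the example
    records: list = field(default_factory=list)   # (track_id, data) pairs

    def append(self, record):
        self.records.append(record)               # in GNR this is also replicated via NSPD
        return len(self.records) >= self.capacity_records   # True once the tip is "full"

@dataclass
class LogHome:
    """Magnetic-disk tier: receives large, batched writes of many small changes."""
    batches: list = field(default_factory=list)

    def append_batch(self, records):
        self.batches.append(list(records))        # one big sequential I/O per batch
```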
Write path on GNR - “small writes”
Small writes (and certain other write types) first arrive in the pagepool and are then recorded in the mirrored logTip. Once the log record is safe, an ack is sent to the NSD layer. This guarantees that those small writes are committed as fast as possible, and it also allows GNR to optimize their order (coalesce writes).
When the logTip gets full, or after a specified time threshold has elapsed, the data is moved to the logHome.
Note: The data is actually written from the pagepool, not from the logTip. The logHome write will usually be a large I/O, as it batches many small writes.
Later on, the data is destaged from the pagepool to the home location.
[Diagram: (1) a small write lands in the pagepool and is recorded in the replicated logTip (NVRAM on both GNR nodes), after which the ack is returned; (2) the data is later written from the pagepool to the shared logHome disks; (3) it is finally destaged from the pagepool to its home location on the magnetic disks.]
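Continuing the log-tier sketch above, the three-step small-write sequence could be sketched roughly as follows; every function name here is illustrative and does not correspond to GNR code.

```python
def small_write(pagepool, log_tip, track_id, data):
    """Step 1: buffer in the pagepool, record in the mirrored logTip, then ack."""
    pagepool[track_id] = data
    tip_full = log_tip.append((track_id, data))   # the log record survives a node failure
    return "ack", tip_full                        # ack as soon as the record is safe


def flush_log_tip(pagepool, log_tip, log_home):
    """Step 2: on 'tip full' or a timeout, write one large batch to the logHome.

    The batch is built from the pagepool copies (not read back from the logTip)
    and coalesces many small writes into a single large I/O.
    """
    log_home.append_batch((tid, pagepool[tid]) for tid, _ in log_tip.records)
    log_tip.records.clear()


def destage(pagepool, home_location, track_id):
    """Step 3: later, write the data from the pagepool to its home location."""
    home_location.write(track_id, pagepool[track_id])
```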
Failure scenarios
Based on the explanation so far, data is only ever written into the logs. During normal operation this data is never read back, because all destage writes are made from the pagepool. The only case in which the logs are read is when recovering from a failure.
There are two major failure scenarios:
• Single node failure
• Dual node failure
Note: While there are other cases that GNR takes into account, they are outside the scope of this discussion.
Failure scenarios – full track writes
A full track write follows the write-through model, so there is no cache content to worry about. However, GNR still needs to handle failures that occur in the middle of the write operation, known as torn writes.
In the full track write case, GNR writes the new data to unallocated space and then logs the new location of the track. If a failure occurs in the middle of the write, for example after 50% of the new data has been written, the old content remains undisturbed.
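The torn-write protection described here is essentially an out-of-place update followed by a logged location switch. A hedged sketch, with a toy allocator and invented names, might look like this:

```python
def full_track_write_out_of_place(disk, track_table, location_log, track_id, new_data):
    """Illustrative out-of-place full track update (not GNR's actual mechanism).

    The new data is written to unallocated space first; the track's new location
    is logged only after that write completes. A crash midway leaves the track
    table pointing at the old, undisturbed copy.
    """
    new_addr = max(disk, default=-1) + 1        # toy allocator: next unused slot
    disk[new_addr] = new_data                   # step 1: write out of place (may be torn)
    location_log.append((track_id, new_addr))   # step 2: log the new location
    track_table[track_id] = new_addr            # step 3: the track now points at the new copy


# The old copy at address 0 stays intact until the new write has been logged.
disk, table, log = {0: b"old"}, {7: 0}, []
full_track_write_out_of_place(disk, table, log, track_id=7, new_data=b"new")
```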
Failure scenarios – single node failure
If a node fails, the other node needs to continue committing changes from where the failed node left off. This is done using the logs.
Recovery is performed on a per-recovery-group basis.
The logTip is readable on the surviving node because it was mirrored, and the logHome is on the shared disks, which are accessible from the surviving node.
During recovery, the surviving node reads the uncommitted data from the logs and commits it to the spinning disks.
In case of a single node failure, to make sure that the logTip content remains highly available, GNR writes the content of the logTip to an unreplicated shared SSD (the logTipBackup) to create a second copy of it. Although the SSD is slower than NVRAM, it keeps a copy of the content available even during the node failure. If this SSD fails as well, GNR writes the content directly to the logHome.
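Putting this together, the recovery performed by the surviving node can be sketched as a per-recovery-group log replay, reusing the LogTip and LogHome classes from the earlier sketch; the function and call names are illustrative, not GNR internals.

```python
def recover_recovery_group(log_tip, log_home, home_location):
    """Illustrative single-node recovery for one recovery group (not GNR code)."""
    # Replay the older, batched records first (logHome), then the newest ones (logTip).
    # If the mirrored NVRAM copy of the logTip were unavailable, the logTipBackup SSD
    # copy would be read instead (not modeled here).
    for batch in log_home.batches:
        for track_id, data in batch:
            home_location.write(track_id, data)
    for track_id, data in log_tip.records:
        home_location.write(track_id, data)
```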
Failure scenarios – dual node failure
In GNR, a dual node failure is equivalent to a complete failure of the storage unit.
If the GPFS filesystem above uses replication, the overall system might still be operational, and the new writes coming in do not need to be handled by the failed unit.
When the system is brought back up, each node reads its own relevant log entries during the RG recovery in order to bring the unit back to a consistent state.
Since no dirty data exists only in volatile memory (it is always protected in the logs), no data is lost even in this case.
Note: Dirty data is committed data that has not yet been written to its home location.