3. Overview
GPFS Native RAID (GNR) implements a declustered RAID approach that provides better disk management and utilization than traditional storage methods.
However, users and customers who are familiar with traditional storage raise questions and concerns using the terminology they already know.
The purpose of this presentation is to explain how the write path works on GNR, and how GNR effectively has a huge write cache without really having one.
5. Write path on GNR
There are several entities that may participate in a write operation on GNR:
Pagepool: Volatile, pinned memory.
logTip: Non-shared, mirrored, NVRAM-based storage:
● Faster than SSD
● Replicated between the GNR nodes using a proprietary protocol
logTipBackup: Shared, SSD-based storage:
● Used when one node is down
logHome: Shared, protected (replicated) storage on the shared disks:
● Accessible by both nodes
Home Location: The final destination of a data block:
● These are the data or metadata vdisks used by the file system.
These entities are modeled in a short sketch after the diagram below.
[Diagram: two GNR nodes connected over IB/ETH, each with a pagepool and NVRAM; the logTip lives in the NVRAM, the logTipBackup on a shared SSD, and the logHome and the home location on shared magnetic disks attached over SAS.]
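To keep the rest of the discussion concrete, the entities listed above can be modeled as a small table in code. This is purely an illustrative sketch in Python; the class and field names are invented and do not correspond to GNR internals.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WritePathEntity:
    """Hypothetical model of one entity in the GNR write path."""
    name: str
    medium: str       # what the entity is built on
    shared: bool      # accessible by both GNR nodes
    volatile: bool    # contents lost on power failure

ENTITIES = [
    WritePathEntity("pagepool",      medium="pinned RAM",    shared=False, volatile=True),
    WritePathEntity("logTip",        medium="NVRAM",         shared=False, volatile=False),  # mirrored to the partner node
    WritePathEntity("logTipBackup",  medium="SSD",           shared=True,  volatile=False),
    WritePathEntity("logHome",       medium="magnetic disk", shared=True,  volatile=False),  # replicated
    WritePathEntity("home location", medium="magnetic disk", shared=True,  volatile=False),  # data/metadata vdisks
]
```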
6. Write path on GNR - “full track writes”
In GNR, as in many traditional storage systems, full-track (full-stripe) writes bypass the write cache (a.k.a. write-through). Using the write cache for these writes, including the mirroring overhead it entails, actually degrades performance in most cases.
The write operation is only acknowledged once the block is safe in its home location, so a write-through carries no risk of losing data in case of failure.
The data is still stored in the pagepool as a read cache. A short sketch of this flow follows the diagram below.
[Diagram: full-track write on a GNR node; the NSD layer writes into the pagepool, the data is written through to the magnetic disk, and only then is the ack returned.]
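The control flow can be sketched in a few lines of Python. This is a simplified illustration under invented names (`FakeDisk`, `full_track_write`), not GNR code.

```python
class FakeDisk:
    """Stand-in for a home-location vdisk; write() is assumed synchronous."""
    def __init__(self):
        self.blocks = {}

    def write(self, track_id, data):
        self.blocks[track_id] = data

def full_track_write(pagepool, home, track_id, data):
    """Write-through path: the block goes straight to its home location."""
    home.write(track_id, data)   # synchronous write; no log entry needed
    pagepool[track_id] = data    # keep a copy in the pagepool as a read cache
    return "ack"                 # acknowledged only after the data is safe

pagepool = {}
disk = FakeDisk()
assert full_track_write(pagepool, disk, track_id=7, data=b"full track") == "ack"
```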
7. Write path on GNR - “small writes”
● Writes that are small in size, typically less than a full track, are treated differently. In such cases a journal (log) mechanism built on a sophisticated combination of media types (NVRAM, SSD, and magnetic disks) is used. This improves the performance of those small writes without imposing any risk of data loss.
● GNR uses a log-based data caching and recovery approach, the “fast write log”. The log is divided into the following two types in order to better utilize the different storage characteristics:
“Ultra fast” logTip: Uses internal NVRAM on each node. The content is replicated to the other node's NVRAM using a dedicated protocol (NSPD); see the sketch after this list. The logTip holds bursts of small writes.
logHome: The logHome represents another tier in the GNR logging mechanism. It uses magnetic disks to store batches of changes.
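The mirrored logTip append could look roughly like the sketch below. The real NSPD protocol is not described in this presentation, so this is a hedged approximation with invented names; the logTipBackup fallback anticipates the single-node-failure discussion later on.

```python
def logtip_append(local_nvram, partner_nvram, backup_ssd, entry):
    """Illustrative mirrored append to the logTip.

    A small write is durable only once two non-volatile copies exist:
    normally the local NVRAM plus the partner node's NVRAM, or the
    local NVRAM plus the shared logTipBackup SSD when the partner is down.
    """
    local_nvram.append(entry)        # local NVRAM copy
    if partner_nvram is not None:
        partner_nvram.append(entry)  # replicated copy (stands in for NSPD)
    else:
        backup_ssd.append(entry)     # partner down: use the shared SSD
    return "ack"                     # safe to acknowledge the small write
```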
8. Write path on GNR - “small writes”
Small writes (and other types of writes) first arrive in the pagepool, and are then saved in the mirrored logTip. Once the write is saved, an ack is sent to the NSD layer. This behavior guarantees that those small writes are committed as fast as possible, and also allows their order to be optimized (writes can be coalesced).
When the logTip gets full, or after a specified time threshold has elapsed, the data is moved to the logHome.
Note: The data is actually moved from the pagepool, not from the logTip. The logHome write is usually a large I/O, as it includes many small writes.
Later on, the data is destaged from the pagepool to the home location. The whole sequence is sketched after the diagram below.
[Diagram: small write on a GNR node; (1) the NSD layer writes into the pagepool, (2) the data is journaled in the logTip, replicated to the second node's NVRAM, and acknowledged, (3) batches are later written to the shared logHome disk and eventually to the magnetic disks.]
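A minimal sketch of this whole sequence, assuming hypothetical names, capacities, and thresholds (`SmallWritePath`, `LOGTIP_CAPACITY`, and so on are all invented):

```python
import time

class SmallWritePath:
    """Illustrative sketch of the small-write sequence (steps 1-3 above)."""

    LOGTIP_CAPACITY = 64      # hypothetical entry limit
    FLUSH_INTERVAL = 5.0      # hypothetical time threshold, in seconds

    def __init__(self, logtip, loghome, home):
        self.pagepool = {}    # block_id -> data, in volatile pinned RAM
        self.logtip = logtip  # mirrored NVRAM journal (a plain list here)
        self.loghome = loghome
        self.home = home
        self.last_flush = time.monotonic()

    def small_write(self, block_id, data):
        self.pagepool[block_id] = data
        self.logtip.append((block_id, data))      # 1: journal in the logTip
        if (len(self.logtip) >= self.LOGTIP_CAPACITY or
                time.monotonic() - self.last_flush > self.FLUSH_INTERVAL):
            self._flush_to_loghome()              # 3: logTip full or timer hit
        return "ack"                              # 2: safe, so acknowledge

    def _flush_to_loghome(self):
        # The batch is taken from the pagepool, not read back from the logTip,
        # and lands in the logHome as one large I/O covering many small writes.
        batch = [(bid, self.pagepool[bid]) for bid, _ in self.logtip]
        self.loghome.write_batch(batch)
        self.logtip.clear()
        self.last_flush = time.monotonic()

    def destage(self):
        """Later on: move the data from the pagepool to its home location."""
        for block_id, data in self.pagepool.items():
            self.home.write(block_id, data)
```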
9. Failure scenarios
Based on the explanation so far, data is only ever written into the logs; it is never read back from them during normal operation, because all writes to the home location are made from the pagepool. The only case in which the logs are used is recovery from a failure.
There are two major failure scenarios:
• Single node failure
• Dual node failure
Note: While there are other cases that GNR takes into account, they are outside the scope of this discussion.
10. Failure scenarios – Full track writes
A full-track write follows the write-through model, in which the cache content is not a concern. However, GNR still needs to handle failures in the middle of the write operation, known as torn writes.
For a full-track write, GNR writes the new data to unallocated space and then logs the new location of the track. In case of a failure in the middle of the write, for example after writing 50% of the new data, the old content remains undisturbed, as illustrated below.
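This scheme can be illustrated with a toy function; `disk.allocate`, `disk.write_at`, and the other names are invented for the example, not GNR interfaces.

```python
def full_track_write_recoverable(disk, track_map, location_log, track_id, new_data):
    """Illustrative torn-write protection for a full-track write.

    The new data goes to unallocated space first; the track only "moves"
    once its new location has been logged. A crash before that point
    leaves the old content referenced and undisturbed.
    """
    new_location = disk.allocate(len(new_data))     # pick unallocated space
    disk.write_at(new_location, new_data)           # a crash here tears only the new copy
    location_log.append((track_id, new_location))   # log the new location of the track
    track_map[track_id] = new_location              # old data now becomes free space
```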
11. Failure Scenarios – Single node failure
If a node fails, the other node needs to continue committing changes from where the failed node left off. This is done using the logs, on a per-recovery-group basis.
The logTip is readable on the surviving node because it was mirrored, and the logHome is on the shared disks, accessible from the other node.
During recovery, the node reads the uncommitted data from the logs and commits it to the spinning disks.
To keep the logTip content highly available while only a single node is up, GNR writes the content of the logTip to an unreplicated shared SSD (the logTipBackup), creating a second copy of it. While the SSD is slower than the NVRAM, it maintains a copy of the content, keeping the data available even during a node failure. If this SSD fails as well, GNR writes the content directly to the logHome.
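The per-recovery-group replay might be sketched as below; this is a hypothetical illustration (the names and the newest-first ordering are assumptions), not the actual GNR recovery algorithm.

```python
def recover_recovery_group(log_sources, home):
    """Illustrative replay of uncommitted log entries after a node failure.

    `log_sources` are ordered newest-first (mirrored logTip, logTipBackup,
    logHome); within each append-only log the newest entry is last, so each
    log is walked in reverse and the first version seen of a block wins.
    """
    replayed = set()
    for log in log_sources:
        if log is None:                     # that log copy is unavailable
            continue
        for block_id, data in reversed(log):
            if block_id not in replayed:
                home.write(block_id, data)  # commit to the spinning disks
                replayed.add(block_id)
    return replayed
```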
12. Failure scenarios – dual node failure
In GNR, a dual-node failure is equivalent to a complete failure of the storage unit.
If the GPFS file system above uses replication, the overall system might still be operational; in that case the new writes coming in are not a concern for the failed unit.
When the system is brought back up, each node reads its own relevant log entries during the recovery group (RG) recovery in order to come back to a consistent state.
Since all dirty data is journaled on non-volatile media, no data is lost even in such a case.
Note: Dirty data is committed data that has not yet been written to its home location.