SlideShare a Scribd company logo
1 of 66
Download to read offline
The Forefront of the Development for NVDIMM
on Linux Kernel (Linux Plumbers conf. ver.)
2021/Sep/23
Yasunori Goto (Fujitsu Limited.)
Ruan Shiyang (Nanjing Fujitsu Nanda Software Technology Co., Ltd.)
Copyright 2021 FUJITSU LIMITED
0
Agenda
◼ Summary of current status of developmentfor NVDIMM (Yasunori)
◼ Basis of NVDIMM on Linux
◼ Issues of Filesystem-DAX (Direct Access mode)
◼ Deep dive to solve the issues of Filesystem-DAX(Ruan)
◼ Support reflink & dedupe for fsdax
◼ Fix NVDIMM-based Reverse mapping
◼ Conclusion
Copyright 2021 FUJITSU LIMITED
1
Self introduction
◼ Yasunori Goto
◼ I have worked for Linux and related OSS since 2002
• Development for Memory hotplug feature of Linux Kernel
• Technical Support for troubles of Linux Kernel
• etc.
◼ Currently, leader of Fujitsu Linux Kernel Developmentteam
◼ In the last few years, I have mainly worked for NVDIMM
• some enhancement for RAS of NVDIMM
• For Fault location feature
• For Fault predictionfeature
“The ideal and reality of NVDIMM RAS”
https://www.slideshare.net/ygotokernel/the-ideal-and-reality-of-nvdimm-ras-newer-version
Copyright 2021 FUJITSU LIMITED
2
Basis of NVDIMM on Linux
Copyright 2021 FUJITSU LIMITED
3
Characteristics of Non-Volatile DIMM (NVDIMM)
◼ Persistent memory device which can be inserted to DIMM slot like DRAM
◼ CPU can read/write NVDIMM directly
◼ It can keep data persistency even if system is powered down or rebooted
◼ Latency, capacity, and cost have characteristics between DRAM and NVMe
◼ Use case
◼ Example
• In memory Database
• Hierarchical storage, distributedstorage
• Key-Value-store
◼ Famous Product
◼ Intel Data Center Persistent Memory Module (DCPMM)
Copyright 2021 FUJITSU LIMITED
CPU
cache
DRAM
NVDIMM
NVMe
SSD (SAS/SATA)
HDD
Capacity
Latency cost
4
Impact of NVDIMM
Copyright 2021 FUJITSU LIMITED
Page cache
System call
(Switch to kernel mode)
Sync()/fsync()/msync()
It was created for SLOW I/O storage, but
NVDIMM is fast enough without page cache
If user process calls cpu flush instruction,
then it is enough to make persistency
• Application can read/write to NVDIMM directly
• Even system call may be waste of time due to
switching between kernel mode and user mode
Traditional I/O layer becomes redundant for NVDIMM
New interface is expected for NVDIMM!
5
◼ Many software assumes that memory is VOLATILE yet
NVDIMM is difficult for traditional software
Copyright 2021 FUJITSU LIMITED
• In older CPU, its cache is still volatile
• If system power down suddenly,then some
data may not be stored
Need to prepare for sudden powerdown
• Should not change data structures in NVDIMM
• If the structures are changed, software update
will be cause of disaster
• If the data is broken, software need todetect
it and correct it
• Software need to assign not only free area, but
also used area for reuse its data
• In addition, kernel must assign the area to
suitable process with authority check
Need data structure compatibility on NVDIMM
Need to detect / correct collapsing data Need data area management
What will be necessary for a software to use NVDIMM?
6
Confliction of requirements
Copyright 2021 FUJITSU LIMITED
Filesystem
is still useful
Filesystem
is too slow
• Traditional I/O stack is too heavy
• CPU cache flush is enough to make persistency
• Need new access interface to NVDIMM for
next generation software
• Filesystem gives many solution for the
previous considerations
• It is useful for current software
• Format compatibility of filesystem
• Data correction
• Journaling, or CoW
• region management
• authority check
• etc...
• call system call
• page cache
• block device layer
• etc…
7
Interfaces of NVDIMM (1/3)
◼ Because of the previous reasons, Linux provides some interfaces for application
◼ Storage Access(green)
• Applicationcan access NVDIMM
with traditional I/O IFlike SSD/HDD
• So, application can use this mode
without any modification
◼ Filesystem DAX(blue)
• Page cache is skippedwhen you use
read()/write() on Filesystem-DAX
• Applicationcan access NVDIMM area
directly if it calls mmap() for a file
• Need filesystem support
• Xfs, ext4…
• This mode is suitable for modifying
current applications for NVDIMM.
Copyright 2021 FUJITSU LIMITED
DAX enabled
filesystem
NVDIMM
user layer
kernel layer
Pmem driver device dax driver
/dev/dax
New App with Device
DAX
BTT driver
Traditional
filesystem
Traditional App
New App with
Filesystem DAX
PMDK
management
command
sysfs
ACPI driver
ACPI
8
Interfaces of NVDIMM (2/3)
◼ Because of the previous reasons, Linux provides some interfaces for application
◼ Device DAX(red)
• Applicationcan access NVDIMM area
directly if it calls mmap() for /dev/dax
• /dev/dax allows only open(), mmap(),
and close()
• IOW, you can not use read()/write()
nor any other system call
• For innovative newapplicationwith
NVDIMM
◼ PMDK(purple) is provided
• Set of convenient libraries and tools
for filesystem DAX and Device DAX
• Transaction support for pmem applications
• Pool management in the DAX file/device
• Etc.
Copyright 2021 FUJITSU LIMITED
DAX enabled
filesystem
NVDIMM
user layer
kernel layer
Pmem driver device dax driver
/dev/dax
New App with Device
DAX
BTT driver
Traditional
filesystem
Traditional App
New App with
Filesystem DAX
PMDK
management
command
sysfs
ACPI driver
ACPI
9
Interfaces of NVDIMM(3/3)
◼ NVDIMM is shown as device files like storage
◼ For Storage Access : /dev/pmem##s (# means number)
◼ Filesystem DAX: /dev/pmem##
◼ Device DAX: /dev/dax##.#
◼ ndctl(*) can create these device when it creates namespace
◼ Example of Filesystem DAX
◼ You can make filesystem
on /dev/pmem##s
or /dev/pmem##
(*) a set of tools/commands for NVDIMM
◼ Note: /dev/dax##.# is character device
◼ Since you cannot use read()/write() for /dev/dax##.#, you cannot use dd command for backup
◼ You need to daxio command of PMDK instead of it
Copyright 2021 FUJITSU LIMITED
$ sudo ndctl create-namespace -e "namespace0.0" -m fsdax -f
{
"dev":"namespace0.0",
"mode":"memory",
"size":"5.90 GiB (6.34 GB)",
"uuid":"dc47d0d7-7d8f-473e-9db4-1c2e473dbc8f",
"blockdev":"pmem0",
"numa_node":0
}
$ sudo ls /dev/pmem*
/dev/pmem0
10
Filesystem-DAX is still experimental status
Copyright 2021 FUJITSU LIMITED
• The management way of NVDIMM is almost same with traditionalfilesystem
• Operator can use traditional command tomanage NVDIMM area
• Not only application can access NVDIMM area directly, but alsoit can use traditional systemcall
Filesystem-DAX is very expected interface…
• The “experimental” message is shown when the filesystem is mountedwith DAX option
• There are difficult issues in kernel layer for some years
But it is still experimental….
• In contrast, Device DAX requires pool management by tools of PMDK
• Otherwise, a software need toposses whole of the namespace (/dev/dax)
• In addition, Application can NOT use many systemcall in Device DAX
What is the reason?
11
Issues of Filesystem-DAX
◼ Whatis solved, and whatis current issues
Copyright 2021 FUJITSU LIMITED
12
Why Filesystem-DAX is experimental?
◼ In summary, there are 2 big reasons
Copyright 2021 FUJITSU LIMITED
Filesystem DAX combines storage and memory characteristics
• This causes corner-case issues of Filesystem-DAX
• They are often difficult problem
More additional features were required,
but they are/were difficult to make
• Configure DAX on/off for each inode (directory or file)
• Co-existence with CoW filesystem
13
Corner Case Issue 1 : Update metadata(1/3)
Copyright 2021 FUJITSU LIMITED
The first problem is “updating metadata” of the file
metadata
data
In Filesystem-DAX, we expected
that application can make
persistency of data with only cpu
cache flush as I said….
However, this also means there is no
chance to updatemetadata by
kernel/filesystem
file
• Update time of the file may not be correct
• If an application use write some data to file on the filesystem DAX, and a user remove
some blocks of the file by truncate(2), kernel cannot negotiate it
• Data of the file may be lost
• If data transferred by DMA/RDMA to the page which is allocated as filesystem DAX,
similar problem may occur
14
Corner Case Issue 1 : Update metadata(2/3)
◼ Current Status of update metadata problems
Copyright 2021 FUJITSU LIMITED
• Solved by introducing new
MAP_SYNC flag of
mmap()
• Page fault occurs every
write access, then kernel
can update meta data
• PMDK specifies this flag
• Solved by waiting
truncate() until finishing
RMDA
• Not solved
• Truncate() can not wait the
completion of transfer,
because it may too long
time
General write access DMA/RDMA data access
Kernel/driver layer
user process layer
(E.g. infiniband,
video(v4l2))
15
Corner Case Issue 1 : Update metadata(3/3)
◼ Current Status of update metadata problems
Copyright 2021 FUJITSU LIMITED
• Sloved by introducing new
MAP_SYNC flag of
mmap()
• Page fault when write
access, then kernel
update meta data
• PMDK specifies this flag
• Solved by waiting
truncate() until finishing
RMDA
• Not solved
• Truncate() can not wait the
completion of transfer,
because it may too long
time
General write access DMA/RDMA data access
Kernel/driver layer
user process layer
(E.g. infiniband,
video(v4l2))
• In ODP, usually driver/hardware does not map the
pages of DMA/RDMA area for application
• It maps the pages when application accesses them
• Kernel/driver can coordinate metadata at the time
• Mellanox(NVIDIA) newer card has the feature
Workaround:
“On Demand Paging(ODP)”
16
Corner Case Issue 2: unbind(1/2)
◼ Unbind is a sysfs interface to disconnect / hot-remove a device
◼ Each device driver provides its handler for it
◼ Though NVDIMM is not hotplug device physically, its interface can be used to disable
and switch the mode of NVDIMM namespace
• Ex)
• To change namespace mode from Filesystem-DAX toDevice-DAX
• To allowthat users can NVDIMM like normal RAM
• Etc.
Copyright 2021 FUJITSU LIMITED
Example of“how to use Device DAX namespace like a normal RAM”
1) Remove the device from “Device DAX”infrastructure
# echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id
# echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
2) Bind it to kmem driver
# echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id
# echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind
17
Corner Case Issue 2: unbind(2/2)
◼ Unbind is likely “surprising remove” interface
◼ There is no way to fail of unbind even if a user is using it
• It must be disabled forcibly
https://lore.kernel.org/linux-btrfs/CAPcyv4g3ZwbdLFx8bqMcNvXyrob8y6sBXXu=xPTmTY0VSk5HCw@mail.gmail.com/
◼ To solve this problem, Filesystem-dax needs to disable a range of NVDIMM
area immediately
◼ Currently, this is not solved yet
◼ It will be solved after the end of Ruan's work which will be talked by him today
• His new code will help tosolve it
Copyright 2021 FUJITSU LIMITED
A race condition was reported between Filesystem-dax and unbind in 2021/Feb
18
DAX on/off for each inode (directory or file) (1/3)
◼ Expected use-cases
Copyright 2021 FUJITSU LIMITED
Need more fine-grain settings
Change DAX attribute by application
Performance tuning
Workaround when Filesystem-DAX has a bug
Configuration is always painful for administrator
If application can detect and change it, it will be helpful for them
Users may want to change the DAX mode depending on each file
Since the write latency of NVDIMM is a bit slower than RAM,
User may want to use page cache by DAX off setting
19
DAX on/off for each inode (directory or file) (2/3)
◼ What was the problem
◼ If filesystem change DAX attribute, filesystem need to change methods of filesystem between
DAX and normal file, but they may be executed yet
◼ Data of the page cache must be moved silently when the dax attribute becomes off
• These problems were very difficult
Copyright 2021 FUJITSU LIMITED
Page caches
On DRAM
xfs_vm_writepages()
xfs_iomap_set_page_dirty()
:
:
Methods for DAX file
Methods for normal file
Need change xfs_dax_writepages()
noop_set_page_dirty()
:
:
struct address
space
inode
struct address
space
inode
Nopage
cache
Normal file DAX file
20
DAX on/off for each inode (directory or file) (3/3)
◼ Fortunately, this issue was solved
◼ The DAX attribute is changed only when its inode cache is NOT loaded on memory
• Filesystem can load suitable methods for each attribute when it reloads inode tomemory
• Page caches of the file are alsodropped
• Users can use this feature with the newmount option. #mount … -o dax=inode
• The DAX attribute is changedby command
◼ Note
• All of applications which use the target file must close it to change the dax attribute
• Filesystem will postpone changing the DAX attribute until dropping inode cache and page cache of the file
Copyright 2021 FUJITSU LIMITED
Normal file DAX file
drop inode
cache
drop inode
cache
reload
reload
DAX on : $xfs_io -c 'chattr +x’ <file or directory>
DAX off: $xfs_io -c 'chattr -x’<file or directory>
21
Coexisting with CoW (reflink/dedup) filesystem (1/2)
◼ The Copy on Write feature of filesystem (xfs:reflink/dedup, btrfs)
◼ If there is a same data block on different files, filesystem can merge it as same block
◼ So far, if only filesystem manages such block, it was enough
• Since a page cache is allocated for each file of the block, memory management layer don’t need toknowit
◼ In Filesystem-DAX, it becomes problems
• Merged block equals mergedmemory itself, it affects the memory failure case
Copyright 2021 FUJITSU LIMITED
File A File B
offset 100 offset 200
22
Coexisting with CoW (reflink/dedup) filesystem (2/2)
◼ Problems
Copyright 2021 FUJITSU LIMITED
Ruan-san will talk how to solve them
• Currently, there is no implementation of
reflink/dedupefor Filesystem-DAX
• Iomap, which is newer io block
layer instead of buffer_head, has
interface for CoW filesystem
• XFS filesystem DAX also uses iomap
• But there is no code to use CoW
and DAX at the same time
Need actual CoW implementation
for FilesystemDAX
• When a memory failure occurs, need to
kill processes which use the memory
• To achieve it, kernel need to find all
processes from the merged page/block
• But a merged page has only one
struct page
• No space for plural files in it
Needs to chase plural files from a
merged page/block
23
Deep dive to solve the issue of Filesystem-DAX
◼ Supportreflink/dedupe for FSDAX*
◼ ImproveNVDIMM-basedReverse mapping
* Filesystem-DAX
Copyright 2021 FUJITSU LIMITED
24
Self-introduction
◼ Ruan Shiyang
◼ A Software Engineer of Fujitsu Nanda
◼ Experiencein Embedded development
◼ Currently focusing on Linux filesystem and persistentmemory
Copyright 2021 FUJITSU LIMITED
25
Create a XFS (reflink is enabled by default) and mount it with “dax” option
Then error occurs
The dmesg shows the reason
Background
◼ fsdax is “EXPERIMENTAL”
◼ reflink and fsdax cannot work together
Copyright 2021 FUJITSU LIMITED
$ mkfs.xfs /dev/pmem0 && mount –o dax /dev/pmem0 /mnt
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/pmem0,
missing codepage or helper program, or other error.
XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
XFS (pmem0): DAX and reflink cannot be used together!
26
reflink
◼ Files share extents for same data
◼ Advantages
◼ Fast copy
◼ Save storage
◼ Copy on Write mechanism (CoW)
◼ Copy the shared extentsbefore writing data
Copyright 2021 FUJITSU LIMITED
File A File B
Extent I
File A
Extent I Extent II
copy
File A File B
Extent I
File A
Extent I
File B
File A
Extent I
write
File B
File A
Extent I
File B
File A
Extent I
Copy then Write
File
Disk
File
Disk
File
Disk
copy
--reflink=always
27
fsdax (Filesystem DAX)
◼ A mode of a NVDIMM namespace
◼ No page cache in I/O path
◼ Allows direct mappings to persistent memory media
Copyright 2021 FUJITSU LIMITED
read()
write()
mmap()
vfs
SSD/HDD
page
cache
NVDIMM
direct access (DAX)
28
Why reflink and fsdax cannot work together?
◼Issues
1. Support reflink/dedupe for FSDAX
◼ Extent iomap interface
◼ Support CoW in fsdax
◼ Support Dedupe in fsdax
2. Improve NVDIMM-based Reverse mapping
◼ Support 1-to-N Reversed mapping for NVDIMM
Copyright 2021 FUJITSU LIMITED
29
Support reflink/dedupe for FSDAX
◼ Difference between Buffered-IO vs. FSDAX
◼ What must be implemented?
◼ How are they implemented?
Copyright 2021 FUJITSU LIMITED
30
Kernel
page
cache
Comparison: Buffered-IO vs. FSDAX write() (1/2)
Copyright 2021 FUJITSU LIMITED
Sync to disk
Get destination from XFS
( AllocateDelay extents )
CoW
Read from disk
Write data
Disk SSD / HDD
User Space write()
◼ Requires page cache
1. Read(Copy)destinationdata
- from disk
- to page cache
2. Write user data
- from userspace
- to page cache
3. Sync page cache to disk
- remap CoWextents
The progress of using page cache
indicate the CoW operation.
1
2
Mark dirty
iomap_begin()
xfs_buffered_write_iomap_begin()
iomap_end()
xfs_buffered_write_end()
actor()
iomap_write_actor()
Clean up if write fails
Buffered-IO
3
31
Kernel
Comparison: Buffered-IO vs. FSDAX write() (2/2)
Copyright 2021 FUJITSU LIMITED
Disk NVDIMM
User Space write()
◼ No page cache
- directly access to NVDIMM
1. Not Copy
- get NVDIMM addressonly
2. Directly Write
- user datato NVDIMM
- no need tosync
- cannot remap extents
FSDAX
iomap_begin()
xfs_direct_write_iomap_begin()
iomap_end()
NULL
actor()
dax_iomap_actor()
Get NVDIMM address
from direct_access()
Get destination from XFS
( Allocateimmediateextents )
Write data
1
2
CoW is missing
32
Kernel / FSDAX
What must be implemented?
◼ Extent iomap interface
◼ Introduce ‘srcmap’
◼ Fill ‘iomap’ & ‘srcmap’ for fsdax
◼ SupportCoW in fsdax
◼ Add CoW for write()
◼ Add CoW for mmap()
◼ Remap extents after CoW
◼ SupportDedupe in fsdax
◼ Add a ‘dax’ deduplication
Copyright 2021 FUJITSU LIMITED
• ALLOCATE new extent for CoW
• STORE source extent info
• COPY source data to destination
• WRITE user data to NVDIMM
• REMAP the extents of this file
Enhancement in iomap framework
iomap_begin()
xfs_direct_write_iomap_begin()
iomap_end()
xfs_dax_write_iomap_end()
actor()
dax_iomap_actor()
33
iomap interface - Introduce ‘srcmap’
◼ What is necessary
◼ Require source info for CoW
• Only use ‘iomap’ to tell destination to write
• but CoW requires more info
• source bno: where to copy from
• source length: howmuch tocopy
◼ How?
◼ Introduce another struct called ‘srcmap’
• remember & pass the source info
Copyright 2021 FUJITSU LIMITED
File A
Extent I
copy
File B
* Introduced by Goldwyn Rodrigues: [PATCH] iomap: use a srcmapfor a read-modify-writeI/O
‘iomap’
• destination bno
• destination length
‘srcmap’
• source bno
• source length
34
Kernel: FSDAX / XFS
CoW
iomap interface – Fill ‘iomap’ & ‘srcmap’
◼ What is necessary
◼ Store source info in ‘srcmap’
• Only fill ‘iomap’ in ->iomap_begin()
• Implementedby filesystem
• Need to fill ‘srcmap’ for shared extent
◼ How?
◼ Add CoW branch
• Allocate new extent for CoW
• Fill ‘srcmap’ with destination extent
• Fill ‘iomap’ with new extent
• Set IOMAP_F_SHARED flag
Copyright 2021 FUJITSU LIMITED
get destination extent
if shared?
if HOLE?
Y
Y
N
allocate new extent
fill ‘iomap’ with new extent
set IOMAP_F_SHARED
iomap_begin()
xfs_direct_write_iomap_begin()
fill ‘srcmap’ with destination
35
Kernel: FSDAX
CoW
Support CoW - Add CoW for write()
◼ What is necessary
◼ Execute CoW in write() path
• Only write user data to destination
• Use ->direct_access() totranslate address
- From: offset in block device
- To: physical memory address in NVDIMM
• Need a pre-copy before writing
• src_addr: (translated) source address
• dest_addr: (translated) destination address
◼ How?
◼ Add CoW branch
• Copy source data from ‘srcmap’ to NVDIMM
Copyright 2021 FUJITSU LIMITED
Y
get srcmap
if CoW?
copy source data to dest_addr
->direct_access() => src_addr
N
write user data to dest_addr
actor()
dax_iomap_actor()
->direct_access() => dest_addr
from ‘iomap’
from ‘srcmap’
36
Kernel: FSDAX page fault
CoW
Support CoW - Add CoW for mmap()
Copyright 2021 FUJITSU LIMITED
Y
get srcmap
if CoW?
copy source data to dest_addr
->direct_access() => src_addr
N
associate VMA with PFN
->direct_access() => PFN & dest_addr
from ‘iomap’
from ‘srcmap’
◼ What is necessary
◼ Execute CoW in page fault
• FSDAX has its own specific PTE&PMDfault
• Use iomap framework
• Only find destinationPFN &associate VMA
• Need a pre-copy before associating
• PFN: destination PFN found by ->direct_access()
• src_addr: (translated) source address
• dest_addr: (translated) destination address
◼ How?
◼ Add CoW branch in PTE&PMD fault
• Copy source data before VMA is associated
• Prepare for writing user data in userspace
37
Kernel: FSDAX / XFS
iomap_end()
xfs_dax_write_iomap_end()
Support CoW - Remap extents after CoW
◼ What is necessary
◼ Remap extents for CoW
• The new extent not been mapped
• Metadata not been updated
• File won’t contain the CoW extent
• Need to update metadata after CoW
◼ How?
◼ Add remap in ->iomap_end()
• Do remap operation if is CoW
• Clean up if CoW operation fails
Copyright 2021 FUJITSU LIMITED
Y
if CoW?
remap extents
N
if error?
Y
clean up CoW extents
Nothing to do
N
38
Support dedupe - Add a ‘dax’ deduplication
◼ What is necessary
◼ Deduplicate DAX files
• Dedupe: reduce redundantdata on storage costs
• Require a dedupe function
• Only have generic dedupe function
➢ compare data in page caches
• Not adapted to FSDAX
➢ no page cache
➢ directly compare data in NVDIMM
◼ Need a new dax compare function
◼ How?
◼ Introduce a DAX compare function
• Compare data by memcmp()
Copyright 2021 FUJITSU LIMITED
File A
compared same
File B
Extent I Extent II
DAX compare function
Kernel: FSDAX / Dedupe
FileA
->direct_access() => a_addr
FileB
->direct_access() => b_addr
memcmp(a_addr, b_addr, len)
result: same or not
39
Improve the current NVDIMM-based Reverse
mapping
◼ Why Reverse mapping needs to be improved
◼ Struggling to solve this issue
◼ What must be implemented?
◼ How are they implemented?
Copyright 2021 FUJITSU LIMITED
40
NVDIMM
When Memory failure occurs
Copyright 2021 FUJITSU LIMITED
file A
process 1
process 2
process 3
process 4
0xA000
0xA001
0xA002
0xA004
0xA005
0xA003
1. track all processes associating with the broken page on NVDIMM
2. send signal to kill those associated processes
broken
41
Need to improve NVDIMM-based Reverse mapping
◼ currently only support 1-to-1 mapping
◼ reflink on fsdax requires 1-to-N mapping
Copyright 2021 FUJITSU LIMITED
NVDIMM
0xA000
0xA001
0xA002
0xA004
0xA005
file A
process 1
process 2
process 3
VMA
VMA
VMA
file B
process 4
process 5
VMA
VMA
->mapping = mappingA
->index= offsetA
file … …
…
0xA003
Not supportedyet
VMA
VMA
broken
42
Struggling to solve this issue
◼ First idea was simple, but it was bad idea
◼ Simply make rbtree for 1-to-N rmap
◼ Current strategy after some struggles
◼ Chase filesystem internal to find 1-to-N relationship
◼ What is difficulty of this way?
• Memory failure information is basically page unit.
But we need to find where it is in filesystem
• Filesystem may be created on partition, and/or LVM,
it affects relative offset in the filesystem
Copyright 2021 FUJITSU LIMITED
No! Cause of huge over head
see next page
First idea
mappingA
offsetA
mappingB
offsetB
mappingC
offsetC
rbroot
refcount
page->private
page->index
mappingA
indexA
page->mapping
page->index
NVDIMM
file A
0xA0400
43
Enhanced 1-to-N NVDIMM-basedRMAP
File2
File3
Introduce 1-to-N NVDIMM-based RMAP
Copyright 2021 FUJITSU LIMITED
PMEM
Middle Layers Middle Layers
XFS EXT4
File1
System Memory
…
memory
failure
#MCE (Machine Check Exception)
Sharing DAX file
ProcessA
Kill now
ProcessB
Kill later
1. MCE triggers
◼ memory-failure
2. 1-to-N RMAP through
◼ mm layer
◼ device driver
◼ block device layer
◼ filesystem layer
◼ files
◼ processes
3. Kill processes according to related file
◼ Kill current immediately
◼ Kill others later
1
2
3
44
System Memory
What must be implemented?
1. RMAP from NVDIMM driver to dax device
2. RMAP from dax device to filesystem
◼ Introduce dax_holder registration mechanism
◼ Types of holder
• Filesystem
• Partition
• Mapped Device
3. RMAP from filesystem to file
◼ Require rmapbt feature
◼ Improve process collection and killing for FSDAX
4. Compatibility for no-reflink / no-rmapbt filesystem
◼ e.g., EXT4 does not support reflink & rmapbt
Copyright 2021 FUJITSU LIMITED
PMEM
Middle Layers
XFS
Middle Layers
EXT4
File
…
RMAP through all layers
1
2
File2
File3
4
3
45
1. RMAP from NVDIMM driver to dax device
Copyright 2021 FUJITSU LIMITED
◼ Translate the PFN number into offset within a dax device
◼ PMEM driver (FSDAX [/dev/pmem0])
• Linear offset translation
◼ Dax driver (DEVDAX [/dev/dax0.0])
• Calculate according todax_ranges NVDIMM
0x00000
0xA0200
0xA0300
0xA0500
0xA0400
System Memory PFN NVDIMM offset
PMEM (/dev/pmem0)
dax device offset
0x00400
Dax (/dev/dax0.0)
…
…
…
…
0x00500
0x00600
0xA0600
0xA0700
memory-failure
occurs on PFN(0xA0400)
46
2. RMAP from PMEM to filesystem (1/3)
Copyright 2021 FUJITSU LIMITED
◼ PMEM (FSDAX) may be used in different ways
1. Filesystem: XFS, EXT4…
2. Partitions
3. Mapped Device: LVM…
4. Nested by Partitions or mapped Devices
◼ Introduce dax_holder registration mechanism
◼ The holder represents the inner layer of a PMEM
◼ Register when holder being mounted / initialized
◼ Interface for notifying memory-failure
• holder_ops->notify_failure()
PMEM
Partition1
PMEM
XFS filesystem
PMEM
VG
XFS
Partition2
EXT4
LV1 LV2
PMEM
Partition1 Partition2
VG
LV1
Usages of PMEM
47
2. RMAP from PMEM to filesystem (2/3)
1. Filesystem as a holder
◼ mkfs directly on a PMEM
• `mkfs.xfs /dev/pmem0`
◼ No partition in PMEM
◼ Need translation
• Remove the fixed bdev header length
2. Partition as a holder
◼ Parted by tools
◼ More than one partitions
◼ Need translation
• iterate each partition’s range to get the location
• remove the offset of the partition header
Copyright 2021 FUJITSU LIMITED
PMEM
Partition1 Partition2
XFS EXT4
PMEM
XFS filesystem
bdev header
bdev header
Partitionheader
48
2. RMAP from PMEM to filesystem (3/3)
Copyright 2021 FUJITSU LIMITED
3. Mapped device as a holder
◼ Created by LVM or other tools
◼ No partition, but one or many dm targets
◼ Types of dm targets
• Linear, Raid, Crypt…
◼ Introduce RMAP from dm target to inner layer
• Implement dm_target->rmap() in each target
• This is reversed progress of the exist dm_target->map()
◼ Need translation
• Iterate targets in a Mapped device
• Find out which target is the broken PMEM
• Remove dm_target offset for the inner layer
• Continue RMAPin inner layer
• The inner layer is also a holder (filesystem, partition…)
PMEM1 PMEM2
EXT4
XFS
LV1 LV2
VG
bdev header
dm_target
offset dm target 1 dm target 2
49
◼ Require filesystem has ‘rmapbt’ feature
◼ rmapbt: Given an offset and length, search for extents contains it
◼ XFS provides the query interface
• xfs_rmap_query_range()
◼ Search result could be
• File content
➢ Kill processes whoare usingthis file
➢ Try to recovery filedata according to XFS logdevice
• Filesystem metadata
➢ Hard to recovery online, shutdown filesystem and report error
◼ Improve process collection and killingfor FSDAX
◼ The original method is page-based
• one page indicate one file (1-to-1)
◼ Need tobe changed tofile-based
Copyright 2021 FUJITSU LIMITED
3. RMAP from filesystem to file
50
4. Compatibility for no-reflink / no-rmapbt filesystem
◼ Need to keep the original 1 page-to-1 file association
◼ Associate page->mapping & ->offset
• Establish the RMAP relationship with page and file
◼ Not only keep the relationship, but also avoid the error
• Currently it reports error if called more than once
• Make it associate only once and only for the first time
◼ Need to keep the original RMAP routine
◼ The page-based memory-failure handler still works with support of above
◼ Fall back if 1-to-N RMAP routine get ‘-EOPNOTSUPP’
Copyright 2021 FUJITSU LIMITED
NVDIMM
file A
0xA0400
Associate
51
Summary of the new 1-to-N RMAP solution
◼ 1-to-N RMAP has been implemented
◼ compatible for all NVDIMM modes
◼ compatible for all usage of PMEM
◼ compatible for all filesystems
◼ What is the next?
◼ Need to fix the race condition against unbind
• With the help of 1-to-N RMAP, this can be fixed
• My new code is basically for one page(PTE or PMD) of memory failure
• Unbind is likely “a wide range of memory failure”, then we hope my code will help such case
Copyright 2021 FUJITSU LIMITED
52
Conclusion
Copyright 2021 FUJITSU LIMITED
53
We talked about the followings
◼ Basis of NVDIMM for Linux
◼ Issues of Filesystem-DAX (Direct Access mode)
◼ Deep dive to solve issues of Filesystem-DAX
• Support reflink & dedupefor fsdax
• Fix NVDIMM-based Reverse mapping
Copyright 2021 FUJITSU LIMITED
• Community has made many enhancement for NVDIMM on Linux
• We have worked for NVDIMM to remove experimental statusof Filesystem DAX
• We hope it will be achieved as soon as possible
54
Copyright 2021 FUJITSU LIMITED
55
Appendix
Copyright 2021 FUJITSU LIMITED
56
How to use NVDIMM in Linux
◼ Make region
◼ You can configure it on BIOS screen (or ipmctl command for Intel DCPMM)
• Region is createdby hardware (memory controller)
◼ You need to reboot to enable the created region
◼ Make namespace
◼ Namespace is similar concept against SCSI LUN
◼ You can configure it by ndctl command
◼ Format Filesystem (storage or Filesystem DAX)
◼ If you would like to use Filesystem DAX, you need to select ext4 or xfs which support Filesystem DAX
• Currently, if you select xfs, reflink option must bedisabled
◼ Mount Filesystem (storage or Filesystem DAX) with –o dax option
◼ Make pool (Filesystem DAX or Device DAX) by the PMDK tool if necessary
Copyright 2021 FUJITSU LIMITED
57
Users of NVDIMM at kernel layer
◼ dm-writecache
◼ Persistent cache for write at device mapper layer
• dm-cache can use SSD as cache, dm-writecache can use NVDIMM
• This presentation is helpful
• https://github.com/ChinaLinuxKernel/CLK2019/blob/master/clk2019-dm-writecache-04.pdf
◼ Kernel shows Device dax as DRAM
◼ Though Intel DCPMM has “Memory Mode” which user can use it as huge RAM, it has some pains
• Size of DRAM disappears, because it becomes just cache of NVDIMM
• Software cannot choose area between DRAM or DCPMM
◼ In this kernel feature, system RAM size becomes sums of DRAM and NVDIMM
• Use App Direct Mode and Device DAX namespace
• Use Memory Hot-add Device DAX area
• Becomes NUMA node
Copyright 2021 FUJITSU LIMITED
58
For log file
libpmemlog
For general purpose
pmem allocation with
transaction
libpmemobj
To create
samesizeblocks and
atomicallyupdate
libpmemblk
Transaction
support
PMDK (1/2)
◼ Set of libraries and tools for Filesystem DAX and Device DAX
Copyright 2021 FUJITSU LIMITED
Support for volatile
memoryusage
libmemkind
libvmemcache
For local
persistent memory
For remote access to
persistent memory
libpmem2 librpmem
Low-level support
C C++ PCJ Python
LLPJ
Language bindings InDevelopment:
PCJ – Persistent Collectionfor Java
LLPJ – Low-Level Persistent Java Library
Python bindings
libpmemkv
C C++ Java JS Ruby
59
Tools
back up
device dax
namespace
daxio
pool
management
pmempool
• Not only Linux, but also
you can use PMDK on
Windows Note
• Not all components areincludedin this figure
• To bepreciselibmemkindis not memberofPMDK, but it is recommendedlibrary
libpmem
59
For log file
libpmemlog
For general purpose
pmem allocation with
transaction
libpmemobj
To create
samesizeblocks and
atomicallyupdate
libpmemblk
Transaction
support
PMDK (2/2)
◼ Typical libraries and tools
Copyright 2021 FUJITSU LIMITED
Support for volatile
memoryusage
libmemkind
libvmemcache
For local
persistent memory
For remote access to
persistent memory
libpmem2 librpmem
Low-level support
C C++ PCJ Python
LLPJ
Language bindings InDevelopment:
PCJ – Persistent Collectionfor Java
LLPJ – Low-Level Persistent Java Library
Python bindings
libpmemkv
C C++ Java JS Ruby
60
Tools
back up
device dax
namespace
daxio
pool
management
pmempool
libpmem
• The low layer library
• Call mmap() to use NVDIMM, call
suitable cpu cache flush
instruction, etc…
• Libpmem2 is newer library.
Though it has some new feature,
API is different from old libpmem
60
For log file
libpmemlog
For general purpose
pmem allocation with
transaction
libpmemobj
To create
samesizeblocks and
atomicallyupdate
libpmemblk
Transaction
support
PMDK (2/2)
◼ Typical libraries and tools
Copyright 2021 FUJITSU LIMITED
Support for volatile
memoryusage
libmemkind
libvmemcache
For local
persistent memory
For remote access to
persistent memory
libpmem2 librpmem
Low-level support
C C++ PCJ Python
LLPJ
Language bindings InDevelopment:
PCJ – Persistent Collectionfor Java
LLPJ – Low-Level Persistent Java Library
Python bindings
libpmemkv
C C++ Java JS Ruby
61
Tools
back up
device dax
namespace
daxio
pool
management
pmempool
libpmem
• The high layer library which supports transactionof the objects on DAX
• For general use-case
• This is highly recommended library in PMDK
• Users need to understand how to use its transaction
61
For log file
libpmemlog
For general purpose
pmem allocation with
transaction
libpmemobj
To create
samesizeblocks and
atomicallyupdate
libpmemblk
Transaction
support
PMDK (2/2)
◼ Typical libraries and tools
Copyright 2021 FUJITSU LIMITED
Support for volatile
memoryusage
libmemkind
libvmemcache
For local
persistent memory
For remote access to
persistent memory
libpmem2 librpmem
Low-level support
C C++ PCJ Python
LLPJ
Language bindings InDevelopment:
PCJ – Persistent Collectionfor Java
LLPJ – Low-Level Persistent Java Library
Python bindings
libpmemkv
C C++ Java JS Ruby
62
Tools
back up
device dax
namespace
daxio
pool
management
pmempool
libpmem
• /dev/dax (device DAX)is
character device. Then you
cannot use dd for backup
• daxiocommand is
providedinstead of it
62
What is new of libpmem2
◼ New low-layer library
• Introduce new concept “GRANULARITY”
• PMEM2_GRANULARITY_PAGE : for traditionalSSD/HDD
• PMEM2_GRANULARITY_CACHELINE : for persistent memory (the case for process needs flush cache to
make persistency)
• PMEM2_GRANULARITY_BYTE : for persistent memory (the case for platform support cpu cache
persistency)
• Introduce new functions to get unsafe shutdown status and bad block
• This library uses library of ndctl command internally toget these information
◼ Its interfaceis different from old libpmem
Copyright 2020 FUJITSU LIMITED
63
Library for RDMA (1/2)
◼ I think RDMA is becoming important for NVDIMM
◼ DAX offers direct access method for local NVDIMM
• However, modern system is a set of many computers which is connected by network
• So, remote access is also important
◼ Traditional network stacks is too heavy to access remote NVDIMM
◼ Use case
• For scalability
• Ex) Distributed filesystem, Key Value Store, etc….
• For make data replication
• Replace of NVDIMM module is difficult (as I talked at China Linux Kernel Developer Conference)
https://www.slideshare.net/ygotokernel/the-ideal-and-reality-of-nvdimm-ras-newer-version
• Etc.
Copyright 2021 FUJITSU LIMITED
RDMA is a good way to skip some redundant processing to access remote NVDIMM
64
Library for RDMA (2/2)
◼ librpma
◼ The 2nd
library for RDMA
• The 1st
library is librpmem of
PMDK, but it is experimental
• Not popular with users
• Librpma has been developed
with user’s requirement
◼ Characteristics
• Relatively easier interface than
libibverbs
• Consideration of making
persistency for remote NVDIMM
• Though hardware can return ack before write
completion, user can confirm it with librpma
• See: https://www.openfabrics.org/wp-content/uploads/2020-workshop-presentations/202.-gromadzki-ofa-workshop-2020.pdf
(*) The above table isquoted from this presentation
Copyright 2021 FUJITSU LIMITED
65

More Related Content

What's hot

30分でRHEL6 High Availability Add-Onを超絶的に理解しよう!
30分でRHEL6 High Availability Add-Onを超絶的に理解しよう!30分でRHEL6 High Availability Add-Onを超絶的に理解しよう!
30分でRHEL6 High Availability Add-Onを超絶的に理解しよう!
Etsuji Nakai
 
Interrupt Affinityについて
Interrupt AffinityについてInterrupt Affinityについて
Interrupt Affinityについて
Takuya ASADA
 

What's hot (20)

オンライン物理バックアップの排他モードと非排他モードについて(第15回PostgreSQLアンカンファレンス@オンライン 発表資料)
オンライン物理バックアップの排他モードと非排他モードについて(第15回PostgreSQLアンカンファレンス@オンライン 発表資料)オンライン物理バックアップの排他モードと非排他モードについて(第15回PostgreSQLアンカンファレンス@オンライン 発表資料)
オンライン物理バックアップの排他モードと非排他モードについて(第15回PostgreSQLアンカンファレンス@オンライン 発表資料)
 
Kubernetes雑にまとめてみた 2020年8月版
Kubernetes雑にまとめてみた 2020年8月版Kubernetes雑にまとめてみた 2020年8月版
Kubernetes雑にまとめてみた 2020年8月版
 
PostgreSQLのリカバリ超入門(もしくはWAL、CHECKPOINT、オンラインバックアップの仕組み)
PostgreSQLのリカバリ超入門(もしくはWAL、CHECKPOINT、オンラインバックアップの仕組み)PostgreSQLのリカバリ超入門(もしくはWAL、CHECKPOINT、オンラインバックアップの仕組み)
PostgreSQLのリカバリ超入門(もしくはWAL、CHECKPOINT、オンラインバックアップの仕組み)
 
PostgreSQLの関数属性を知ろう
PostgreSQLの関数属性を知ろうPostgreSQLの関数属性を知ろう
PostgreSQLの関数属性を知ろう
 
Memoizeの仕組み(第41回PostgreSQLアンカンファレンス@オンライン 発表資料)
Memoizeの仕組み(第41回PostgreSQLアンカンファレンス@オンライン 発表資料)Memoizeの仕組み(第41回PostgreSQLアンカンファレンス@オンライン 発表資料)
Memoizeの仕組み(第41回PostgreSQLアンカンファレンス@オンライン 発表資料)
 
いまさら聞けないPostgreSQL運用管理
いまさら聞けないPostgreSQL運用管理いまさら聞けないPostgreSQL運用管理
いまさら聞けないPostgreSQL運用管理
 
PostgreSQL DBのバックアップを一元化しよう
PostgreSQL DBのバックアップを一元化しようPostgreSQL DBのバックアップを一元化しよう
PostgreSQL DBのバックアップを一元化しよう
 
NEDIA_SNIA_CXL_講演資料.pdf
NEDIA_SNIA_CXL_講演資料.pdfNEDIA_SNIA_CXL_講演資料.pdf
NEDIA_SNIA_CXL_講演資料.pdf
 
30分でRHEL6 High Availability Add-Onを超絶的に理解しよう!
30分でRHEL6 High Availability Add-Onを超絶的に理解しよう!30分でRHEL6 High Availability Add-Onを超絶的に理解しよう!
30分でRHEL6 High Availability Add-Onを超絶的に理解しよう!
 
限界性能試験を自動化するOperatorを作ってみた(Kubernetes Novice Tokyo #14 発表資料)
限界性能試験を自動化するOperatorを作ってみた(Kubernetes Novice Tokyo #14 発表資料)限界性能試験を自動化するOperatorを作ってみた(Kubernetes Novice Tokyo #14 発表資料)
限界性能試験を自動化するOperatorを作ってみた(Kubernetes Novice Tokyo #14 発表資料)
 
押さえておきたい、PostgreSQL 13 の新機能!!(Open Source Conference 2021 Online/Hokkaido 発表資料)
押さえておきたい、PostgreSQL 13 の新機能!!(Open Source Conference 2021 Online/Hokkaido 発表資料)押さえておきたい、PostgreSQL 13 の新機能!!(Open Source Conference 2021 Online/Hokkaido 発表資料)
押さえておきたい、PostgreSQL 13 の新機能!!(Open Source Conference 2021 Online/Hokkaido 発表資料)
 
The Forefront of the Development for NVDIMM on Linux Kernel
The Forefront of the Development for NVDIMM on Linux KernelThe Forefront of the Development for NVDIMM on Linux Kernel
The Forefront of the Development for NVDIMM on Linux Kernel
 
Yahoo! JAPANのOracle構成-2017年版
Yahoo! JAPANのOracle構成-2017年版Yahoo! JAPANのOracle構成-2017年版
Yahoo! JAPANのOracle構成-2017年版
 
速習!論理レプリケーション ~基礎から最新動向まで~(PostgreSQL Conference Japan 2022 発表資料)
速習!論理レプリケーション ~基礎から最新動向まで~(PostgreSQL Conference Japan 2022 発表資料)速習!論理レプリケーション ~基礎から最新動向まで~(PostgreSQL Conference Japan 2022 発表資料)
速習!論理レプリケーション ~基礎から最新動向まで~(PostgreSQL Conference Japan 2022 発表資料)
 
PostgreSQL Query Cache - "pqc"
PostgreSQL Query Cache - "pqc"PostgreSQL Query Cache - "pqc"
PostgreSQL Query Cache - "pqc"
 
不揮発性メモリ(PMEM)を利用したストレージエンジンの話 #mysql_jp #myna会 #yahoo #mysql #pmem #不揮発性メモリ
不揮発性メモリ(PMEM)を利用したストレージエンジンの話  #mysql_jp #myna会 #yahoo #mysql #pmem #不揮発性メモリ不揮発性メモリ(PMEM)を利用したストレージエンジンの話  #mysql_jp #myna会 #yahoo #mysql #pmem #不揮発性メモリ
不揮発性メモリ(PMEM)を利用したストレージエンジンの話 #mysql_jp #myna会 #yahoo #mysql #pmem #不揮発性メモリ
 
統計情報のリセットによるautovacuumへの影響について(第39回PostgreSQLアンカンファレンス@オンライン 発表資料)
統計情報のリセットによるautovacuumへの影響について(第39回PostgreSQLアンカンファレンス@オンライン 発表資料)統計情報のリセットによるautovacuumへの影響について(第39回PostgreSQLアンカンファレンス@オンライン 発表資料)
統計情報のリセットによるautovacuumへの影響について(第39回PostgreSQLアンカンファレンス@オンライン 発表資料)
 
今、改めて考えるPostgreSQLプラットフォーム - マルチクラウドとポータビリティ -(PostgreSQL Conference Japan 20...
今、改めて考えるPostgreSQLプラットフォーム - マルチクラウドとポータビリティ -(PostgreSQL Conference Japan 20...今、改めて考えるPostgreSQLプラットフォーム - マルチクラウドとポータビリティ -(PostgreSQL Conference Japan 20...
今、改めて考えるPostgreSQLプラットフォーム - マルチクラウドとポータビリティ -(PostgreSQL Conference Japan 20...
 
Interrupt Affinityについて
Interrupt AffinityについてInterrupt Affinityについて
Interrupt Affinityについて
 
PostgreSQL初心者がパッチを提案してからコミットされるまで(第20回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQL初心者がパッチを提案してからコミットされるまで(第20回PostgreSQLアンカンファレンス@オンライン 発表資料)PostgreSQL初心者がパッチを提案してからコミットされるまで(第20回PostgreSQLアンカンファレンス@オンライン 発表資料)
PostgreSQL初心者がパッチを提案してからコミットされるまで(第20回PostgreSQLアンカンファレンス@オンライン 発表資料)
 

Similar to The Forefront of the Development for NVDIMM on Linux Kernel (Linux Plumbers conf. ver.)

We4IT lcty 2013 - infra-man - domino run faster
We4IT lcty 2013 - infra-man - domino run faster We4IT lcty 2013 - infra-man - domino run faster
We4IT lcty 2013 - infra-man - domino run faster
We4IT Group
 
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and HadoopIOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
Leons Petražickis
 
Data Sheet Of Dc
Data Sheet Of DcData Sheet Of Dc
Data Sheet Of Dc
guest679f05
 

Similar to The Forefront of the Development for NVDIMM on Linux Kernel (Linux Plumbers conf. ver.) (20)

The ideal and reality of NVDIMM RAS
The ideal and reality of NVDIMM RASThe ideal and reality of NVDIMM RAS
The ideal and reality of NVDIMM RAS
 
Tuning DB2 in a Solaris Environment
Tuning DB2 in a Solaris EnvironmentTuning DB2 in a Solaris Environment
Tuning DB2 in a Solaris Environment
 
We4IT lcty 2013 - infra-man - domino run faster
We4IT lcty 2013 - infra-man - domino run faster We4IT lcty 2013 - infra-man - domino run faster
We4IT lcty 2013 - infra-man - domino run faster
 
RNUG 2020: HCL Notes 11.0.1 FP2 - Performance Boost Re-Reloaded
RNUG 2020: HCL Notes 11.0.1 FP2 - Performance Boost Re-ReloadedRNUG 2020: HCL Notes 11.0.1 FP2 - Performance Boost Re-Reloaded
RNUG 2020: HCL Notes 11.0.1 FP2 - Performance Boost Re-Reloaded
 
RNUG - HCL Notes 11.0.1 FP2 — Performance Boost Re-Reloaded
RNUG - HCL Notes 11.0.1 FP2 — Performance Boost Re-ReloadedRNUG - HCL Notes 11.0.1 FP2 — Performance Boost Re-Reloaded
RNUG - HCL Notes 11.0.1 FP2 — Performance Boost Re-Reloaded
 
Linux Huge Pages
Linux Huge PagesLinux Huge Pages
Linux Huge Pages
 
AdminCamp 2018 - IBM Notes V10 Performance Boost
AdminCamp 2018 - IBM Notes V10 Performance BoostAdminCamp 2018 - IBM Notes V10 Performance Boost
AdminCamp 2018 - IBM Notes V10 Performance Boost
 
VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...
VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...
VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne...
 
ICONUK 2018 - IBM Notes V10 Performance Boost
ICONUK 2018 - IBM Notes V10 Performance BoostICONUK 2018 - IBM Notes V10 Performance Boost
ICONUK 2018 - IBM Notes V10 Performance Boost
 
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and HadoopIOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
 
Docker and the Linux Kernel
Docker and the Linux KernelDocker and the Linux Kernel
Docker and the Linux Kernel
 
CollabSphere 2020 - INF105 - HCL Notes 11.0.1 FP1 - Performance Boost Re-Relo...
CollabSphere 2020 - INF105 - HCL Notes 11.0.1 FP1 - Performance Boost Re-Relo...CollabSphere 2020 - INF105 - HCL Notes 11.0.1 FP1 - Performance Boost Re-Relo...
CollabSphere 2020 - INF105 - HCL Notes 11.0.1 FP1 - Performance Boost Re-Relo...
 
CollabSphere 2020 Live - HCL Notes 11.0.1 FP1 - Performance Boost Re-Reloaded
CollabSphere 2020 Live - HCL Notes 11.0.1 FP1 - Performance Boost Re-ReloadedCollabSphere 2020 Live - HCL Notes 11.0.1 FP1 - Performance Boost Re-Reloaded
CollabSphere 2020 Live - HCL Notes 11.0.1 FP1 - Performance Boost Re-Reloaded
 
2007-05-23 Cecchet_PGCon2007.ppt
2007-05-23 Cecchet_PGCon2007.ppt2007-05-23 Cecchet_PGCon2007.ppt
2007-05-23 Cecchet_PGCon2007.ppt
 
DNUG 2015 - Notes Browser Clients, Client Upgrades und beste Startzeiten!
DNUG 2015 - Notes Browser Clients, Client Upgrades und beste Startzeiten!DNUG 2015 - Notes Browser Clients, Client Upgrades und beste Startzeiten!
DNUG 2015 - Notes Browser Clients, Client Upgrades und beste Startzeiten!
 
Data Sheet Of Dc
Data Sheet Of DcData Sheet Of Dc
Data Sheet Of Dc
 
Data Sheet Of Dc
Data Sheet Of DcData Sheet Of Dc
Data Sheet Of Dc
 
Dlm ppt
Dlm pptDlm ppt
Dlm ppt
 
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...
VMworld 2015: The Future of Software- Defined Storage- What Does it Look Like...
 
Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)Hadoop Operations for Production Systems (Strata NYC)
Hadoop Operations for Production Systems (Strata NYC)
 

Recently uploaded

JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)
Max Lee
 
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfMastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
mbmh111980
 

Recently uploaded (20)

how-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdfhow-to-download-files-safely-from-the-internet.pdf
how-to-download-files-safely-from-the-internet.pdf
 
Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024
 
CompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdfCompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdf
 
JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)JustNaik Solution Deck (stage bus sector)
JustNaik Solution Deck (stage bus sector)
 
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
 
5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
 
IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024
 
INGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by DesignINGKA DIGITAL: Linked Metadata by Design
INGKA DIGITAL: Linked Metadata by Design
 
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
Tree in the Forest - Managing Details in BDD Scenarios (live2test 2024)
 
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdfMastering Windows 7 A Comprehensive Guide for Power Users .pdf
Mastering Windows 7 A Comprehensive Guide for Power Users .pdf
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
 
Agnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in KrakówAgnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in Kraków
 
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product UpdatesGraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
 
SQL Injection Introduction and Prevention
SQL Injection Introduction and PreventionSQL Injection Introduction and Prevention
SQL Injection Introduction and Prevention
 
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
KLARNA -  Language Models and Knowledge Graphs: A Systems ApproachKLARNA -  Language Models and Knowledge Graphs: A Systems Approach
KLARNA - Language Models and Knowledge Graphs: A Systems Approach
 
The Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion ProductionThe Impact of PLM Software on Fashion Production
The Impact of PLM Software on Fashion Production
 
How to pick right visual testing tool.pdf
How to pick right visual testing tool.pdfHow to pick right visual testing tool.pdf
How to pick right visual testing tool.pdf
 
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfA Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
 
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
 

The Forefront of the Development for NVDIMM on Linux Kernel (Linux Plumbers conf. ver.)

  • 1. The Forefront of the Development for NVDIMM on Linux Kernel (Linux Plumbers conf. ver.) 2021/Sep/23 Yasunori Goto (Fujitsu Limited.) Ruan Shiyang (Nanjing Fujitsu Nanda Software Technology Co., Ltd.) Copyright 2021 FUJITSU LIMITED 0
  • 2. Agenda ◼ Summary of current status of developmentfor NVDIMM (Yasunori) ◼ Basis of NVDIMM on Linux ◼ Issues of Filesystem-DAX (Direct Access mode) ◼ Deep dive to solve the issues of Filesystem-DAX(Ruan) ◼ Support reflink & dedupe for fsdax ◼ Fix NVDIMM-based Reverse mapping ◼ Conclusion Copyright 2021 FUJITSU LIMITED 1
  • 3. Self introduction ◼ Yasunori Goto ◼ I have worked for Linux and related OSS since 2002 • Development for Memory hotplug feature of Linux Kernel • Technical Support for troubles of Linux Kernel • etc. ◼ Currently, leader of Fujitsu Linux Kernel Developmentteam ◼ In the last few years, I have mainly worked for NVDIMM • some enhancement for RAS of NVDIMM • For Fault location feature • For Fault predictionfeature “The ideal and reality of NVDIMM RAS” https://www.slideshare.net/ygotokernel/the-ideal-and-reality-of-nvdimm-ras-newer-version Copyright 2021 FUJITSU LIMITED 2
  • 4. Basis of NVDIMM on Linux Copyright 2021 FUJITSU LIMITED 3
  • 5. Characteristics of Non-Volatile DIMM (NVDIMM) ◼ Persistent memory device which can be inserted to DIMM slot like DRAM ◼ CPU can read/write NVDIMM directly ◼ It can keep data persistency even if system is powered down or rebooted ◼ Latency, capacity, and cost have characteristics between DRAM and NVMe ◼ Use case ◼ Example • In memory Database • Hierarchical storage, distributedstorage • Key-Value-store ◼ Famous Product ◼ Intel Data Center Persistent Memory Module (DCPMM) Copyright 2021 FUJITSU LIMITED CPU cache DRAM NVDIMM NVMe SSD (SAS/SATA) HDD Capacity Latency cost 4
  • 6. Impact of NVDIMM Copyright 2021 FUJITSU LIMITED Page cache System call (Switch to kernel mode) Sync()/fsync()/msync() It was created for SLOW I/O storage, but NVDIMM is fast enough without page cache If user process calls cpu flush instruction, then it is enough to make persistency • Application can read/write to NVDIMM directly • Even system call may be waste of time due to switching between kernel mode and user mode Traditional I/O layer becomes redundant for NVDIMM New interface is expected for NVDIMM! 5
  • 7. ◼ Many software assumes that memory is VOLATILE yet NVDIMM is difficult for traditional software Copyright 2021 FUJITSU LIMITED • In older CPU, its cache is still volatile • If system power down suddenly,then some data may not be stored Need to prepare for sudden powerdown • Should not change data structures in NVDIMM • If the structures are changed, software update will be cause of disaster • If the data is broken, software need todetect it and correct it • Software need to assign not only free area, but also used area for reuse its data • In addition, kernel must assign the area to suitable process with authority check Need data structure compatibility on NVDIMM Need to detect / correct collapsing data Need data area management What will be necessary for a software to use NVDIMM? 6
  • 8. Confliction of requirements Copyright 2021 FUJITSU LIMITED Filesystem is still useful Filesystem is too slow • Traditional I/O stack is too heavy • CPU cache flush is enough to make persistency • Need new access interface to NVDIMM for next generation software • Filesystem gives many solution for the previous considerations • It is useful for current software • Format compatibility of filesystem • Data correction • Journaling, or CoW • region management • authority check • etc... • call system call • page cache • block device layer • etc… 7
  • 9. Interfaces of NVDIMM (1/3) ◼ Because of the previous reasons, Linux provides some interfaces for application ◼ Storage Access(green) • Applicationcan access NVDIMM with traditional I/O IFlike SSD/HDD • So, application can use this mode without any modification ◼ Filesystem DAX(blue) • Page cache is skippedwhen you use read()/write() on Filesystem-DAX • Applicationcan access NVDIMM area directly if it calls mmap() for a file • Need filesystem support • Xfs, ext4… • This mode is suitable for modifying current applications for NVDIMM. Copyright 2021 FUJITSU LIMITED DAX enabled filesystem NVDIMM user layer kernel layer Pmem driver device dax driver /dev/dax New App with Device DAX BTT driver Traditional filesystem Traditional App New App with Filesystem DAX PMDK management command sysfs ACPI driver ACPI 8
  • 10. Interfaces of NVDIMM (2/3) ◼ Because of the previous reasons, Linux provides some interfaces for application ◼ Device DAX(red) • Applicationcan access NVDIMM area directly if it calls mmap() for /dev/dax • /dev/dax allows only open(), mmap(), and close() • IOW, you can not use read()/write() nor any other system call • For innovative newapplicationwith NVDIMM ◼ PMDK(purple) is provided • Set of convenient libraries and tools for filesystem DAX and Device DAX • Transaction support for pmem applications • Pool management in the DAX file/device • Etc. Copyright 2021 FUJITSU LIMITED DAX enabled filesystem NVDIMM user layer kernel layer Pmem driver device dax driver /dev/dax New App with Device DAX BTT driver Traditional filesystem Traditional App New App with Filesystem DAX PMDK management command sysfs ACPI driver ACPI 9
  • 11. Interfaces of NVDIMM(3/3) ◼ NVDIMM is shown as device files like storage ◼ For Storage Access : /dev/pmem##s (# means number) ◼ Filesystem DAX: /dev/pmem## ◼ Device DAX: /dev/dax##.# ◼ ndctl(*) can create these device when it creates namespace ◼ Example of Filesystem DAX ◼ You can make filesystem on /dev/pmem##s or /dev/pmem## (*) a set of tools/commands for NVDIMM ◼ Note: /dev/dax##.# is character device ◼ Since you cannot use read()/write() for /dev/dax##.#, you cannot use dd command for backup ◼ You need to daxio command of PMDK instead of it Copyright 2021 FUJITSU LIMITED $ sudo ndctl create-namespace -e "namespace0.0" -m fsdax -f { "dev":"namespace0.0", "mode":"memory", "size":"5.90 GiB (6.34 GB)", "uuid":"dc47d0d7-7d8f-473e-9db4-1c2e473dbc8f", "blockdev":"pmem0", "numa_node":0 } $ sudo ls /dev/pmem* /dev/pmem0 10
  • 12. Filesystem-DAX is still experimental status Copyright 2021 FUJITSU LIMITED • The management way of NVDIMM is almost same with traditionalfilesystem • Operator can use traditional command tomanage NVDIMM area • Not only application can access NVDIMM area directly, but alsoit can use traditional systemcall Filesystem-DAX is very expected interface… • The “experimental” message is shown when the filesystem is mountedwith DAX option • There are difficult issues in kernel layer for some years But it is still experimental…. • In contrast, Device DAX requires pool management by tools of PMDK • Otherwise, a software need toposses whole of the namespace (/dev/dax) • In addition, Application can NOT use many systemcall in Device DAX What is the reason? 11
  • 13. Issues of Filesystem-DAX ◼ Whatis solved, and whatis current issues Copyright 2021 FUJITSU LIMITED 12
  • 14. Why Filesystem-DAX is experimental? ◼ In summary, there are 2 big reasons Copyright 2021 FUJITSU LIMITED Filesystem DAX combines storage and memory characteristics • This causes corner-case issues of Filesystem-DAX • They are often difficult problem More additional features were required, but they are/were difficult to make • Configure DAX on/off for each inode (directory or file) • Co-existence with CoW filesystem 13
  • 15. Corner Case Issue 1 : Update metadata(1/3) Copyright 2021 FUJITSU LIMITED The first problem is “updating metadata” of the file metadata data In Filesystem-DAX, we expected that application can make persistency of data with only cpu cache flush as I said…. However, this also means there is no chance to updatemetadata by kernel/filesystem file • Update time of the file may not be correct • If an application use write some data to file on the filesystem DAX, and a user remove some blocks of the file by truncate(2), kernel cannot negotiate it • Data of the file may be lost • If data transferred by DMA/RDMA to the page which is allocated as filesystem DAX, similar problem may occur 14
  • 16. Corner Case Issue 1 : Update metadata(2/3) ◼ Current Status of update metadata problems Copyright 2021 FUJITSU LIMITED • Solved by introducing new MAP_SYNC flag of mmap() • Page fault occurs every write access, then kernel can update meta data • PMDK specifies this flag • Solved by waiting truncate() until finishing RMDA • Not solved • Truncate() can not wait the completion of transfer, because it may too long time General write access DMA/RDMA data access Kernel/driver layer user process layer (E.g. infiniband, video(v4l2)) 15
  • 17. Corner Case Issue 1 : Update metadata(3/3) ◼ Current Status of update metadata problems Copyright 2021 FUJITSU LIMITED • Sloved by introducing new MAP_SYNC flag of mmap() • Page fault when write access, then kernel update meta data • PMDK specifies this flag • Solved by waiting truncate() until finishing RMDA • Not solved • Truncate() can not wait the completion of transfer, because it may too long time General write access DMA/RDMA data access Kernel/driver layer user process layer (E.g. infiniband, video(v4l2)) • In ODP, usually driver/hardware does not map the pages of DMA/RDMA area for application • It maps the pages when application accesses them • Kernel/driver can coordinate metadata at the time • Mellanox(NVIDIA) newer card has the feature Workaround: “On Demand Paging(ODP)” 16
  • 18. Corner Case Issue 2: unbind(1/2) ◼ Unbind is a sysfs interface to disconnect / hot-remove a device ◼ Each device driver provides its handler for it ◼ Though NVDIMM is not hotplug device physically, its interface can be used to disable and switch the mode of NVDIMM namespace • Ex) • To change namespace mode from Filesystem-DAX toDevice-DAX • To allowthat users can NVDIMM like normal RAM • Etc. Copyright 2021 FUJITSU LIMITED Example of“how to use Device DAX namespace like a normal RAM” 1) Remove the device from “Device DAX”infrastructure # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/remove_id # echo -n dax0.0 > /sys/bus/dax/drivers/device_dax/unbind 2) Bind it to kmem driver # echo -n dax0.0 > /sys/bus/dax/drivers/kmem/new_id # echo -n dax0.0 > /sys/bus/dax/drivers/kmem/bind 17
  • 19. Corner Case Issue 2: unbind(2/2) ◼ Unbind is likely “surprising remove” interface ◼ There is no way to fail of unbind even if a user is using it • It must be disabled forcibly https://lore.kernel.org/linux-btrfs/CAPcyv4g3ZwbdLFx8bqMcNvXyrob8y6sBXXu=xPTmTY0VSk5HCw@mail.gmail.com/ ◼ To solve this problem, Filesystem-dax needs to disable a range of NVDIMM area immediately ◼ Currently, this is not solved yet ◼ It will be solved after the end of Ruan's work which will be talked by him today • His new code will help tosolve it Copyright 2021 FUJITSU LIMITED A race condition was reported between Filesystem-dax and unbind in 2021/Feb 18
  • 20. DAX on/off for each inode (directory or file) (1/3) ◼ Expected use-cases Copyright 2021 FUJITSU LIMITED Need more fine-grain settings Change DAX attribute by application Performance tuning Workaround when Filesystem-DAX has a bug Configuration is always painful for administrator If application can detect and change it, it will be helpful for them Users may want to change the DAX mode depending on each file Since the write latency of NVDIMM is a bit slower than RAM, User may want to use page cache by DAX off setting 19
  • 21. DAX on/off for each inode (directory or file) (2/3) ◼ What was the problem ◼ If filesystem change DAX attribute, filesystem need to change methods of filesystem between DAX and normal file, but they may be executed yet ◼ Data of the page cache must be moved silently when the dax attribute becomes off • These problems were very difficult Copyright 2021 FUJITSU LIMITED Page caches On DRAM xfs_vm_writepages() xfs_iomap_set_page_dirty() : : Methods for DAX file Methods for normal file Need change xfs_dax_writepages() noop_set_page_dirty() : : struct address space inode struct address space inode Nopage cache Normal file DAX file 20
  • 22. DAX on/off for each inode (directory or file) (3/3) ◼ Fortunately, this issue was solved ◼ The DAX attribute is changed only when its inode cache is NOT loaded on memory • Filesystem can load suitable methods for each attribute when it reloads inode tomemory • Page caches of the file are alsodropped • Users can use this feature with the newmount option. #mount … -o dax=inode • The DAX attribute is changedby command ◼ Note • All of applications which use the target file must close it to change the dax attribute • Filesystem will postpone changing the DAX attribute until dropping inode cache and page cache of the file Copyright 2021 FUJITSU LIMITED Normal file DAX file drop inode cache drop inode cache reload reload DAX on : $xfs_io -c 'chattr +x’ <file or directory> DAX off: $xfs_io -c 'chattr -x’<file or directory> 21
  • 23. Coexisting with CoW (reflink/dedup) filesystem (1/2) ◼ The Copy on Write feature of filesystem (xfs:reflink/dedup, btrfs) ◼ If there is a same data block on different files, filesystem can merge it as same block ◼ So far, if only filesystem manages such block, it was enough • Since a page cache is allocated for each file of the block, memory management layer don’t need toknowit ◼ In Filesystem-DAX, it becomes problems • Merged block equals mergedmemory itself, it affects the memory failure case Copyright 2021 FUJITSU LIMITED File A File B offset 100 offset 200 22
  • 24. Coexisting with CoW (reflink/dedup) filesystem (2/2) ◼ Problems Copyright 2021 FUJITSU LIMITED Ruan-san will talk how to solve them • Currently, there is no implementation of reflink/dedupefor Filesystem-DAX • Iomap, which is newer io block layer instead of buffer_head, has interface for CoW filesystem • XFS filesystem DAX also uses iomap • But there is no code to use CoW and DAX at the same time Need actual CoW implementation for FilesystemDAX • When a memory failure occurs, need to kill processes which use the memory • To achieve it, kernel need to find all processes from the merged page/block • But a merged page has only one struct page • No space for plural files in it Needs to chase plural files from a merged page/block 23
  • 25. Deep dive to solve the issue of Filesystem-DAX ◼ Supportreflink/dedupe for FSDAX* ◼ ImproveNVDIMM-basedReverse mapping * Filesystem-DAX Copyright 2021 FUJITSU LIMITED 24
  • 26. Self-introduction ◼ Ruan Shiyang ◼ A Software Engineer of Fujitsu Nanda ◼ Experiencein Embedded development ◼ Currently focusing on Linux filesystem and persistentmemory Copyright 2021 FUJITSU LIMITED 25
  • 27. Create a XFS (reflink is enabled by default) and mount it with “dax” option Then error occurs The dmesg shows the reason Background ◼ fsdax is “EXPERIMENTAL” ◼ reflink and fsdax cannot work together Copyright 2021 FUJITSU LIMITED $ mkfs.xfs /dev/pmem0 && mount –o dax /dev/pmem0 /mnt mount: /mnt: wrong fs type, bad option, bad superblock on /dev/pmem0, missing codepage or helper program, or other error. XFS (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk XFS (pmem0): DAX and reflink cannot be used together! 26
  • 28. reflink ◼ Files share extents for same data ◼ Advantages ◼ Fast copy ◼ Save storage ◼ Copy on Write mechanism (CoW) ◼ Copy the shared extentsbefore writing data Copyright 2021 FUJITSU LIMITED File A File B Extent I File A Extent I Extent II copy File A File B Extent I File A Extent I File B File A Extent I write File B File A Extent I File B File A Extent I Copy then Write File Disk File Disk File Disk copy --reflink=always 27
  • 29. fsdax (Filesystem DAX) ◼ A mode of a NVDIMM namespace ◼ No page cache in I/O path ◼ Allows direct mappings to persistent memory media Copyright 2021 FUJITSU LIMITED read() write() mmap() vfs SSD/HDD page cache NVDIMM direct access (DAX) 28
  • 30. Why reflink and fsdax cannot work together? ◼Issues 1. Support reflink/dedupe for FSDAX ◼ Extent iomap interface ◼ Support CoW in fsdax ◼ Support Dedupe in fsdax 2. Improve NVDIMM-based Reverse mapping ◼ Support 1-to-N Reversed mapping for NVDIMM Copyright 2021 FUJITSU LIMITED 29
  • 31. Support reflink/dedupe for FSDAX ◼ Difference between Buffered-IO vs. FSDAX ◼ What must be implemented? ◼ How are they implemented? Copyright 2021 FUJITSU LIMITED 30
  • 32. Kernel page cache Comparison: Buffered-IO vs. FSDAX write() (1/2) Copyright 2021 FUJITSU LIMITED Sync to disk Get destination from XFS ( AllocateDelay extents ) CoW Read from disk Write data Disk SSD / HDD User Space write() ◼ Requires page cache 1. Read(Copy)destinationdata - from disk - to page cache 2. Write user data - from userspace - to page cache 3. Sync page cache to disk - remap CoWextents The progress of using page cache indicate the CoW operation. 1 2 Mark dirty iomap_begin() xfs_buffered_write_iomap_begin() iomap_end() xfs_buffered_write_end() actor() iomap_write_actor() Clean up if write fails Buffered-IO 3 31
  • 33. Kernel Comparison: Buffered-IO vs. FSDAX write() (2/2) Copyright 2021 FUJITSU LIMITED Disk NVDIMM User Space write() ◼ No page cache - directly access to NVDIMM 1. Not Copy - get NVDIMM addressonly 2. Directly Write - user datato NVDIMM - no need tosync - cannot remap extents FSDAX iomap_begin() xfs_direct_write_iomap_begin() iomap_end() NULL actor() dax_iomap_actor() Get NVDIMM address from direct_access() Get destination from XFS ( Allocateimmediateextents ) Write data 1 2 CoW is missing 32
  • 34. Kernel / FSDAX What must be implemented? ◼ Extent iomap interface ◼ Introduce ‘srcmap’ ◼ Fill ‘iomap’ & ‘srcmap’ for fsdax ◼ SupportCoW in fsdax ◼ Add CoW for write() ◼ Add CoW for mmap() ◼ Remap extents after CoW ◼ SupportDedupe in fsdax ◼ Add a ‘dax’ deduplication Copyright 2021 FUJITSU LIMITED • ALLOCATE new extent for CoW • STORE source extent info • COPY source data to destination • WRITE user data to NVDIMM • REMAP the extents of this file Enhancement in iomap framework iomap_begin() xfs_direct_write_iomap_begin() iomap_end() xfs_dax_write_iomap_end() actor() dax_iomap_actor() 33
  • 35. iomap interface - Introduce ‘srcmap’ ◼ What is necessary ◼ Require source info for CoW • Only use ‘iomap’ to tell destination to write • but CoW requires more info • source bno: where to copy from • source length: howmuch tocopy ◼ How? ◼ Introduce another struct called ‘srcmap’ • remember & pass the source info Copyright 2021 FUJITSU LIMITED File A Extent I copy File B * Introduced by Goldwyn Rodrigues: [PATCH] iomap: use a srcmapfor a read-modify-writeI/O ‘iomap’ • destination bno • destination length ‘srcmap’ • source bno • source length 34
  • 36. Kernel: FSDAX / XFS CoW iomap interface – Fill ‘iomap’ & ‘srcmap’ ◼ What is necessary ◼ Store source info in ‘srcmap’ • Only fill ‘iomap’ in ->iomap_begin() • Implementedby filesystem • Need to fill ‘srcmap’ for shared extent ◼ How? ◼ Add CoW branch • Allocate new extent for CoW • Fill ‘srcmap’ with destination extent • Fill ‘iomap’ with new extent • Set IOMAP_F_SHARED flag Copyright 2021 FUJITSU LIMITED get destination extent if shared? if HOLE? Y Y N allocate new extent fill ‘iomap’ with new extent set IOMAP_F_SHARED iomap_begin() xfs_direct_write_iomap_begin() fill ‘srcmap’ with destination 35
  • 37. Kernel: FSDAX CoW Support CoW - Add CoW for write() ◼ What is necessary ◼ Execute CoW in write() path • Only write user data to destination • Use ->direct_access() totranslate address - From: offset in block device - To: physical memory address in NVDIMM • Need a pre-copy before writing • src_addr: (translated) source address • dest_addr: (translated) destination address ◼ How? ◼ Add CoW branch • Copy source data from ‘srcmap’ to NVDIMM Copyright 2021 FUJITSU LIMITED Y get srcmap if CoW? copy source data to dest_addr ->direct_access() => src_addr N write user data to dest_addr actor() dax_iomap_actor() ->direct_access() => dest_addr from ‘iomap’ from ‘srcmap’ 36
  • 38. Kernel: FSDAX page fault CoW Support CoW - Add CoW for mmap() Copyright 2021 FUJITSU LIMITED Y get srcmap if CoW? copy source data to dest_addr ->direct_access() => src_addr N associate VMA with PFN ->direct_access() => PFN & dest_addr from ‘iomap’ from ‘srcmap’ ◼ What is necessary ◼ Execute CoW in page fault • FSDAX has its own specific PTE&PMDfault • Use iomap framework • Only find destinationPFN &associate VMA • Need a pre-copy before associating • PFN: destination PFN found by ->direct_access() • src_addr: (translated) source address • dest_addr: (translated) destination address ◼ How? ◼ Add CoW branch in PTE&PMD fault • Copy source data before VMA is associated • Prepare for writing user data in userspace 37
  • 39. Kernel: FSDAX / XFS iomap_end() xfs_dax_write_iomap_end() Support CoW - Remap extents after CoW ◼ What is necessary ◼ Remap extents for CoW • The new extent not been mapped • Metadata not been updated • File won’t contain the CoW extent • Need to update metadata after CoW ◼ How? ◼ Add remap in ->iomap_end() • Do remap operation if is CoW • Clean up if CoW operation fails Copyright 2021 FUJITSU LIMITED Y if CoW? remap extents N if error? Y clean up CoW extents Nothing to do N 38
  • 40. Support dedupe - Add a ‘dax’ deduplication ◼ What is necessary ◼ Deduplicate DAX files • Dedupe: reduce redundantdata on storage costs • Require a dedupe function • Only have generic dedupe function ➢ compare data in page caches • Not adapted to FSDAX ➢ no page cache ➢ directly compare data in NVDIMM ◼ Need a new dax compare function ◼ How? ◼ Introduce a DAX compare function • Compare data by memcmp() Copyright 2021 FUJITSU LIMITED File A compared same File B Extent I Extent II DAX compare function Kernel: FSDAX / Dedupe FileA ->direct_access() => a_addr FileB ->direct_access() => b_addr memcmp(a_addr, b_addr, len) result: same or not 39
  • 41. Improve the current NVDIMM-based Reverse mapping ◼ Why Reverse mapping needs to be improved ◼ Struggling to solve this issue ◼ What must be implemented? ◼ How are they implemented? Copyright 2021 FUJITSU LIMITED 40
  • 42. NVDIMM When Memory failure occurs Copyright 2021 FUJITSU LIMITED file A process 1 process 2 process 3 process 4 0xA000 0xA001 0xA002 0xA004 0xA005 0xA003 1. track all processes associating with the broken page on NVDIMM 2. send signal to kill those associated processes broken 41
  • 43. Need to improve NVDIMM-based Reverse mapping ◼ currently only support 1-to-1 mapping ◼ reflink on fsdax requires 1-to-N mapping Copyright 2021 FUJITSU LIMITED NVDIMM 0xA000 0xA001 0xA002 0xA004 0xA005 file A process 1 process 2 process 3 VMA VMA VMA file B process 4 process 5 VMA VMA ->mapping = mappingA ->index= offsetA file … … … 0xA003 Not supportedyet VMA VMA broken 42
  • 44. Struggling to solve this issue ◼ First idea was simple, but it was bad idea ◼ Simply make rbtree for 1-to-N rmap ◼ Current strategy after some struggles ◼ Chase filesystem internal to find 1-to-N relationship ◼ What is difficulty of this way? • Memory failure information is basically page unit. But we need to find where it is in filesystem • Filesystem may be created on partition, and/or LVM, it affects relative offset in the filesystem Copyright 2021 FUJITSU LIMITED No! Cause of huge over head see next page First idea mappingA offsetA mappingB offsetB mappingC offsetC rbroot refcount page->private page->index mappingA indexA page->mapping page->index NVDIMM file A 0xA0400 43
  • 45. Enhanced 1-to-N NVDIMM-basedRMAP File2 File3 Introduce 1-to-N NVDIMM-based RMAP Copyright 2021 FUJITSU LIMITED PMEM Middle Layers Middle Layers XFS EXT4 File1 System Memory … memory failure #MCE (Machine Check Exception) Sharing DAX file ProcessA Kill now ProcessB Kill later 1. MCE triggers ◼ memory-failure 2. 1-to-N RMAP through ◼ mm layer ◼ device driver ◼ block device layer ◼ filesystem layer ◼ files ◼ processes 3. Kill processes according to related file ◼ Kill current immediately ◼ Kill others later 1 2 3 44
  • 46. System Memory What must be implemented? 1. RMAP from NVDIMM driver to dax device 2. RMAP from dax device to filesystem ◼ Introduce dax_holder registration mechanism ◼ Types of holder • Filesystem • Partition • Mapped Device 3. RMAP from filesystem to file ◼ Require rmapbt feature ◼ Improve process collection and killing for FSDAX 4. Compatibility for no-reflink / no-rmapbt filesystem ◼ e.g., EXT4 does not support reflink & rmapbt Copyright 2021 FUJITSU LIMITED PMEM Middle Layers XFS Middle Layers EXT4 File … RMAP through all layers 1 2 File2 File3 4 3 45
  • 47. 1. RMAP from NVDIMM driver to dax device Copyright 2021 FUJITSU LIMITED ◼ Translate the PFN number into offset within a dax device ◼ PMEM driver (FSDAX [/dev/pmem0]) • Linear offset translation ◼ Dax driver (DEVDAX [/dev/dax0.0]) • Calculate according todax_ranges NVDIMM 0x00000 0xA0200 0xA0300 0xA0500 0xA0400 System Memory PFN NVDIMM offset PMEM (/dev/pmem0) dax device offset 0x00400 Dax (/dev/dax0.0) … … … … 0x00500 0x00600 0xA0600 0xA0700 memory-failure occurs on PFN(0xA0400) 46
  • 48. 2. RMAP from PMEM to filesystem (1/3) Copyright 2021 FUJITSU LIMITED ◼ PMEM (FSDAX) may be used in different ways 1. Filesystem: XFS, EXT4… 2. Partitions 3. Mapped Device: LVM… 4. Nested by Partitions or mapped Devices ◼ Introduce dax_holder registration mechanism ◼ The holder represents the inner layer of a PMEM ◼ Register when holder being mounted / initialized ◼ Interface for notifying memory-failure • holder_ops->notify_failure() PMEM Partition1 PMEM XFS filesystem PMEM VG XFS Partition2 EXT4 LV1 LV2 PMEM Partition1 Partition2 VG LV1 Usages of PMEM 47
  • 49. 2. RMAP from PMEM to filesystem (2/3) 1. Filesystem as a holder ◼ mkfs directly on a PMEM • `mkfs.xfs /dev/pmem0` ◼ No partition in PMEM ◼ Need translation • Remove the fixed bdev header length 2. Partition as a holder ◼ Parted by tools ◼ More than one partitions ◼ Need translation • iterate each partition’s range to get the location • remove the offset of the partition header Copyright 2021 FUJITSU LIMITED PMEM Partition1 Partition2 XFS EXT4 PMEM XFS filesystem bdev header bdev header Partitionheader 48
  • 50. 2. RMAP from PMEM to filesystem (3/3) Copyright 2021 FUJITSU LIMITED 3. Mapped device as a holder ◼ Created by LVM or other tools ◼ No partition, but one or many dm targets ◼ Types of dm targets • Linear, Raid, Crypt… ◼ Introduce RMAP from dm target to inner layer • Implement dm_target->rmap() in each target • This is reversed progress of the exist dm_target->map() ◼ Need translation • Iterate targets in a Mapped device • Find out which target is the broken PMEM • Remove dm_target offset for the inner layer • Continue RMAPin inner layer • The inner layer is also a holder (filesystem, partition…) PMEM1 PMEM2 EXT4 XFS LV1 LV2 VG bdev header dm_target offset dm target 1 dm target 2 49
  • 51. ◼ Require filesystem has ‘rmapbt’ feature ◼ rmapbt: Given an offset and length, search for extents contains it ◼ XFS provides the query interface • xfs_rmap_query_range() ◼ Search result could be • File content ➢ Kill processes whoare usingthis file ➢ Try to recovery filedata according to XFS logdevice • Filesystem metadata ➢ Hard to recovery online, shutdown filesystem and report error ◼ Improve process collection and killingfor FSDAX ◼ The original method is page-based • one page indicate one file (1-to-1) ◼ Need tobe changed tofile-based Copyright 2021 FUJITSU LIMITED 3. RMAP from filesystem to file 50
  • 52. 4. Compatibility for no-reflink / no-rmapbt filesystem ◼ Need to keep the original 1 page-to-1 file association ◼ Associate page->mapping & ->offset • Establish the RMAP relationship with page and file ◼ Not only keep the relationship, but also avoid the error • Currently it reports error if called more than once • Make it associate only once and only for the first time ◼ Need to keep the original RMAP routine ◼ The page-based memory-failure handler still works with support of above ◼ Fall back if 1-to-N RMAP routine get ‘-EOPNOTSUPP’ Copyright 2021 FUJITSU LIMITED NVDIMM file A 0xA0400 Associate 51
  • 53. Summary of the new 1-to-N RMAP solution ◼ 1-to-N RMAP has been implemented ◼ compatible for all NVDIMM modes ◼ compatible for all usage of PMEM ◼ compatible for all filesystems ◼ What is the next? ◼ Need to fix the race condition against unbind • With the help of 1-to-N RMAP, this can be fixed • My new code is basically for one page(PTE or PMD) of memory failure • Unbind is likely “a wide range of memory failure”, then we hope my code will help such case Copyright 2021 FUJITSU LIMITED 52
  • 55. We talked about the followings ◼ Basis of NVDIMM for Linux ◼ Issues of Filesystem-DAX (Direct Access mode) ◼ Deep dive to solve issues of Filesystem-DAX • Support reflink & dedupefor fsdax • Fix NVDIMM-based Reverse mapping Copyright 2021 FUJITSU LIMITED • Community has made many enhancement for NVDIMM on Linux • We have worked for NVDIMM to remove experimental statusof Filesystem DAX • We hope it will be achieved as soon as possible 54
  • 58. How to use NVDIMM in Linux ◼ Make region ◼ You can configure it on BIOS screen (or ipmctl command for Intel DCPMM) • Region is createdby hardware (memory controller) ◼ You need to reboot to enable the created region ◼ Make namespace ◼ Namespace is similar concept against SCSI LUN ◼ You can configure it by ndctl command ◼ Format Filesystem (storage or Filesystem DAX) ◼ If you would like to use Filesystem DAX, you need to select ext4 or xfs which support Filesystem DAX • Currently, if you select xfs, reflink option must bedisabled ◼ Mount Filesystem (storage or Filesystem DAX) with –o dax option ◼ Make pool (Filesystem DAX or Device DAX) by the PMDK tool if necessary Copyright 2021 FUJITSU LIMITED 57
  • 59. Users of NVDIMM at kernel layer ◼ dm-writecache ◼ Persistent cache for write at device mapper layer • dm-cache can use SSD as cache, dm-writecache can use NVDIMM • This presentation is helpful • https://github.com/ChinaLinuxKernel/CLK2019/blob/master/clk2019-dm-writecache-04.pdf ◼ Kernel shows Device dax as DRAM ◼ Though Intel DCPMM has “Memory Mode” which user can use it as huge RAM, it has some pains • Size of DRAM disappears, because it becomes just cache of NVDIMM • Software cannot choose area between DRAM or DCPMM ◼ In this kernel feature, system RAM size becomes sums of DRAM and NVDIMM • Use App Direct Mode and Device DAX namespace • Use Memory Hot-add Device DAX area • Becomes NUMA node Copyright 2021 FUJITSU LIMITED 58
  • 60. For log file libpmemlog For general purpose pmem allocation with transaction libpmemobj To create samesizeblocks and atomicallyupdate libpmemblk Transaction support PMDK (1/2) ◼ Set of libraries and tools for Filesystem DAX and Device DAX Copyright 2021 FUJITSU LIMITED Support for volatile memoryusage libmemkind libvmemcache For local persistent memory For remote access to persistent memory libpmem2 librpmem Low-level support C C++ PCJ Python LLPJ Language bindings InDevelopment: PCJ – Persistent Collectionfor Java LLPJ – Low-Level Persistent Java Library Python bindings libpmemkv C C++ Java JS Ruby 59 Tools back up device dax namespace daxio pool management pmempool • Not only Linux, but also you can use PMDK on Windows Note • Not all components areincludedin this figure • To bepreciselibmemkindis not memberofPMDK, but it is recommendedlibrary libpmem 59
  • 61. For log file libpmemlog For general purpose pmem allocation with transaction libpmemobj To create samesizeblocks and atomicallyupdate libpmemblk Transaction support PMDK (2/2) ◼ Typical libraries and tools Copyright 2021 FUJITSU LIMITED Support for volatile memoryusage libmemkind libvmemcache For local persistent memory For remote access to persistent memory libpmem2 librpmem Low-level support C C++ PCJ Python LLPJ Language bindings InDevelopment: PCJ – Persistent Collectionfor Java LLPJ – Low-Level Persistent Java Library Python bindings libpmemkv C C++ Java JS Ruby 60 Tools back up device dax namespace daxio pool management pmempool libpmem • The low layer library • Call mmap() to use NVDIMM, call suitable cpu cache flush instruction, etc… • Libpmem2 is newer library. Though it has some new feature, API is different from old libpmem 60
  • 62. For log file libpmemlog For general purpose pmem allocation with transaction libpmemobj To create samesizeblocks and atomicallyupdate libpmemblk Transaction support PMDK (2/2) ◼ Typical libraries and tools Copyright 2021 FUJITSU LIMITED Support for volatile memoryusage libmemkind libvmemcache For local persistent memory For remote access to persistent memory libpmem2 librpmem Low-level support C C++ PCJ Python LLPJ Language bindings InDevelopment: PCJ – Persistent Collectionfor Java LLPJ – Low-Level Persistent Java Library Python bindings libpmemkv C C++ Java JS Ruby 61 Tools back up device dax namespace daxio pool management pmempool libpmem • The high layer library which supports transactionof the objects on DAX • For general use-case • This is highly recommended library in PMDK • Users need to understand how to use its transaction 61
  • 63. For log file libpmemlog For general purpose pmem allocation with transaction libpmemobj To create samesizeblocks and atomicallyupdate libpmemblk Transaction support PMDK (2/2) ◼ Typical libraries and tools Copyright 2021 FUJITSU LIMITED Support for volatile memoryusage libmemkind libvmemcache For local persistent memory For remote access to persistent memory libpmem2 librpmem Low-level support C C++ PCJ Python LLPJ Language bindings InDevelopment: PCJ – Persistent Collectionfor Java LLPJ – Low-Level Persistent Java Library Python bindings libpmemkv C C++ Java JS Ruby 62 Tools back up device dax namespace daxio pool management pmempool libpmem • /dev/dax (device DAX)is character device. Then you cannot use dd for backup • daxiocommand is providedinstead of it 62
  • 64. What is new of libpmem2 ◼ New low-layer library • Introduce new concept “GRANULARITY” • PMEM2_GRANULARITY_PAGE : for traditionalSSD/HDD • PMEM2_GRANULARITY_CACHELINE : for persistent memory (the case for process needs flush cache to make persistency) • PMEM2_GRANULARITY_BYTE : for persistent memory (the case for platform support cpu cache persistency) • Introduce new functions to get unsafe shutdown status and bad block • This library uses library of ndctl command internally toget these information ◼ Its interfaceis different from old libpmem Copyright 2020 FUJITSU LIMITED 63
  • 65. Library for RDMA (1/2) ◼ I think RDMA is becoming important for NVDIMM ◼ DAX offers direct access method for local NVDIMM • However, modern system is a set of many computers which is connected by network • So, remote access is also important ◼ Traditional network stacks is too heavy to access remote NVDIMM ◼ Use case • For scalability • Ex) Distributed filesystem, Key Value Store, etc…. • For make data replication • Replace of NVDIMM module is difficult (as I talked at China Linux Kernel Developer Conference) https://www.slideshare.net/ygotokernel/the-ideal-and-reality-of-nvdimm-ras-newer-version • Etc. Copyright 2021 FUJITSU LIMITED RDMA is a good way to skip some redundant processing to access remote NVDIMM 64
  • 66. Library for RDMA (2/2) ◼ librpma ◼ The 2nd library for RDMA • The 1st library is librpmem of PMDK, but it is experimental • Not popular with users • Librpma has been developed with user’s requirement ◼ Characteristics • Relatively easier interface than libibverbs • Consideration of making persistency for remote NVDIMM • Though hardware can return ack before write completion, user can confirm it with librpma • See: https://www.openfabrics.org/wp-content/uploads/2020-workshop-presentations/202.-gromadzki-ofa-workshop-2020.pdf (*) The above table isquoted from this presentation Copyright 2021 FUJITSU LIMITED 65