How to Improve GFS/GFS2 File System Performance
and Prevent Processes from Hanging
Authors: John Ruemker, Shane Bradley, and Steven Whitehouse
Editor: Allison Pranger
02/04/2009, 10/12/2010

OVERVIEW
Cluster file systems such as the Red Hat Global File System (GFS) and Red Hat Global File System 2
(GFS2) are complex systems that allow multiple computers (nodes) to simultaneously share the same
storage device in a cluster.
There can be many reasons why performance does not match expectations. In some workloads or
environments, the overhead associated with distributed locking on GFS/GFS2 file systems might affect
performance or cause certain commands to appear to hang. This document addresses common problems
and how to avoid them, how to discover if a particular file system is affected by a problem, and how to know
if you have found a real bug (rather than just a performance issue).
This document is for users in the design stage of a cluster who want to know how to get the best from a
GFS/GFS2 file system, as well as for users of GFS/GFS2 file systems who need to track down a
performance problem in the system.
NOTE: This document provides recommended values only. Values should be thoroughly tested before
implementing in a production environment. Under some workloads, they might have a negative impact on the
performance of GFS/GFS2 file systems.

Environment
• Red Hat Enterprise Linux 4 and later

Terminology
This document assumes a basic knowledge and understanding of file systems in general. The following
subsections briefly discuss relevant terminology.
Inodes and Resource Groups
In the framework of GFS/GFS2 file systems, inodes correspond to file-system objects like files, directories,
and symlinks.
A resource group corresponds to the way GFS and GFS2 keep track of areas within the file system. Each
resource group contains a number of file system blocks, and there are bitmaps associated with each
resource group that determine whether each block of that resource group is free, allocated for data, or
allocated for an inode. Since the file system is shared, the resource group information/bitmaps and inode
information must be kept synchronized between nodes so the file system remains consistent (not corrupted)
on all nodes of the cluster.

Glocks
A glock (pronounced “gee-lock”) is a cluster-wide GFS lock. GFS/GFS2 file systems use glocks to
coordinate locking of file system resources such as inodes and resource groups. The glock subsystem
provides a cache-management function that is implemented using DLM as the underlying communication
layer.
Holders
When a process is using a GFS/GFS2 file-system resource, it locks the glock associated with that resource
and is said to be holding that glock. Each glock can have a number of holders that each lay claim to that
resource. Processes waiting for a glock are considered to be waiting to hold the glock, and they also have
holders attached to the glock, but in a waiting state.

THEORY OF OPERATION
Both GFS and GFS2 work like local file systems, except with regard to caching. In GFS/GFS2, caching is
controlled by glocks.
There are two essential things to know about caching in order to understand GFS/GFS2 performance
characteristics. The first is that the cache is split between nodes: either only a single node may cache a
particular part of the file system at one time, or, in the case of a particular part of the file system being read
but not modified, multiple nodes may cache the same part of the file system simultaneously. Caching
granularity is per inode or per resource group so that each object is associated with a glock (types 2 and 3,
respectively) that controls its caching.
The second thing to note is that there is no other form of communication between GFS/GFS2 nodes in the
file system. All cache-control information comes from the glock layer and the underlying lock manager
(DLM). When a node makes an exclusive-use access request (for a write or modification operation) to locally
cache some part of the file system that is currently in use elsewhere in the cluster, all the other cluster nodes
must write any pending changes and empty their caches. If a write or modification operation has just been
performed on another node, this requires both log flushing and writing back of data, which can be
tremendously slower than accessing data that is already cached locally.
These caching principles apply to directories as well as to files. Adding or removing a directory entry is the
same (from a caching point of view) as writing to a file, and reading the directory or looking up a single entry
is the same as reading a file. The speed is slower if the file or directory is larger, although it also depends on
how much of the file or directory needs to be read in order to complete the operation.
Reading cached data can be very fast. In GFS2, the code path used to read cached data is almost identical
to that used by the ext3 file system: the read path goes directly to the page cache in order to check the page
state and copy the data to the application. There will only be a call into the file system to refresh the pages if
the pages are non-existent or not up to date. GFS works slightly differently: it wraps the read call in a glock
directly; however, reading data that is already cached this way is still fast. You can read the same data at the
same speed in parallel across multiple nodes, and the effective transfer rate can be very large.
It is generally possible to achieve acceptable performance for most applications by being careful about how
files are accessed. Simply taking an application designed to run on a single node and moving it to a cluster
rarely improves performance. For further advice, contact Red Hat Support.
FILE-SYSTEM DESIGN CONSIDERATIONS
Before putting a clustered file system into production, you should spend some time designing the file system
to allow for the least amount of contention between nodes in the cluster. Since access to file-system blocks
is controlled by glocks that potentially require inter-node communications, you will get the best
performance if you design your file system to avoid contention.

File/Directory Contention
If, for example, you have dozens of nodes that all mount the same GFS2 file system and all access the
same file, then access will only be fast if all nodes have read-only access (nodes mounted with the
noatime mount option). As soon as there is one writer to the shared file, the performance will slow down
dramatically. If the application knows when it has written a file that will not be used again on the local node,
then calling fsync and then fadvise/madvise with the DONTNEED flag (POSIX_FADV_DONTNEED/MADV_DONTNEED) will help to speed up access from
other cluster nodes.
The other important item to note is that for directories, file create/unlink activity has the same effect as a
write to a regular file: it requires exclusive access to the directory to perform the operation and then
subsequent access from other nodes requires rereading the directory information into cache, which can be a
slow operation for large directories. It is usually better to split up directories with a lot of write activity into
several subdirectories that are indexed by a hash or some similar system in order to reduce the number of
times each individual directory has to be reread from disk.
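As a rough illustration only (the paths and hashing scheme below are hypothetical, not a prescribed layout), a small shell sketch of this approach might spread new files across sixteen subdirectories keyed on a hash of the file name:
# Hypothetical sketch: choose a bucket from the first hex digit of an md5 hash of the name.
name="message-12345"
bucket=$(printf '%s' "$name" | md5sum | cut -c1)
mkdir -p "/mnt/gfs2/spool/$bucket"
mv "/tmp/$name" "/mnt/gfs2/spool/$bucket/$name"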

Resource-Group Contention
GFS/GFS2 file systems are logically divided into several areas known as resource groups. The size of the
resource groups can be controlled by the mkfs.gfs/mkfs.gfs2 command (-r parameter). The
GFS/GFS2 mkfs program attempts to estimate an optimal size for your resource groups, but it might not be
precise enough for optimal performance. If you have too many resource groups, the nodes in your cluster
might waste unnecessary time searching through tens of thousands of resource groups trying to find one
suitable for block allocation. On the other hand, if you have too few resource groups, each will cover a larger
area, so block allocations might suffer from the opposite problem: too much time wasted in glock contention
waiting for available resource groups. You might want to experiment with different resource-group sizes to
find one that optimizes system performance.
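For example, the resource-group size can be set explicitly at format time with the -r option; the command below creates 512MB resource groups (the device, cluster name, file-system name, and journal count are illustrative only):
# mkfs.gfs2 -p lock_dlm -t mycluster:appdata -j 3 -r 512 /dev/clustervg/lv1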

Block-Size Considerations
When the file system is formatted with the mkfs.gfs/mkfs.gfs2 command, you may specify a block size
with -b. If no size is specified, the default is 4K. Different block sizes will often provide different performance
characteristics for your application. Most hardware is designed to operate efficiently with the default block
size of 4K.
Using the default 4K block size is recommended for all file systems. However, if there is a requirement for
efficient storage of very small files, 1K should be considered the minimum block size (-b 1024). Ideal block
size depends on how the file system is used. You might want to experiment with different block sizes to find
one that optimizes system performance.
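As a sketch, a smaller block size is selected in the same way at format time (again, the names below are examples only):
# mkfs.gfs2 -p lock_dlm -t mycluster:smallfiles -j 3 -b 1024 /dev/clustervg/lv2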

MOUNT OPTIONS
Unless atime support is essential, Red Hat recommends setting noatime on every GFS/GFS2 mount
point. This will significantly improve performance since it prevents reads from turning into writes. With GFS2
on Red Hat Enterprise Linux 6 and later, there is also the option of relatime, which updates atime when
other timestamps are being updated; however, noatime is still recommended.
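For example, a typical /etc/fstab entry using noatime might look like the following (the device and mount point are illustrative):
/dev/clustervg/lv1    /mnt/appdata    gfs2    defaults,noatime    0 0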
Do not use the journaled data mode (chattr +j) unless it is required. The default ordered-data mode will
prevent any uninitialized data from appearing in files after a crash. To ensure data is actually on disk, use
fsync(2) for both the parent directory and newly created files.

ANSWERS TO COMMON QUESTIONS
If your cluster is running slowly or appears to be stopped and you are not sure why, the steps below should
help to resolve the issue. Remember that Red Hat Support is always available to help.

First Steps
Begin by collecting the answers to several simple questions that Red Hat Support will ask when you contact
them:
• What is the workload on the cluster?
• What applications are running, and are they using large or small files?
• How are the files organized?
• What is the architecture of the cluster?
• How many nodes?
• What is the storage, and how is it attached?
• How large is each file system?
• What are the timing constraints?
• Does the issue always occur at a certain time of day or have some relationship with a particular
event (for example, nightly backups)?
• Is the problem a performance problem (slow), a real bug (completely stuck, kernel panic, file-system
assertion), or a corruption issue (usually indicated by a file-system withdraw)?
• Is the problem reproducible, or did it happen only once?
• Does the problem occur on a single node or in a cluster situation?
• Does the problem occur on the same node, or does it move around?
Of course, not every situation will require the same set of information, but the answers to these questions
will give you a good idea of where to start looking for the root of the problem.
If the problem always occurs at a specific time, look for periodic processes that might be running (not all are
in crontab, but that is a good place to begin).

Is a Task Stuck or Just Slow?
It is often difficult to tell if a task is completely stuck or if it is just slow, but there are signs that point to poor
performance resulting from contention for glocks. One example is increased network traffic. A significant
amount of DLM network traffic indicates that there is a lot of locking activity and thus potentially a lot of
cache invalidation. How much locking traffic is normal depends on each individual situation, so locking activity
should be assessed as a proportion of the total network bandwidth rather than measured against specific metrics. Also,
increased locking activity is only an indication of a problem and not a guarantee that one exists. The actual
level of locking activity is highly dependent on the workload.
Information from glock dumps can be used to show whether tasks are still making progress. Take two glock
dumps, spaced apart by a few seconds or a few minutes, and then look for glocks with a lot of waiting
holders (ignoring granted holders). If the same glocks have exactly the same list of waiting holders in the
second glock dump as they did in the first, it is an indication that the cluster has become stuck. If the list has
changed at all, then the cluster is just running slowly.
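For GFS2, the glock dump is read from debugfs; for GFS, it is produced by gfs_tool. A minimal sketch, assuming debugfs is mounted at /sys/kernel/debug and a GFS2 file system named mycluster:appdata, might look like this:
# cat /sys/kernel/debug/gfs2/mycluster:appdata/glocks > /tmp/glocks.1
# sleep 120
# cat /sys/kernel/debug/gfs2/mycluster:appdata/glocks > /tmp/glocks.2
On GFS, the equivalent dump is produced with gfs_tool lockdump /path/to/mount.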
Sometimes taking a glock dump can take a long time due to the amount of data involved, which depends on
the number of cached glocks (GFS/GFS2 keep a large number of glocks in memory for performance
purposes). The time needed will change depending on the total memory size of the node in question and the
amount of GFS/GFS2 activity that has taken place on that node.
GFS2 tracepoints (available in Red Hat Enterprise Linux 6 and later) can also be used to monitor activity in
order to see whether any nodes are stuck.
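As a brief example of that interface (assuming debugfs is mounted at /sys/kernel/debug), all GFS2 tracepoints can be enabled and the trace output streamed as follows:
# echo 1 > /sys/kernel/debug/tracing/events/gfs2/enable
# cat /sys/kernel/debug/tracing/trace_pipe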

Which Task is Involved in the Slowdown?
The glock dump file includes details of the tasks that have requested each glock (the owner of each holder).
This information can be used to find out which task is stuck or involved in contention for a glock.

Which Inode is Contended?
Glock numbers are made up of two parts. In the glock dump, glock numbers are represented as type,
number. Type 2 indicates an inode, and type 3 indicates a resource group. There are additional types of
glocks, but the majority of slowdown and glock-contention issues will be associated with these two glock
types.
The number of the type 2 glocks (inode glocks) indicates the disk location (in file system blocks) of the inode
and also serves as the inode identification number. You can convert the inode number listed in the glock
dump from hexadecimal to decimal format and use it to track down the inode associated with that glock.
Identifying the contended inode should be possible using find -inum, preferably on an otherwise idle file
system since it will try to read all the inodes in the file system, making any contention problem worse.
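For example (the glock number below is made up for illustration), the conversion and lookup can be done from the shell:
# printf '%d\n' 0x1a2b3c
1715004
# find /mnt/gfs2 -inum 1715004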

Why Does gfs2_quotad Appear Stuck When I'm Not Even Using Quotas?
The gfs2_quotad process performs two jobs. One of those is related to quotas, and the other is updating
the statfs information for the file system. If a problem occurs elsewhere in the file system, gfs2_quotad
often becomes stuck since the periodic writes to update the statfs information can become queued behind
other operations in the system. If gfs2_quotad appears stuck, it is usually a symptom of a different
problem elsewhere in the file system.

Is It Worth Trying to Reproduce a Problem While Only a Single Node is Mounted?
In almost every case, you should try to reproduce a problem while only a single node is mounted. If the
problem does reproduce on a single node, it is probably not related to clustering at all. If the problem only
appears in the cluster, it indicates either an I/O issue or a contention issue on one or more inodes in the
cluster.

How Can I Calculate the Maximum Throughput of GFS/GFS2 on My Hardware?
Maximum throughput depends on the I/O pattern from the application, the I/O scheduler on each node, the
distribution of I/O among the nodes, and the characteristics of the hardware itself.
One simple example involves two nodes, each performing streaming writes to its own file on a GFS2 file
system. This scenario can be simulated at the block-device level by creating two streams of I/O to different
parts of the shared block device using dd. This test will allow you to measure the absolute maximum
performance that the hardware can sustain. Actual file-system performance will differ due to the overhead of
block allocation and file-system metadata, but this test will provide an upper limit.
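A hedged sketch of such a test is shown below; the device name, sizes, and offsets are examples only, and note that writing directly to the shared block device destroys any data on it, so this should only be done before the file system is created.
On node 1:
# dd if=/dev/zero of=/dev/mapper/shared_lun bs=1M count=4096 oflag=direct
On node 2, writing to a different region of the device:
# dd if=/dev/zero of=/dev/mapper/shared_lun bs=1M count=4096 seek=8192 oflag=direct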
In this example, we are assuming that the block device is a single, shared, rotational hard disk that is able to
write to only a single sector at once. The two streams of I/O will be sent to the disk by the I/O schedulers on
the two different nodes, each without any knowledge of the other. The disk must perform scheduling in order
to move the disk head between the two areas of the disk receiving the streams. This process will be slow,
and it might even be slower than writing the two streams of I/O sequentially from a single node.
If, on the other hand, we assume that the block device in the example is a RAID array with many spindles,
then the two streams of I/O may be written at the same time without having to move disk heads between the
two areas. This will improve performance.
Storage hardware must be specified according to the expected I/O patterns from the nodes so that it is
capable of delivering the level of performance that the file system requires.

What About I/O Barriers?
Beginning with Red Hat Enterprise Linux 6, GFS2 uses I/O barriers by default when flushing the log. Red
Hat recommends the use of I/O barriers for most block devices; however, barriers are not required in all
cases and can sometimes be detrimental to performance, depending on how the storage device implements
them. If the shared block device has no write cache or if the write cache is not volatile (for example, if it is
powered from a UPS or similar device), then you might wish to turn off barrier support.
You can prevent the use of I/O barriers by setting the nobarrier option with the mount command (or in
/etc/fstab). If the underlying storage does not support them, I/O barriers will automatically be turned off,
indicated by a log message and the appearance of the nobarrier option in /proc/mounts as if it had
been set using the command line.
GFS2 only issues a single barrier each time it flushes the log. The total number of barriers issued over time
can be minimized by reducing the number of operations that result in a log flush (for example, fsync(2) or
glock contention) or (depending on workload) increasing the journal size. This has potential side-effects, so
you should attempt to strike a balance between performance and the potential for data loss in the event of a
node failure.

Can I Use Discard Requests for Thin-Provisioning?
On Red Hat Enterprise Linux 6 and later, GFS2 supports the generation of discard requests. These requests
allow the file system to tell the underlying block device which blocks are no longer required due to
deallocation of a file or directory. Sending a discard request implies an I/O barrier.
There is a small performance penalty when these requests are generated by GFS2, and the performance
penalty might be larger depending on how the underlying storage device interprets the requests. In order to
increase performance and decrease overhead, GFS2 saves up as many requests as possible and merges
them into a single request whenever it is able.
In order to turn on this feature, you need to set the discard option with the mount command. This feature
will only work when both the volume manager and the underlying storage device support the requests.
Some storage devices might consider the requests a suggestion rather than a requirement (for example, if
the file system requests a discard of a single block, but the underlying storage is only able to discard larger
chunks of storage). Other storage devices might not deallocate any storage at all, but they might use the
suggestion to implement a secure delete by zeroing out the selected blocks.

Are There Benefits to Using Solid-State Storage with GFS/GFS2?
In general, solid-state storage has a much lower seek time, which can improve overall system performance,
particularly where glock contention is a major factor.

Is Network Traffic a Major Factor in GFS/GFS2 Performance?
Network traffic should not greatly affect overall system performance, provided that there is enough
bandwidth for the cluster infrastructure to keep quorum and synchronize any POSIX locks, that multicast is
working between all the nodes, and that DLM is able to communicate its lock requests.
However, if a network device is shared between storage and/or application traffic as well as cluster traffic,
Red Hat recommends using tc to implement suitable bandwidth limits. The use of jumbo frames is not
recommended unless the network is carrying storage traffic. Latency is generally regarded as more
important than overall throughput in terms of GFS/GFS2 performance.
Watching traffic levels to ensure that the links do not saturate is a sensible policy since that can be a
warning sign of other issues (such as contention), but traffic statistics otherwise do not provide a great deal
of useful information.

What if I Have a Different Problem?
Red Hat Support representatives are available to help if you cannot find the answer to your question here.
If you have experienced a kernel Oops, assertion, or similar problem, contact Red Hat Support immediately.
If you have experienced a file-system withdraw, it is almost always due to corruption of the file system, and
fsck.gfs/fsck.gfs2 can usually fix the problem. Unmount the file system on all nodes, take a backup,
and then run fsck on the file system. Keep any output from fsck, as it will be needed in the event that
fsck cannot fix the problem. Contact Red Hat Support if fsck fails to resolve the problem.

APPLICATIONS
Email (imap/sendmail/etc.)
Locality of access often causes problems when email is run on clustered file systems. To optimize
performance, Red Hat recommends arranging for users to have a "home" node to which each user connects
by default, assuming that all the cluster nodes are working normally. If a problem occurs in the cluster, users
can then be moved to another home node where all of their files are cached. This reduces cross-node
invalidation.
It is also possible to use the same technique on the delivery side by setting up the MTA to forward to the user's
home node and letting that node write to the file system. If that node isn't available, then the MTA can write the
message directly. Using maildir instead of mbox also helps scalability since each message, rather than
the whole mailbox, has its own lock. When performance issues occur in maildir setups, they are almost
always the result of contention on the directory lock.

Backup
Backup can affect performance since the process of backing up a node or set of nodes usually involves
reading the entire file system in sequence. If a single node performs the backup, that node will retain all that
information in cache until other nodes in the cluster start requesting locks.
Running this type of backup program while the cluster is in operation is a sure way to reduce system
performance, but there are a number of ways to reduce the detrimental effect. One way is to drop the caches
using echo -n 3 >/proc/sys/vm/drop_caches after the backup has completed. This reduces the
time required by other nodes to get their glocks/caches back. However, this method is not ideal because the
other nodes will have stopped caching the data that they were caching before the backup process started.
Also, there is an effect on the overall cluster performance while the backup is running, which is often not
acceptable.
Another method is to back up at the block-device level and take a snapshot. This is currently only supported
in cases where there is provision for a snapshot at the hardware (storage array) level.
A better solution is to back up the working set of each cluster node from the node itself. This distributes the
workload of the backup across the cluster and keeps the working set in cache on the nodes in question. This
often requires custom scripting.

Web Servers
GFS/GFS2 file systems are well suited to web serving since serving web pages tends to involve a large
amount of data that can be cached on all nodes. Issues can arise when data has to be updated, but you can
reduce the potential for contention by preparing a new copy of the website and switching over rather than
trying to update the files in place. Red Hat recommends making the root of the website a bind mount and
using mount --move to have the web server(s) use a new set of files.
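A minimal sketch of that switch-over follows (the paths are illustrative, and details such as mount propagation may need adjusting for a particular environment): a freshly prepared copy of the site is bind-mounted at a staging point and then moved over the web root.
# mount --bind /mnt/gfs2/site-v2 /mnt/stage
# mount --move /mnt/stage /var/www/html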

SYSTEM CALLS
Read/Write
Read/write performance should be acceptable for most applications, provided you are careful not to cause
too many cross-node accesses that require cache sync and/or invalidation.
Streaming writes on GFS2 are currently slower than on GFS. This is a direct consequence of the locking
hierarchy and results from GFS2 performing writes on a per-page basis like other (local) file systems. Each
page written has a certain amount of overhead. Due to the different lock ordering, GFS does not suffer from
the same problem since it is able to perform the overhead operations once for multiple pages.
Speed of multiple-page write calls aside, there are many advantages to the GFS2 file system, including
faster performance for cached reads and simpler code for deadlock avoidance during complicated write calls
(for example, when the source page being written is from a memory-mapped file on a different file system
type). Red Hat is currently working to allow multiple-page writes, which will make GFS2’s streaming write
calls faster than the equivalent GFS operation.
This streaming-writes issue is the only known regression in speed between GFS and GFS2. Smaller writes
(page sized and below) on GFS2 are faster than on GFS.

Memory Mapping
GFS and GFS2 implement memory mapping differently. In GFS (and some earlier GFS2 kernels), a page
fault on a writable shared mapping would always result in an exclusive lock being taken for the inode in
question. This is a consequence of an optimization that was originally introduced for local file systems where
pages would be made writable on the initial page fault in order to avoid a potential second fault later (if the
first access was a read and a subsequent access was a write).
In Red Hat Enterprise Linux 6 (and some later versions of Red Hat Enterprise Linux 5) kernels, GFS2 has
implemented a system of only providing a read-only mapping for read requests, significantly improving
scalability. A file that is mapped on multiple nodes of a GFS2 cluster in a shared writable manner can be
cached on all nodes, provided no writes occur.
NOTE: While in theory you can use the feature of shared writable memory mapping on a single shared file to
implement distributed shared memory, any such implementation would be very slow due to cache-bouncing
issues. This is not recommended. Sharing a read-only file in this way is acceptable, but only on recent
kernels with GFS2. If you need to share a file in this way on GFS, then open and map it read only to avoid
locking issues.

Cache Control (fsync/fadvise/madvise)
Both GFS and GFS2 support fsync(2), which functions the same way as in any local file system.
When using fsync(2) with numerous small files, Red Hat recommends sorting them by inode number. This
will improve performance by reducing the disk seeks required to complete the operation. If it is possible to
save up fsync(2) on a set of files and sync them all back together, it will help performance when
compared with using either O_SYNC or fsync(2) after each write.
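The ordering itself is easy to inspect from the shell; for instance, the following lists the files in a directory in inode-number order (the path is an example):
# ls -i /mnt/gfs2/spool | sort -n | awk '{print $2}'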
To improve performance with GFS2, you can use the fadvise and/or madvise pair of system calls to
request read ahead or cache flushing when it is known that data will not be used again (GFS does not
support the fadvise/madvise interface). Overall performance can be significantly improved by flushing the
page cache for an inode when it will not be used from a particular node again and is likely to be requested
by another node.
It is also possible to drop caches globally. Using echo -n 3 >/proc/sys/vm/drop_caches will drop
the caches for all file systems on a node and not just GFS/GFS2. This can be useful when you have a
problem that might be caused by caching and you want to run a cache cold test, for example. However, it
should not be used in the normal course of operation (see the Backup section above).

File Locking
The locking methods below are only recommendations as GFS and GFS2 do not support mandatory locks.
flock
The flock system call is implemented by type 6 glocks and works across the cluster in the normal way. It is
affected by the localflocks mount option, as described below. Flocks are a relatively fast method of file
locking and are preferred to fcntl locks on performance grounds (the difference becomes greater on
clusters with larger node counts).
fcntl (POSIX Locks)
POSIX locks have been supported on a single-node basis since the early days of GFS. The ability to use
these locks in a clustered environment was added in Red Hat Enterprise Linux 5. Unlike the other
GFS/GFS2 locking implementations, POSIX locks do not use DLM and are instead performed in user space
via corosync. By default, POSIX locks are rate limited to a maximum of 100 locks per second in order to
conserve network bandwidth that might otherwise be flooded with POSIX-lock requests. To raise the limit,
you can edit the cluster.conf file (setting the limit to 0 removes it altogether).
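As a hedged illustration only (the exact element name varies between releases, so check the cluster.conf schema for your version), the limit is typically adjusted with an entry such as the following, where 0 removes the rate limit:
<dlm plock_rate_limit="0"/>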
NOTE: Some applications using POSIX locks might try to use F_GETLK fcntl to try to obtain the PID of a
blocking process. This will work on GFS/GFS2, but due to clustering, the process might not be on the same
node as the process that used F_GETLK. Sending a signal is not as straightforward in this case, and care
should be taken not to send signals to the wrong processes.

It is possible to make use of POSIX locks on a single-node basis by setting the localflocks mount option.
This also affects flock(2), but it is not usually a problem since it is unusual to require both forms of locking
for a single application.
NOTE: The localflocks option must be set for all NFS-exported GFS2 file systems, and the only supported NFS-over-GFS/GFS2 solutions are those with only a single active NFS server at a time, designed for active/passive
failover. NFS is not currently supported in combination with either Samba or local applications.

Due to the user-space implementation of POSIX locks, they are not suitable for high-performance locking
requirements.
Leases
Leases are not supported on either GFS or GFS2.

DLM
There is no reason why applications should not make use of the DLM directly. Interfaces are available, and
details can be found in the DLM documentation.

RECOMMENDED TUNABLE SETTINGS
The following sections describe recommended values for GFS tunable parameters.

glock_purge
In Red Hat Enterprise Linux 4.6/5.1 and later, a GFS tunable parameter, glock_purge, has been added to
reduce the total number of locks cached for a particular file system on a cluster node.
NOTE: This setting does not exist in Red Hat Enterprise Linux 6 or later, and it is not a recommended solution
to any problem for which there is another solution. In Red Hat Enterprise Linux 6 and later, this parameter is self
tuning, and caching can be controlled from the userspace via the fsync/fadvise system calls as described
earlier in this document.

This tunable parameter defines the percentage of unused glocks for a file system to clear every five
seconds, as shown below, where X is an integer between 0 and 100 indicating the percentage to clear.
# gfs_tool settune /path/to/mount glock_purge X
A setting of 0 disables glock_purge. This value is typically set between 30 and 60 to start and can
be further tuned based on testing and performance benchmarks. This setting is not persistent, so it must be
reapplied every time the file system is mounted. It is typically placed in /etc/rc.local or
/etc/init.d/gfs in the start function on every node so that it is applied at boot time after the file systems
are mounted.
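For example, a line such as the following (the mount point and percentage are illustrative) can be appended after the file systems are mounted in the start-up script:
gfs_tool settune /mnt/gfs1 glock_purge 50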

demote_secs
Another tunable parameter, demote_secs, can be used in conjunction with glock_purge. This tunable
parameter demotes GFS write locks into less restrictive states and subsequently flushes the cached data to
disk. A shorter demote interval can be used to avoid the accumulation of too much cached data, which otherwise
results in bursts of flushing activity that prolong lock access from other nodes.
NOTE: This setting does not exist in Red Hat Enterprise Linux 6 or later, and it is not a recommended solution
to any problem for which there is another solution. In Red Hat Enterprise Linux 6 and later, this parameter is self
tuning, and caching can be controlled from the userspace via the fsync/fadvise system calls as described
earlier in this document.

The default value is 300 seconds. To enable the demoting every 200 seconds on mount point /mnt/gfs1,
enter the following command:
# gfs_tool settune /mnt/gfs1 demote_secs 200
To set back to default of 300 seconds, enter the following command:
# gfs_tool settune /mnt/gfs1 demote_secs 300
Note that this setting only applies to an individual file system, so multiple commands must be used to apply it
to more than one mount point.

statfs_fast
The statfs_fast tunable parameter can be used in Red Hat Enterprise Linux 4.5 or later to speed up
statfs calls on GFS.
NOTE: For Red Hat Enterprise Linux 6 and later, this can be set via the mount command line using the
statfs_quantum and statfs_percent mount arguments. This is the preferred method since it is then set at
mount time and does not require a separate tool to change it.

To enable statfs_fast, enter the following command:
# gfs_tool settune /path/to/mount statfs_fast 1
Red Hat recommends the use of mount options noquota, noatime, and nodiratime for GFS file
systems, if possible, as they are known to improve performance in many cases. They can be added in
/etc/fstab, as shown below.
/dev/clustervg/lv1    /mnt/appdata    gfs    defaults,noquota,noatime,nodiratime    0 0

NOTE: An issue has been reported to Red Hat Engineering regarding the usage of noquota in Red Hat
Enterprise Linux 5.3: Why do I get a mount error reporting 'Invalid argument' on my GFS or GFS2 file system on
Red Hat Enterprise Linux 5.3?.

Disabling ls Colors
It might also be beneficial to remove the aliases for the ls command that cause it to display colors in its
output when using the bash or csh shells. By default, Red Hat Enterprise Linux systems are configured with
the following aliases from /etc/profile.d/colorls.sh and colorls.csh:
# alias | grep 'ls'
alias l.='ls -d .* --color=tty'
alias ll='ls -l --color=tty'
alias ls='ls --color=tty'
In situations where a GFS file system is slow to respond, the first response of many users is to run ls in
order to determine the problem. If the --color option is enabled, ls must run a stat() against every
entry, which creates additional lock requests and can create contention for those files with other processes.
This might exacerbate the problem and cause slower response times for processes accessing that file
system. To prevent ls from using the --color=tty option for all users, the following lines can be added to
the end of /etc/profile:
alias ll='ls -l' 2>/dev/null
alias l.='ls -d .*' 2>/dev/null
unalias ls
These lines can also be placed in a user’s ~/.bash_profile to disable --color=tty on an individual
basis.
In general, however, it is best to avoid excessive use of the ls command due to the locking overhead.
Copyright © 2011 Red Hat, Inc. “Red Hat,” Red Hat Linux, the Red Hat “Shadowman” logo, and the products
listed are trademarks of Red Hat, Inc., registered in the U.S. and other countries. Linux® is the registered
trademark of Linus Torvalds in the U.S. and other countries.

www.redhat.com

 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

How to Improve GFS/GFS2 File System Performance

  • 1. How to Improve GFS/GFS2 File System Performance and Prevent Processes from Hanging Author: John Ruemker, Shane Bradley, and Steven Whitehouse Editor: Allison Pranger 02/04/2009, 10/12/2010 OVERVIEW Cluster file systems such as the Red Hat Global File System (GFS) and Red Hat Global File System 2 (GFS2) are complex systems that allows multiple computers (nodes) to simultaneously share the same storage device in a cluster. There can be many reasons why performance does not match expectations. In some workloads or environments, the overhead associated with distributed locking on GFS/GFS2 file systems might affect performance or cause certain commands to appear to hang. This document addresses common problems and how to avoid them, how to discover if a particular file system is affected by a problem, and how to know if you have found a real bug (rather than just a performance issue). This document is for users in the design stage of a cluster who want to know how to get the best from a GFS/GFS2 file system, as well as for for users of GFS/GFS2 file systems who need to track down a performance problem in the system. NOTE: This document provides recommended values only. Values should be thoroughly tested before implementing in a production environment. Under some workloads, they might have a negative impact on the performance of GFS/GFS2 file systems. Environment • Red Hat Enterprise Linux 4 and later Terminology This document assumes a basic knowledge and understanding of file systems in general. The following subsections briefly discuss relevant terminology. Inodes and Resource Groups In the framework of GFS/GFS2 file systems, inodes correspond to file-system objects like files, directories, and symlinks. A resource group corresponds to the way GFS and GFS2 keep track of areas within the file system. Each resource group contains a number of file system blocks, and there are bitmaps associated with each resource group that determine whether each block of that resource group is free, allocated for data, or allocated for an inode. Since the file system is shared, the resource group information/bitmaps and inode information must be kept synchronized between nodes so the file system remains consistent (not corrupted) on all nodes of the cluster. How to Improve GFS/GFS2 File System Performance | Ruemker, Bradley, and Whitehouse 1
  • 2. Glocks A glock (pronounced “gee-lock”) is a cluster-wide GFS lock. GFS/GFS2 file systems use glocks to coordinate locking of file system resources such as inodes and resource groups. The glock subsystem provides a cache-management function that is implemented using DLM as the underlying communication layer. Holders When a process is using a GFS/GFS2 file-system resource, it locks the glock associated with that resource and is said to be holding that glock. Each glock can have a number of holders that each lay claim on that resource. Processes waiting for a glock are considered to be waiting to hold the glock, and they also have holders attached to the glock, but in a waiting state. THEORY OF OPERATION Both GFS and GFS2 work like local file systems, except in regards to caching. In GFS/GFS2, caching is controlled by glocks. There are two essential things to know about caching in order to understand GFS/GFS2 performance characteristics. The first is that the cache is split between nodes: either only a single node may cache a particular part of the file system at one time, or, in the case of a particular part of the file system being read but not modified, multiple nodes may cache the same part of the file system simultaneously. Caching granularity is per inode or per resource group so that each object is associated with a glock (types 2 and 3, respectively) that controls its caching. The second thing to note is that there is no other form of communication between GFS/GFS2 nodes in the file system. All cache-control information comes from the glock layer and the underlying lock manager (DLM). When a node makes an exclusive-use access request (for a write or modification operation) to locally cache some part of the file system that is currently in use elsewhere in the cluster, all the other cluster nodes must write any pending changes and empty their caches. If a write or modification operation has just been performed on another node, this requires both log flushing and writing back of data, which can be tremendously slower than accessing data that is already cached locally. These caching principles apply to directories as well as to files. Adding or removing a directory entry is the same (from a caching point of view) as writing to a file, and reading the directory or looking up a single entry is the same as reading a file. The speed is slower if the file or directory is larger, although it also depends on how much of the file or directory needs to be read in order to complete the operation. Reading cached data can be very fast. In GFS2, the code path used to read cached data is almost identical to that used by the ext3 file system: the read path goes directly to the page cache in order to check the page state and copy the data to the application. There will only be a call into the file system to refresh the pages if the pages are non-existent or not up to date. GFS works slightly differently: it wraps the read call in a glock directly; however, reading data that is already cached this way is still fast. You can read the same data at the same speed in parallel across multiple nodes, and the effective transfer rate can be very large. It is generally possible to achieve acceptable performance for most applications by being careful about how files are accessed. Simply taking an application designed to run on a single node and moving it to a cluster rarely improves performance. For further advice, contact Red Hat Support. 
How to Improve GFS/GFS2 File System Performance | Ruemker, Bradley, and Whitehouse 2
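The split-cache behaviour described in the theory section above can be observed with a simple timing test on a two-node cluster. The mount point and file below are hypothetical, and exact timings will vary, but the pattern (fast repeated reads on one node until another node writes) should be visible.

    # On node A: the first read is cold, the second is served from local cache
    time cat /mnt/appdata/data/big.bin > /dev/null
    time cat /mnt/appdata/data/big.bin > /dev/null

    # On node B: a write takes the inode's glock exclusively, forcing node A
    # to flush and drop its cached pages for this file
    echo "update" >> /mnt/appdata/data/big.bin

    # Back on node A: the read is slow again because its cache was invalidated
    time cat /mnt/appdata/data/big.bin > /dev/null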
FILE-SYSTEM DESIGN CONSIDERATIONS
Before putting a clustered file system into production, you should spend some time designing the file system to allow for the least amount of contention between nodes in the cluster. Since access to file-system blocks is controlled by glocks that potentially require inter-node communication, you will get the best performance if you design your file system to avoid contention.

File/Directory Contention
If, for example, you have dozens of nodes that all mount the same GFS2 file system and all access the same file, then access will only be fast if all nodes have read-only access (nodes mounted with the noatime mount option). As soon as there is one writer to the shared file, performance will slow down dramatically. If the application knows when it has written a file that will not be used again on the local node, then calling fsync and then fadvise/madvise with the DONTNEED flag (POSIX_FADV_DONTNEED/MADV_DONTNEED) will help to speed up access from other cluster nodes.
The other important item to note is that, for directories, file create/unlink activity has the same effect as a write to a regular file: it requires exclusive access to the directory to perform the operation, and subsequent access from other nodes requires rereading the directory information into cache, which can be a slow operation for large directories. It is usually better to split up directories with a lot of write activity into several subdirectories that are indexed by a hash or some similar system in order to reduce the number of times each individual directory has to be reread from disk.

Resource-Group Contention
GFS/GFS2 file systems are logically divided into several areas known as resource groups. The size of the resource groups can be controlled by the mkfs.gfs/mkfs.gfs2 command (-r parameter). The GFS/GFS2 mkfs program attempts to estimate an optimal size for your resource groups, but it might not be precise enough for optimal performance. If you have too many resource groups, the nodes in your cluster might waste unnecessary time searching through tens of thousands of resource groups trying to find one suitable for block allocation. On the other hand, if you have too few resource groups, each will cover a larger area, so block allocations might suffer from the opposite problem: too much time wasted in glock contention waiting for available resource groups. You might want to experiment with different resource-group sizes to find one that optimizes system performance.

Block-Size Considerations
When the file system is formatted with the mkfs.gfs/mkfs.gfs2 command, you may specify a block size with -b. If no size is specified, the default is 4K. Different block sizes will often provide different performance characteristics for your application. Most hardware is designed to operate efficiently with the default block size of 4K, and using the default 4K block size is recommended for all file systems. However, if there is a requirement for efficient storage of very small files, 1K should be considered the minimum block size (-b 1024). Ideal block size depends on how the file system is used, so you might want to experiment with different block sizes to find one that optimizes system performance (an example mkfs invocation appears at the end of this page).

How to Improve GFS/GFS2 File System Performance | Ruemker, Bradley, and Whitehouse 3
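For illustration only, the following is a hypothetical mkfs.gfs2 invocation showing the -r and -b parameters discussed above; the cluster name, file-system name, journal count, and sizes are placeholder values that should be chosen and tested for your own environment.

    # Format a GFS2 file system with 4 journals, 256MB resource groups,
    # and the default 4K block size (all values are examples only)
    mkfs.gfs2 -p lock_dlm -t mycluster:appdata -j 4 -r 256 -b 4096 /dev/clustervg/lv1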
MOUNT OPTIONS
Unless atime support is essential, Red Hat recommends setting noatime on every GFS/GFS2 mount point. This will significantly improve performance since it prevents reads from turning into writes (a sample /etc/fstab entry appears at the end of this page). With GFS2 on Red Hat Enterprise Linux 6 and later, there is also the option of relatime, which updates atime when other timestamps are being updated; however, noatime is still recommended.
Do not use the journaled data mode (chattr +j) unless it is required. The default ordered-data mode will prevent any uninitialized data from appearing in files after a crash. To ensure data is actually on disk, use fsync(2) for both the parent directory and newly created files.

ANSWERS TO COMMON QUESTIONS
If your cluster is running slowly or appears to be stopped and you are not sure why, the steps below should help to resolve the issue. Remember that Red Hat Support is always available to help.

First Steps
Begin by collecting the answers to several simple questions that Red Hat Support will ask when you contact them:
• What is the workload on the cluster?
• What applications are running, and are they using large or small files?
• How are the files organized?
• What is the architecture of the cluster? How many nodes? What is the storage, and how is it attached? How large is the file system(s)?
• What are the timing constraints? Does the issue always occur at a certain time of day or have some relationship with a particular event (for example, nightly backups)?
• Is the problem a performance problem (slow), a real bug (completely stuck, kernel panic, file-system assertion), or a corruption issue (usually indicated by a file-system withdraw)?
• Is the problem reproducible, or did it happen only once?
• Does the problem occur on a single node or in a cluster situation?
• Does the problem occur on the same node, or does it move around?
Of course, not every situation will require the same set of information, but the answers to these questions will give you a good idea of where to start looking for the root of the problem. If the problem always occurs at a specific time, look for periodic processes that might be running (not all are in crontab, but that is a good place to begin).

Is a Task Stuck or Just Slow?
It is often difficult to tell if a task is completely stuck or if it is just slow, but there are signs that point to poor performance resulting from contention for glocks. One example is increased network traffic. A significant amount of DLM network traffic indicates that there is a lot of locking activity and thus potentially a lot of cache invalidation.

How to Improve GFS/GFS2 File System Performance | Ruemker, Bradley, and Whitehouse 4
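A hypothetical /etc/fstab entry applying the noatime recommendation above to a GFS2 mount (device name and mount point are placeholders):

    # /etc/fstab
    /dev/clustervg/lv1  /mnt/appdata  gfs2  defaults,noatime  0 0

    # The option can also be added when mounting by hand
    mount -o noatime /dev/clustervg/lv1 /mnt/appdata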
How much locking traffic is normal depends on each individual situation, so locking activity should be assessed as a proportion of the total network bandwidth rather than measured against specific thresholds. Also, increased locking activity is only an indication of a problem and not a guarantee that one exists. The actual level of locking activity is highly dependent on the workload.
Information from glock dumps can be used to show whether tasks are still making progress. Take two glock dumps, spaced apart by a few seconds or a few minutes, and then look for glocks with a lot of waiting holders (ignoring granted holders). If the same glocks have exactly the same list of waiting holders in the second glock dump as they did in the first, it is an indication that the cluster has become stuck. If the list has changed at all, then the cluster is just running slowly. Sometimes taking a glock dump can take a long time due to the amount of data involved, which depends on the number of cached glocks (GFS/GFS2 keep a large number of glocks in memory for performance purposes). The time needed will change depending on the total memory size of the node in question and the amount of GFS/GFS2 activity that has taken place on that node.
GFS2 tracepoints (available in Red Hat Enterprise Linux 6 and later) can also be used to monitor activity in order to see whether any nodes are stuck.

Which Task is Involved in the Slowdown?
The glock dump file includes details of the tasks that have requested each glock (the owner of each holder). This information can be used to find out which task is stuck or involved in contention for a glock.

Which Inode is Contended?
Glock numbers are made up of two parts. In the glock dump, glock numbers are represented as type, number. Type 2 indicates an inode, and type 3 indicates a resource group. There are additional types of glocks, but the majority of slowdown and glock-contention issues will be associated with these two glock types.
The number of a type 2 glock (inode glock) indicates the disk location (in file-system blocks) of the inode and also serves as the inode identification number. You can convert the inode number listed in the glock dump from hexadecimal to decimal format and use it to track down the inode associated with that glock. Identifying the contended inode should be possible using find -inum, preferably on an otherwise idle file system since it will try to read all the inodes in the file system, making any contention problem worse (a short example appears at the end of this page).

Why Does gfs2_quotad Appear Stuck When I'm Not Even Using Quotas?
The gfs2_quotad process performs two jobs. One of those is related to quotas, and the other is updating the statfs information for the file system. If a problem occurs elsewhere in the file system, gfs2_quotad often becomes stuck since the periodic writes to update the statfs information can become queued behind other operations in the system. If gfs2_quotad appears stuck, it is usually a symptom of a different problem elsewhere in the file system.

Is It Worth Trying to Reproduce a Problem While Only a Single Node is Mounted?
In almost every case, you should try to reproduce a problem while only a single node is mounted. If the problem does reproduce on a single node, it is probably not related to clustering at all.

How to Improve GFS/GFS2 File System Performance | Ruemker, Bradley, and Whitehouse 5
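As a sketch of the lookup described above, assume a hypothetical type 2 glock number of 0x5f3a21 taken from a glock dump and a file system mounted at /mnt/appdata; run this on an otherwise idle file system, as noted above.

    # Convert the glock (inode) number from hexadecimal to decimal
    inum=$(printf '%d' 0x5f3a21)

    # Walk the file system looking for that inode number; this reads every
    # inode in the file system, so expect it to be slow and to add load
    find /mnt/appdata -inum "$inum" -print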
If the problem only appears in the cluster, it indicates either an I/O issue or a contention issue on one or more inodes in the cluster.

How Can I Calculate the Maximum Throughput of GFS/GFS2 on My Hardware?
Maximum throughput depends on the I/O pattern from the application, the I/O scheduler on each node, the distribution of I/O among the nodes, and the characteristics of the hardware itself. One simple example involves two nodes, each performing streaming writes to its own file on a GFS2 file system. This scenario can be simulated at the block-device level by creating two streams of I/O to different parts of the shared block device using dd (a sketch appears at the end of this page). This test will allow you to measure the absolute maximum performance that the hardware can sustain. Actual file-system performance will differ due to the overhead of block allocation and file-system metadata, but this test will provide an upper limit.
In this example, we are assuming that the block device is a single, shared, rotational hard disk that is able to write to only a single sector at once. The two streams of I/O will be sent to the disk by the I/O schedulers on the two different nodes, each without any knowledge of the other. The disk must perform scheduling in order to move the disk head between the two areas of the disk receiving the streams. This process will be slow, and it might even be slower than writing the two streams of I/O sequentially from a single node.
If, on the other hand, we assume that the block device in the example is a RAID array with many spindles, then the two streams of I/O may be written at the same time without having to move disk heads between the two areas. This will improve performance. Storage hardware must be specified according to the expected I/O patterns from the nodes so that it can be capable of delivering the level of performance that will support file-system requirements.

What About I/O Barriers?
Beginning with Red Hat Enterprise Linux 6, GFS2 uses I/O barriers by default when flushing the log. Red Hat recommends the use of I/O barriers for most block devices; however, barriers are not required in all cases and can sometimes be detrimental to performance, depending on how the storage device implements them. If the shared block device has no write cache, or if the write cache is not volatile (for example, if it is powered from a UPS or similar device), then you might wish to turn off barrier support.
You can prevent the use of I/O barriers by setting the nobarrier option with the mount command (or in /etc/fstab). If the underlying storage does not support them, I/O barriers will automatically be turned off, indicated by a log message and the appearance of the nobarrier option in /proc/mounts as if it had been set using the command line.
GFS2 only issues a single barrier each time it flushes the log. The total number of barriers issued over time can be minimized by reducing the number of operations that result in a log flush (for example, fsync(2) or glock contention) or (depending on workload) increasing the journal size. This has potential side-effects, so you should attempt to strike a balance between performance and the potential for data loss in the event of a node failure.

How to Improve GFS/GFS2 File System Performance | Ruemker, Bradley, and Whitehouse 6
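A minimal sketch of the block-device test described above, assuming a hypothetical shared device /dev/mapper/sharedlv. WARNING: this writes directly to the block device and destroys any data on it, so it is only suitable before the file system is created or on a scratch LUN.

    # Run on node A (writes the first 4GiB region of the device)
    dd if=/dev/zero of=/dev/mapper/sharedlv bs=1M count=4096 seek=0 oflag=direct

    # Run on node B at the same time (writes a region starting 8GiB in)
    dd if=/dev/zero of=/dev/mapper/sharedlv bs=1M count=4096 seek=8192 oflag=direct

Comparing the combined throughput of the two concurrent runs with a single-node run gives an indication of how well the storage handles concurrent streams from multiple nodes.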
Can I Use Discard Requests for Thin-Provisioning?
On Red Hat Enterprise Linux 6 and later, GFS2 supports the generation of discard requests. These requests allow the file system to tell the underlying block device which blocks are no longer required due to deallocation of a file or directory. Sending a discard request implies an I/O barrier. There is a small performance penalty when these requests are generated by GFS2, and the performance penalty might be larger depending on how the underlying storage device interprets the requests. In order to increase performance and decrease overhead, GFS2 saves up as many requests as possible and merges them into a single request whenever it is able.
In order to turn on this feature, you need to set the discard option with the mount command. This feature will only work when both the volume manager and the underlying storage device support the requests. Some storage devices might consider the requests a suggestion rather than a requirement (for example, if the file system requests a discard of a single block, but the underlying storage is only able to discard larger chunks of storage). Other storage devices might not deallocate any storage at all, but they might use the suggestion to implement a secure delete by zeroing out the selected blocks.

Are There Benefits to Using Solid-State Storage with GFS/GFS2?
In general, solid-state storage has a much lower seek time, which can improve overall system performance, particularly where glock contention is a major factor.

Is Network Traffic a Major Factor in GFS/GFS2 Performance?
Network traffic should not greatly affect overall system performance, provided that there is enough bandwidth for the cluster infrastructure to keep quorum and synchronize any POSIX locks, that multicast is working between all the nodes, and that DLM is able to communicate its lock requests. However, if a network device is shared between storage and/or application traffic as well as cluster traffic, Red Hat recommends using tc to implement suitable bandwidth limits. The use of jumbograms is not recommended unless the network is carrying storage traffic.
Latency is generally regarded as more important than overall throughput in terms of GFS/GFS2 performance. Watching traffic levels to ensure that the links do not saturate is a sensible policy since that can be a warning sign of other issues (such as contention), but traffic statistics otherwise do not provide a great deal of useful information.

What if I Have a Different Problem?
Red Hat Support representatives are available to help if you cannot find the answer to your question here. If you have experienced a kernel Oops, assertion, or similar problem, contact Red Hat Support immediately.
If you have experienced a file-system withdraw, it is almost always due to corruption of the file system, and fsck.gfs/fsck.gfs2 can usually fix the problem. Unmount the file system on all nodes, take a backup, and then run fsck on the file system (a sketch of this procedure appears below). Keep any output from fsck, as it will be needed in the event that fsck cannot fix the problem. Contact Red Hat Support if fsck fails to resolve the problem.

How to Improve GFS/GFS2 File System Performance | Ruemker, Bradley, and Whitehouse 7
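A minimal sketch of that recovery procedure for a GFS2 file system, assuming a hypothetical device /dev/clustervg/lv1 and mount point /mnt/appdata; take a backup of the device before running the repair, and keep the log for Red Hat Support.

    # 1. Unmount the file system on every node in the cluster
    umount /mnt/appdata

    # 2. On one node only, run the file-system check and keep the output
    #    (-y answers yes to all repair prompts; omit it to review each one)
    fsck.gfs2 -y /dev/clustervg/lv1 2>&1 | tee /root/fsck-appdata-$(date +%F).log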
APPLICATIONS

Email (imap/sendmail/etc.)
Locality of access often causes problems when email is run on clustered file systems. To optimize performance, Red Hat recommends arranging for users to have a "home" node to which each user connects by default, assuming that all the cluster nodes are working normally. If a problem occurs in the cluster, users can then be moved to another home node where all of their files are cached. This reduces cross-node invalidation. It is also possible to use the same technique on the delivery side by setting up the MTA to forward to the user's home node and letting that node write to the file system. If that node isn't available, the MTA can write the message directly.
Using maildir instead of mbox also helps scalability since each message, rather than the whole mailbox, has its own lock. When performance issues occur in maildir setups, they are almost always the result of contention on the directory lock.

Backup
Backup can affect performance since the process of backing up a node or set of nodes usually involves reading the entire file system in sequence. If a single node performs the backup, that node will retain all that information in cache until other nodes in the cluster start requesting locks. Running this type of backup program while the cluster is in operation is a sure way to reduce system performance, but there are a number of ways to reduce the detrimental effect.
One way is to drop the caches using echo -n 3 >/proc/sys/vm/drop_caches after the backup has completed. This reduces the time required by other nodes to get their glocks/caches back. However, this method is not ideal because the other nodes will have stopped caching the data that they were caching before the backup process started. Also, there is an effect on the overall cluster performance while the backup is running, which is often not acceptable.
Another method is to back up at the block-device level and take a snapshot. This is currently only supported in cases where there is provision for a snapshot at the hardware (storage array) level.
A better solution is to back up the working set of each cluster node from the node itself. This distributes the workload of the backup across the cluster and keeps the working set in cache on the nodes in question. This often requires custom scripting (a sketch appears at the end of this page).

Web Servers
GFS/GFS2 file systems are well suited to web-server workloads since serving web pages tends to involve a large amount of data that can be cached on all nodes. Issues can arise when data has to be updated, but you can reduce the potential for contention by preparing a new copy of the website and switching over rather than trying to update the files in place. Red Hat recommends making the root of the website a bind mount and using mount --move to have the web server(s) use a new set of files.

How to Improve GFS/GFS2 File System Performance | Ruemker, Bradley, and Whitehouse 8
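One way to script the per-node backup and post-backup cache drop described above; the rsync destination and working-set path are placeholders, and note that dropping caches affects every file system on the node, not just GFS/GFS2.

    #!/bin/bash
    # Hypothetical per-node backup wrapper: back up this node's own working
    # set, then release the pages and glocks the backup pulled into cache
    rsync -a /mnt/appdata/worksets/$(hostname -s)/ backuphost:/backups/$(hostname -s)/

    # Flush dirty data first, then drop page cache, dentries, and inodes
    sync
    echo -n 3 > /proc/sys/vm/drop_caches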
SYSTEM CALLS

Read/Write
Read/write performance should be acceptable for most applications, provided you are careful not to cause too many cross-node accesses that require cache sync and/or invalidation.
Streaming writes on GFS2 are currently slower than on GFS. This is a direct consequence of the locking hierarchy and results from GFS2 performing writes on a per-page basis like other (local) file systems. Each page written has a certain amount of overhead. Due to the different lock ordering, GFS does not suffer from the same problem since it is able to perform the overhead operations once for multiple pages. Speed of multiple-page write calls aside, there are many advantages to the GFS2 file system, including faster performance for cached reads and simpler code for deadlock avoidance during complicated write calls (for example, when the source page being written is from a memory-mapped file on a different file system type). Red Hat is currently working to allow multiple-page writes, which will make GFS2's streaming write calls faster than the equivalent GFS operation. This streaming-writes issue is the only known regression in speed between GFS and GFS2. Smaller writes (page sized and below) on GFS2 are faster than on GFS.

Memory Mapping
GFS and GFS2 implement memory mapping differently. In GFS (and some earlier GFS2 kernels), a page fault on a writable shared mapping would always result in an exclusive lock being taken for the inode in question. This is a consequence of an optimization that was originally introduced for local file systems, where pages would be made writable on the initial page fault in order to avoid a potential second fault later (if the first access was a read and a subsequent access was a write). In Red Hat Enterprise Linux 6 (and some later versions of Red Hat Enterprise Linux 5) kernels, GFS2 has implemented a system of only providing a read-only mapping for read requests, significantly improving scalability. A file that is mapped on multiple nodes of a GFS2 cluster in a shared writable manner can be cached on all nodes, provided no writes occur.
NOTE: While in theory you can use the feature of shared writable memory mapping on a single shared file to implement distributed shared memory, any such implementation would be very slow due to cache-bouncing issues. This is not recommended. Sharing a read-only file in this way is acceptable, but only on recent kernels with GFS2. If you need to share a file in this way on GFS, then open and map it read only to avoid locking issues.

Cache Control (fsync/fadvise/madvise)
Both GFS and GFS2 support fsync(2), which functions the same way as in any local file system. When using fsync(2) with numerous small files, Red Hat recommends sorting them by inode number (a sketch appears at the end of this page). This will improve performance by reducing the disk seeks required to complete the operation. If it is possible to save up fsync(2) on a set of files and sync them all back together, it will help performance when compared with using either O_SYNC or fsync(2) after each write.
To improve performance with GFS2, you can use the fadvise and/or madvise pair of system calls to request read ahead or cache flushing when it is known that data will not be used again (GFS does not support the fadvise/madvise interface).

How to Improve GFS/GFS2 File System Performance | Ruemker, Bradley, and Whitehouse 9
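A sketch of the inode-number ordering recommended above, using GNU find; the spool directory is a placeholder, and the loop only prints the order in which an application would then write and fsync(2) each file.

    # List the files ordered by inode number (field 1 is the inode,
    # the rest of the line is the path)
    find /mnt/appdata/spool -type f -printf '%i %p\n' | sort -n | cut -d' ' -f2- |
    while read -r f; do
        # An application would process and fsync each file in this order
        # to reduce disk seeks; here the order is simply printed.
        echo "$f"
    done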
Overall performance can be significantly improved by flushing the page cache for an inode when it will not be used from a particular node again and is likely to be requested by another node.
It is also possible to drop caches globally. Using echo -n 3 >/proc/sys/vm/drop_caches will drop the caches for all file systems on a node and not just GFS/GFS2. This can be useful when you have a problem that might be caused by caching and you want to run a cache cold test, for example. However, it should not be used in the normal course of operation (see the Backup section above).

File Locking
The locking methods below are only recommendations, as GFS and GFS2 do not support mandatory locks.

flock
The flock system call is implemented by type 6 glocks and works across the cluster in the normal way. It is affected by the localflocks mount option, as described below. Flocks are a relatively fast method of file locking and are preferred to fcntl locks on performance grounds (the difference becomes greater on clusters with larger node counts). A small example appears at the end of this page.

fcntl (POSIX Locks)
POSIX locks have been supported on a single-node basis since the early days of GFS. The ability to use these locks in a clustered environment was added in Red Hat Enterprise Linux 5. Unlike the other GFS/GFS2 locking implementations, POSIX locks do not use DLM and are instead performed in user space via corosync. By default, POSIX locks are rate limited to a maximum of 100 locks per second in order to conserve network bandwidth that might otherwise be flooded with POSIX-lock requests. To raise the limit, you can edit the cluster.conf file (setting the limit to 0 removes it altogether).
NOTE: Some applications using POSIX locks might try to use the F_GETLK fcntl to obtain the PID of a blocking process. This will work on GFS/GFS2, but due to clustering, the process might not be on the same node as the process that used F_GETLK. Sending a signal is not as straightforward in this case, and care should be taken not to send signals to the wrong processes.
It is possible to make use of POSIX locks on a single-node basis by setting the localflocks mount option. This also affects flock(2), but it is not usually a problem since it is unusual to require both forms of locking for a single application.
NOTE: localflocks must be set for all NFS-exported GFS2 file systems, and the only supported NFS-over-GFS/GFS2 solutions are those with only a single active NFS server at a time, designed for active/passive failover. NFS is not currently supported in combination with either Samba or local applications.
Due to the user-space implementation of POSIX locks, they are not suitable for high-performance locking requirements.

Leases
Leases are not supported on either GFS or GFS2.

How to Improve GFS/GFS2 File System Performance | Ruemker, Bradley, and Whitehouse 10
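Because flock(2) on GFS/GFS2 is cluster-wide (unless localflocks is set), the util-linux flock(1) utility can be used to serialize a job across all nodes through a lock file on the shared file system. A minimal sketch with hypothetical paths:

    # Only one copy of report.sh runs at a time across the whole cluster,
    # because the flock on the shared lock file is coordinated by GFS2
    flock /mnt/appdata/locks/report.lock -c '/usr/local/bin/report.sh'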
DLM
There is no reason why applications should not make use of the DLM directly. Interfaces are available, and details can be found in the DLM documentation.

RECOMMENDED TUNABLE SETTINGS
The following sections describe recommended values for GFS tunable parameters.

glock_purge
In Red Hat Enterprise Linux 4.6/5.1 and later, a GFS tunable parameter, glock_purge, has been added to reduce the total number of locks cached for a particular file system on a cluster node.
NOTE: This setting does not exist in Red Hat Enterprise Linux 6 or later, and it is not a recommended solution to any problem for which there is another solution. In Red Hat Enterprise Linux 6 and later, this parameter is self tuning, and caching can be controlled from user space via the fsync/fadvise system calls as described earlier in this document.
This tunable parameter defines the percentage of unused glocks for a file system to clear every five seconds, as shown below, where X is an integer between 0 and 100 indicating the percentage to clear.
# gfs_tool settune /path/to/mount glock_purge X
A setting of 0 disables glock_purge. This value is typically set somewhere between 30 and 60 to start and can be further tuned based on testing and performance benchmarks. The setting is not persistent, so it must be reapplied every time the file system is mounted. It is typically placed in /etc/rc.local, or in the start function of /etc/init.d/gfs, on every node so that it is applied at boot time after the file systems are mounted (a sample fragment appears at the end of this page).

demote_secs
Another tunable parameter, demote_secs, can be used in conjunction with glock_purge. This tunable parameter demotes GFS write locks into less restricted states and subsequently flushes the cached data to disk. A shorter demote period can be used to avoid accumulating too much cached data, which would otherwise result in burst-mode flushing activity and prolong another node's lock access.
NOTE: This setting does not exist in Red Hat Enterprise Linux 6 or later, and it is not a recommended solution to any problem for which there is another solution. In Red Hat Enterprise Linux 6 and later, this parameter is self tuning, and caching can be controlled from user space via the fsync/fadvise system calls as described earlier in this document.
The default value is 300 seconds. To enable demoting every 200 seconds on mount point /mnt/gfs1, enter the following command:
# gfs_tool settune /mnt/gfs1 demote_secs 200
To set it back to the default of 300 seconds, enter the following command:
# gfs_tool settune /mnt/gfs1 demote_secs 300
Note that this setting only applies to an individual file system, so multiple commands must be used to apply it to more than one mount point.

How to Improve GFS/GFS2 File System Performance | Ruemker, Bradley, and Whitehouse 11
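A hypothetical /etc/rc.local fragment that reapplies both tunables at boot for two GFS mount points; the mount points and the values 50 and 200 are examples only and should be tuned through testing (Red Hat Enterprise Linux 4.6/5.x only).

    # /etc/rc.local -- reapply GFS tunables after the cluster file systems mount
    for mp in /mnt/gfs1 /mnt/gfs2; do
        gfs_tool settune "$mp" glock_purge 50
        gfs_tool settune "$mp" demote_secs 200
    done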
statfs_fast
The statfs_fast tunable parameter can be used in Red Hat Enterprise Linux 4.5 or later to speed up statfs calls on GFS.
NOTE: For Red Hat Enterprise Linux 6 and later, this can be set via the mount command line using the statfs_quantum and statfs_percent mount arguments. This is the preferred method since it is then set at mount time and does not require a separate tool to change it.
To enable statfs_fast, enter the following command:
# gfs_tool settune /path/to/mount statfs_fast 1

Red Hat recommends the use of mount options noquota, noatime, and nodiratime for GFS file systems, if possible, as they are known to improve performance in many cases. They can be added in /etc/fstab, as shown below.
/dev/clustervg/lv1 /mnt/appdata gfs defaults,noquota,noatime,nodiratime 0 0
NOTE: An issue has been reported to Red Hat Engineering regarding the usage of noquota in Red Hat Enterprise Linux 5.3: Why do I get a mount error reporting 'Invalid argument' on my GFS or GFS2 file system on Red Hat Enterprise Linux 5.3?.

Disabling ls Colors
It might also be beneficial to remove the aliases for the ls command that cause it to display colors in its output when using the bash or csh shells. By default, Red Hat Enterprise Linux systems are configured with the following aliases from /etc/profile.d/colorls.sh and colorls.csh:
# alias | grep 'ls'
alias l.='ls -d .* --color=tty'
alias ll='ls -l --color=tty'
alias ls='ls --color=tty'
In situations where a GFS file system is slow to respond, the first response of many users is to run ls in order to determine the problem. If the --color option is enabled, ls must run a stat() against every entry, which creates additional lock requests and can create contention for those files with other processes. This might exacerbate the problem and cause slower response times for processes accessing that file system.
To prevent ls from using the --color=tty option for all users, the following lines can be added to the end of /etc/profile:
alias ll='ls -l' 2>/dev/null
alias l.='ls -d .*' 2>/dev/null
unalias ls
These lines can also be placed in a user's ~/.bash_profile to disable --color=tty on an individual basis. In general, however, it is best to avoid excessive use of the ls command due to the locking overhead.

How to Improve GFS/GFS2 File System Performance | Ruemker, Bradley, and Whitehouse 12

Copyright © 2011 Red Hat, Inc. “Red Hat,” Red Hat Linux, the Red Hat “Shadowman” logo, and the products listed are trademarks of Red Hat, Inc., registered in the U.S. and other countries. Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries. www.redhat.com