Parallel File System for Linux Clusters
1. ABSTRACT
The trend in parallel computing is to move away from traditional, specialized supercomputing
platforms, such as the Cray Jaguar, Cray Titan, and IBM Summit, toward cheaper, general-purpose
systems consisting of loosely coupled components built from single-processor or multiprocessor
PCs or workstations.
This approach has a number of advantages, including the ability to build, for a given budget,
a platform that is suitable for a large class of applications and workloads.
Linux clusters have matured as platforms for low-cost, high-performance parallel computing,
especially in areas such as message passing and networking.
Parallel file systems are a critical piece of any Input/output (I/O)-intensive high-performance
computing system.
A parallel file system enables each process on every node to perform I/O to and from a
common storage target. With more and more sites adopting Linux clusters for high-performance
computing, the need for high-performance I/O on Linux is increasing.
2. INTRODUCTION
Parallel File System
A parallel file system is a software component designed to store data across multiple
networked servers and to facilitate high-performance access through simultaneous,
coordinated input/output (I/O) operations between multiple clients and storage nodes.
IOPS (input/output operations per second, pronounced EYE-OPS) is the standard unit of
measurement for the maximum number of reads and writes per second to non-contiguous
storage locations.
IOPS is frequently referenced by storage vendors to characterize performance in solid-state
drives (SSD), hard disk drives (HDD) and storage area networks.
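As a rough back-of-the-envelope check, an IOPS rating together with a block size implies an upper bound on throughput. The following minimal Python sketch uses illustrative figures, not the specification of any particular drive:

```python
def implied_throughput_mb_s(iops, block_size_kb):
    """Upper-bound throughput implied by an IOPS rating, assuming every
    operation transfers one full block of block_size_kb kilobytes."""
    return iops * block_size_kb / 1024  # KB/s -> MB/s

# A hypothetical drive rated at 100,000 IOPS with 4 KB blocks:
print(implied_throughput_mb_s(100_000, 4))  # 390.625 (MB/s)
```

Real devices deviate from this bound with queue depth and access pattern, which is why vendors quote IOPS and bandwidth separately.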
A parallel file system breaks up a data set and distributes, or stripes, the blocks to multiple
storage drives, which can be located in local and/or remote servers.
Disk striping is the process of dividing a body of data into blocks and spreading those
blocks across multiple storage devices, such as hard disks or solid-state drives (SSDs). A
stripe is the set of blocks spread across the drives; a stripe unit, or strip, is the slice
of data that lands on an individual drive.
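The round-robin placement described above can be sketched in a few lines of Python; the strip size and drive count are arbitrary example values:

```python
STRIP_SIZE = 64 * 1024  # bytes per strip on one drive (example value)
NUM_DRIVES = 4          # drives in the stripe set (example value)

def locate(offset):
    """Map a byte offset within a file to (drive index, offset on that
    drive), assuming strips are dealt round-robin across the drives."""
    strip = offset // STRIP_SIZE          # which strip holds this byte
    drive = strip % NUM_DRIVES            # strips rotate across drives
    local = (strip // NUM_DRIVES) * STRIP_SIZE + offset % STRIP_SIZE
    return drive, local

print(locate(0))              # (0, 0): first strip, first drive
print(locate(5 * 64 * 1024))  # (1, 65536): sixth strip, second pass over drive 1
```

Because the mapping is pure arithmetic, any client can compute where a byte lives without consulting the drive that stores it.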
Users do not need to know the physical location of the data blocks to retrieve a file. The
system uses a global namespace to facilitate data access. Parallel file systems often use
a metadata server to store information about the data, such as the file name, location and
owner.
Global namespace is a feature that simplifies storage management in environments that have
numerous physical file systems.
A global namespace provides a consolidated view into multiple Network File System (NFS),
Common Internet File System (CIFS), and network-attached storage (NAS) systems or file
servers in different physical locations. This is particularly beneficial in distributed
implementations with unstructured data and in fast-growing environments, because data
can be accessed without needing to know where it physically resides. Without a
global namespace, these multiple file systems would have to be managed separately.
Metadata is data that describes other data. Meta is a prefix that in most information
technology usages means "an underlying definition or description."
Metadata summarizes basic information about data, which can make finding and working
with instances of data easier. Author, date created, date modified, and file size are
examples of very basic document metadata.
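A metadata server can be pictured as a lookup table keyed by file name. The sketch below is purely illustrative: the field names, the example path, and the "oss" storage-node names are invented for the example, not taken from any real file system:

```python
# Toy metadata table: the metadata server records attributes and block
# placement; clients consult it once, then contact storage nodes directly.
metadata = {
    "/climate/run42.nc": {
        "owner": "alice",
        "size_bytes": 8 * 1024 ** 3,
        "block_nodes": ["oss1", "oss2", "oss3", "oss4"],  # stripe targets
    },
}

def locate_blocks(path):
    """Return the storage nodes holding a file's data blocks."""
    return metadata[path]["block_nodes"]

print(locate_blocks("/climate/run42.nc"))  # ['oss1', 'oss2', 'oss3', 'oss4']
```

Keeping this table on a dedicated server is what lets every client share one global namespace while data transfers go straight to the storage nodes.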
A parallel file system reads and writes data to distributed storage devices using multiple I/O
paths concurrently, as part of one or more processes of a computer program. The coordinated
use of multiple I/O paths can provide a significant performance benefit, especially for
streaming workloads that involve a large number of clients.
Capacity and bandwidth can be scaled to accommodate enormous quantities of data. Storage
features may include high availability, mirroring, replication and snapshots.
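The benefit of multiple concurrent I/O paths can be sketched with threads standing in for network requests. The in-memory `storage` dictionary and the "oss" node names below are stand-ins for real storage servers, assumed for the example:

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for remote storage nodes; a real client would issue
# network reads to storage servers here.
storage = {
    "oss0": {0: b"AAAA", 1: b"CCCC"},
    "oss1": {0: b"BBBB", 1: b"DDDD"},
}
# Read plan in file order: (node, strip id) pairs from the metadata server.
plan = [("oss0", 0), ("oss1", 0), ("oss0", 1), ("oss1", 1)]

def parallel_read(plan):
    """Issue all strip reads concurrently over separate I/O paths, then
    reassemble the strips in file order."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(lambda n, s: storage[n][s], node, sid)
                   for node, sid in plan]
        return b"".join(f.result() for f in futures)

print(parallel_read(plan))  # b'AAAABBBBCCCCDDDD'
```

Each strip read can proceed on a different path to a different node, which is where the aggregate bandwidth of a parallel file system comes from.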
3. COMMON USE CASES OF PARALLEL FILE SYSTEMS
Parallel file systems historically have targeted high-performance computing (HPC)
environments that require access to large files, massive quantities of data or simultaneous
access from multiple compute servers. Applications include climate modeling, computer-
aided engineering, exploratory data analysis, financial modeling, genomic sequencing,
machine learning and artificial intelligence, seismic processing, video editing and visual
effects rendering.
Users of parallel file systems span national laboratories, government agencies and
universities, as well as industries such as financial services, life sciences, manufacturing,
media and entertainment, and oil and gas.
Parallel file system implementations may span thousands of server nodes and manage
petabytes or even exabytes of data. Users typically deploy high-speed networking, such as
high-speed Ethernet, InfiniBand, or proprietary interconnects, to optimize the I/O path
and enable greater bandwidth.
4. PARALLEL FILE SYSTEM VS. DISTRIBUTED FILE SYSTEM
A parallel file system is a type of distributed file system. Both distributed and parallel file
systems can spread data across multiple storage servers, scale to accommodate petabytes of
data, and support high bandwidth.
Distributed file systems typically support a shared global namespace, as parallel file systems
do. But with a distributed file system, all client systems accessing a given portion of the
namespace generally go through the same storage node to access the data and metadata, even
if parts of the file are stored on other servers. With a parallel file system, the client systems
have direct access to all storage nodes for data transfer without having to go through a single
coordinating server.
Additional distinctions may include:
- A distributed file system generally uses a standard network file access protocol, such
  as NFS or SMB, to access a storage server. A parallel file system generally requires
  the installation of client-side software drivers to access the shared storage over
  high-speed networks such as Ethernet, InfiniBand, and Omni-Path.
- A distributed file system often stores a file on a single storage node, whereas a
  parallel file system generally breaks up the file and stripes the data blocks across
  multiple storage nodes.
- Distributed file system deployments can store data on the application servers or on
  centralized servers, while typical parallel file system deployments separate the
  compute and storage servers for performance reasons.
- Distributed file systems tend to target loosely coupled, data-heavy applications or
  active archives. Parallel file systems focus on high-performance workloads that can
  benefit from coordinated I/O access and significant bandwidth.
- Distributed file systems often use techniques such as three-way replication or erasure
  coding to provide fault tolerance in software, whereas many parallel file systems
  run on shared storage.
5. EXAMPLE OF PARALLEL FILE SYSTEM
Parallel Virtual File System (PVFS)
PVFS is an open source file system for Linux-based clusters developed and supported by the
Parallel Architecture Research Laboratory at Clemson University and the Mathematics and
Computer Science Division at Argonne National Laboratory.
IBM General Parallel File System (GPFS)
IBM General Parallel File System (IBM GPFS, now marketed as IBM Spectrum Scale) is a file
system used to distribute and manage data across multiple servers, and is implemented in
many high-performance computing and large-scale storage environments.
Lustre
Lustre is a parallel file system, generally used for large-scale cluster computing. The
name Lustre is a portmanteau of Linux and cluster. Lustre file system software is available
under the GNU General Public License and provides high-performance file systems for
computer clusters ranging in size from small workgroup clusters to large-scale, multi-site
clusters.
6. LINUX CLUSTERS
Linux is a free, open source operating system for computers, originally developed in
1991 by Linus Torvalds, then a Finnish undergraduate student.
An operating system is an interface between the user of a computer and the computer
hardware. It is a collection of software that manages computer hardware resources and
provides common services for computer programs.
The open source nature of Linux means that the source code for the Linux kernel is freely
available so that anyone can add features and correct deficiencies. The open source approach
has not just successfully been applied to kernel code, but also to application programs for
Linux.
As Linux has become more popular, several different development streams, or distributions,
have emerged, e.g., Red Hat, SUSE, Debian, and Ubuntu. A distribution comprises a
pre-packaged kernel, system utilities, GUI interfaces, and application programs.
ARCHITECTURE OF THE LINUX OS:
The Linux operating system's architecture primarily comprises these components: the
kernel, the hardware layer, system libraries, the shell, and system utilities.
The kernel is the core part of the operating system and is responsible for all the major
activities of the Linux operating system.
System libraries are special functions and programs through which applications access the
operating system's features; they do not require the code access rights of kernel modules.
System utility programs are responsible for performing individual, specialized tasks.
The hardware layer of the Linux operating system consists of physical devices such as the
RAM, HDD, and CPU.
The shell is an interface between the user and the kernel that exposes the kernel's
services. It takes commands from the user and executes the kernel's functions. Shells,
which are present in many operating systems, are classified into two types:
1. command-line shells
2. graphical shells.
A Linux cluster is a collection of independent computer systems, each running the same
Linux operating system, that work together as if they were a single system, coupled
through a scalable, high-bandwidth, low-latency interconnect.
7. FUNCTIONS OF A PARALLEL FILE SYSTEM IN LINUX CLUSTER
- Allow data stored in a single file to be physically distributed among I/O resources in
  the cluster.
- Any server in the cluster can access any block of storage managed by the cluster. This
  allows the file system to break large files into blocks and to stripe those blocks
  across different storage arrays to improve I/O performance.
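Breaking a large file into blocks and dealing them across storage arrays can be sketched as follows; the array names and block size are placeholders chosen for the example:

```python
def stripe(data, block_size, arrays):
    """Split data into fixed-size blocks and deal them round-robin across
    the given storage arrays (a minimal sketch, not a real file system)."""
    placement = {a: [] for a in arrays}
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for i, block in enumerate(blocks):
        # Each (index, block) pair lands on the next array in rotation.
        placement[arrays[i % len(arrays)]].append((i, block))
    return placement

print(stripe(b"abcdefgh", 2, ["array1", "array2"]))
# {'array1': [(0, b'ab'), (2, b'ef')], 'array2': [(1, b'cd'), (3, b'gh')]}
```

Because every block carries its index, the file can be reassembled in order no matter which server each block is read from.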
8. CONCLUSION
Parallel file systems enhance the performance of Linux clusters. A parallel file system
for a Linux cluster is designed to optimize the use of storage resources. Parallel file
systems are under continual development and will continue to evolve with increasing
functionality and performance.