A Survey of Clustered Parallel
    File Systems for High
  Performance Computing
           Clusters
                 James W. Barker, Ph. D.
              Los Alamos National Laboratory
   Computer, Computational and Statistical Sciences Division



                       Los Alamos National Laboratory
Definition of Terms
●   Distributed File System - The generic term for a client/server or
    "network" file system where the data is not locally attached to a host.
    ●   Network File System (NFS) is the most common distributed file system
        currently in use.
●   Storage Area Network (SAN) File System – Provides a
    means for hosts to share Fibre Channel storage, which is
    traditionally separated into private physical areas bound to different
    hosts. A block-level metadata manager manages access to the different
    SAN devices. A SAN file system mounts storage natively on only
    one node and connects all other nodes to that storage by distributing
    the storage's block addresses to them.
    ●   Scalability is often an issue due to the significant workload required of
        the metadata managers and the large network transactions required in
        order to access data.
    ●   Examples include: IBM’s General Parallel File System (GPFS) and
        Sistina (now Red Hat) Global File System (GFS)



                           Los Alamos National
                           Laboratory
Definition of Terms
●   Symmetric File Systems - A symmetric file system is one in
    which the clients also host the metadata manager code, resulting in
    all nodes understanding the disk structures.
    ●   A concern with these systems is the burden that metadata management
        places on the client node, serving both itself and other nodes, which can
        impact the ability of the client node to perform its intended
        computational jobs.
    ●   Examples include IBM’s GPFS and Red Hat GFS
●   Asymmetric File Systems - An asymmetric file system is a file
    system in which there are one or more dedicated metadata
    managers that maintain the file system and its associated disk
    structures.
    ●   Examples include Panasas ActiveScale, Lustre and traditional NFS file
        systems.



                           Los Alamos National
                           Laboratory
Definition of Terms
●   Cluster File System - a distributed file system that is not a single server
    with a set of clients, but a cluster of servers that all work together to provide
    high performance storage service to their clients.
     ●   To the clients the cluster file system is transparent; it is simply "the file system",
         while the file system software distributes requests to the elements of the
         storage cluster.
     ●   Examples include: Hewlett-Packard Tru64 cluster and Panasas ActiveScale
●   Parallel File System - a parallel file system is one in which data blocks
    are striped, in parallel, across multiple storage devices on multiple storage
    servers. Parallel applications are supported by allowing all nodes to
    access the same files at the same time, providing concurrent read
    and write capabilities (see the striping sketch at the end of this slide).
     ●   Network Link Aggregation, another parallel file system technique, is the
         technology used by PVFS2, in which the I/O is spread across several network
         connections in parallel, each packet taking a different link path from the previous
         packet.
     ●   Examples of this include: Panasas ActiveScale, Lustre, PVFS2, GPFS and GFS.
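To make striping concrete, the following sketch maps a byte offset in a striped file to a storage device index and an offset within that device's piece of the file. It assumes a fixed stripe size and simple round-robin placement; real parallel file systems use their own layout algorithms, so this is illustrative only.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical round-robin striping layout -- an illustration of the
     * concept, not any particular product's algorithm. */
    typedef struct {
        uint32_t device_index;   /* which storage device holds the byte */
        uint64_t device_offset;  /* offset within that device's piece of the file */
    } stripe_location;

    static stripe_location locate(uint64_t file_offset,
                                  uint64_t stripe_size,
                                  uint32_t stripe_count)
    {
        uint64_t stripe_number = file_offset / stripe_size;  /* global stripe index */
        stripe_location loc;
        loc.device_index  = (uint32_t)(stripe_number % stripe_count);
        /* Each device sees every stripe_count-th stripe, packed contiguously. */
        loc.device_offset = (stripe_number / stripe_count) * stripe_size
                            + (file_offset % stripe_size);
        return loc;
    }

    int main(void)
    {
        /* Example: 64 KiB stripes across 4 devices. */
        stripe_location loc = locate(300 * 1024, 64 * 1024, 4);
        printf("device %u, offset %llu\n",
               (unsigned)loc.device_index, (unsigned long long)loc.device_offset);
        return 0;
    }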




                                Los Alamos National
                                Laboratory
Definition of Terms
●   An important note: all of the above definitions overlap. A SAN file
    system can be symmetric or asymmetric. Its servers may be
    clustered or single servers. And it may support parallel applications
    or it may not.
    ●   For example, the Panasas Storage Cluster and its ActiveScale File
        System (a.k.a. PanFS) is a clustered (many servers share the work),
        asymmetric (metadata management does not occur on the clients),
        parallel (supports concurrent reads and writes), object-based (not
        block-based), distributed (clients access storage via the network) file system.
    ●   Another example: the Lustre File System is also a clustered,
        asymmetric, parallel, object-based (objects are referred to as targets by Lustre),
        distributed file system.
    ●   A third example: the Parallel Virtual File System 2 (PVFS2) is a
        clustered, symmetric, parallel, aggregation-based, distributed file
        system.
    ●   Finally, the Red Hat Global File System (GFS) is a clustered,
        symmetric, parallel, block-based, distributed file system.



                           Los Alamos National
                           Laboratory
Object Storage Components
●   An Object contains the data and enough additional information to allow the
    data to be autonomous and self-managing.
●   An Object-based Storage Device (OSD) is an intelligent evolution of the disk
    drive, capable of storing and serving objects rather than simply copying data
    to tracks and sectors. (The term OSD does not exist in Lustre.)
     ●   The term OSD in Panasas corresponds to the term OST in Lustre.
     ●   An Object-based Storage Target (OST) is an abstraction layer above the physical
         blocks of a physical disk (in Panasas terminology, not in Lustre).
     ●   An Object-Based Disk (OBD) is an abstraction of the physical blocks of the
         physical disks (in Lustre terminology; OBDs do not exist in Panasas
         terminology).
●   An Installable File System (IFS) integrates with compute nodes, accepts
    POSIX file system commands and data from the operating system,
    addresses the OSDs directly, and stripes objects across multiple OSDs.
●   A Metadata Server mediates among the multiple compute nodes in the
    environment, allowing them to share data while maintaining cache
    consistency on all nodes.
●   The Network Fabric ties the compute nodes to the OSD’s and metadata
    servers.



                             Los Alamos National
                             Laboratory
Storage Objects
●   Each file or directory can be thought of as an object. As
    with all objects, storage objects have attributes.
●   Storage object attributes record values such as file type,
    file location, whether the data is striped or not, ownership,
    and permissions.
    ●   An object storage device (OSD) allows us to specify for each file
        where to store the blocks allocated to the file, via a metadata
        server and object storage targets.
●   Extending the storage attributes further, a file can also
    specify how many object storage targets to stripe onto
    and what level of redundancy to employ on those targets
    (see the sketch at the end of this slide).
    ●   Some implementations (Panasas) allow the specification of RAID
        0 (striped) or RAID 1 (mirrored) on a per-file basis.
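As an illustration of what such per-file layout attributes amount to, the sketch below collects a stripe count, stripe size, and RAID level into one record. The field names and types are assumptions for illustration, not any product's actual attribute format.

    #include <stdint.h>

    /* Hypothetical per-file layout attributes, illustrating the idea of
     * recording striping and redundancy policy with each file. */
    enum redundancy { RAID0_STRIPED, RAID1_MIRRORED };

    struct file_layout_attrs {
        uint32_t        stripe_count; /* number of object storage targets to stripe onto */
        uint32_t        stripe_size;  /* bytes written to one target before moving on */
        enum redundancy raid_level;   /* per-file RAID 0 (striped) or RAID 1 (mirrored) */
    };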


                         Los Alamos National
                         Laboratory
Panasas
●   Within the storage device, all
    objects are accessed via a 96-bit
    object ID. The object is accessed
    based on the object ID, the
    beginning of the range of bytes
    inside the object and the length of
    the byte range that is of interest
    (<objectID, offset, length>); see the
    sketch at the end of this slide.
●   There are three different types of
    objects:
     ●   The “Root” object on the storage
         device identifies the storage
         device and various attributes of
         the device; including total capacity
         and available capacity.
     ●   A “Group” object provides a
         “directory” to a logical subset of
         the objects on the storage device.
     ●   A ”User” object contains the actual
         application data to be stored.
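A minimal sketch of the <objectID, offset, length> triple as a C structure. Splitting the 96-bit object ID into two fields and the field names are assumptions made for illustration; this is not the OSD wire format.

    #include <stdint.h>

    /* Illustrative request descriptor mirroring <objectID, offset, length>. */
    typedef struct {
        uint64_t object_id_hi;  /* upper 64 bits of the 96-bit object ID */
        uint32_t object_id_lo;  /* lower 32 bits of the object ID */
        uint64_t offset;        /* first byte of interest within the object */
        uint64_t length;        /* number of bytes to read or write */
    } object_io_request;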



                               Los Alamos National
                               Laboratory
Panasas
●   The “User” object is a container for data and two types of attributes:
    ●   Application Data is essentially the equivalent of the data that a file would
        normally have in a conventional file system. It is accessed with file-like
        commands such as Open, Close, Read and Write.
    ●   Storage Attributes are used by the storage device to manage the block
        allocation for the data. They include the object ID, block pointers,
        logical length and capacity used, similar to the inode-level attributes
        inside a traditional file system (see the sketch at the end of this slide).
    ●   User Attributes are opaque to the storage device and are used by
        applications and metadata managers to store higher-level information
        about the object.
         ●   These attributes can include: file system attributes such as ownership and
             access control lists (ACLs), Quality of Service requirements that apply to a
             specific object, and how the storage system treats a specific object (e.g., what
             level of RAID to apply, the size of the user’s quota or the performance
             characteristics required for that data).
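A rough illustration of the inode-like storage attributes listed above, collected into a single record. Field sizes, names, and the block-pointer limit are assumptions for illustration, not the storage device's actual metadata layout.

    #include <stdint.h>

    #define MAX_BLOCK_PTRS 16  /* arbitrary limit chosen for the illustration */

    /* Hypothetical storage attributes kept by the storage device for one
     * user object: object ID, block pointers, logical length, capacity used. */
    struct object_storage_attrs {
        uint64_t object_id_hi;               /* upper bits of the 96-bit object ID */
        uint32_t object_id_lo;               /* lower bits of the object ID */
        uint64_t block_ptrs[MAX_BLOCK_PTRS]; /* device blocks allocated to the object */
        uint64_t logical_length;             /* bytes of application data stored */
        uint64_t capacity_used;              /* raw capacity consumed, including overhead */
    };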




                              Los Alamos National
                              Laboratory
Panasas
●   The Panasas concept of object
    storage is implemented entirely
    in hardware.
●   The Panasas ActiveScale File
    System supports two modes of
    data access:
    ●   DirectFLOW is an out-of-band
        solution enabling Linux cluster
        nodes to directly access data
        on StorageBlades in parallel.
    ●   NFS/CIFS operates in band,
        utilizing the DirectorBlades as
        a gateway between NFS/CIFS
        clients and StorageBlades.




                           Los Alamos National
                           Laboratory
Panasas Performance




●   Random I/O – On the SPECsfs97_R1.v3 benchmark, as measured by the
    Standard Performance Evaluation Corporation (www.spec.org), a Panasas
    ActiveScale storage cluster produced a peak of 305,805 random I/O
    operations per second.
●   Data Throughput – As measured “in-house” by Panasas, a similarly
    configured cluster delivered a sustained 10.1 GBytes/second on
    sequential I/O read tests.

                       Los Alamos National
                       Laboratory
Lustre
●   Lustre is an open, standards-based technology that runs on commodity
    hardware and uses object-based disks for storage and metadata servers for
    file system metadata.
     ●   This design provides an efficient division of labor between computing and
         storage resources.
●   Replicated, failover Metadata Servers (MDSs) maintain a transactional
    record of high-level file and file system changes.
●   Distributed Object Storage Targets (OSTs) are responsible for actual file
    system I/O and for interfacing with storage devices.
     ●   File operations bypass the metadata server completely and utilize the parallel
         data paths to all OSTs in the cluster.
●   Lustre’s approach of separating metadata operations from data operations
    results in enhanced performance.
     ●   The division of metadata and data operations creates a scalable file system with
         greater recoverability from failure conditions by providing the advantages of both
         journaling and distributed file systems.




                              Los Alamos National
                              Laboratory
Lustre
●   Lustre supports strong file and metadata locking
    semantics to maintain coherency of the file systems even
    under a high volume of concurrent access.
●   File locking is distributed across the Object Storage
    Targets (OSTs) that constitute the file system, with each
    OST managing locks for the objects that it stores.
●   Lustre uses an open networking stack composed of
    three layers:
    ●   At the top of the stack is the Lustre request processing layer.
    ●   Beneath the Lustre request processing layer is the Portals API
        developed by Sandia National Laboratories.
    ●   At the bottom of the stack is the Network Abstraction Layer
        (NAL) which is intended to provide out-of-the-box support for
        multiple types of networks.



                         Los Alamos National
                         Laboratory
Lustre
●   Lustre provides security in the form of authentication,
    authorization and privacy by leveraging existing security
    systems.
    ●   This eases incorporation of Lustre into existing enterprise
        security environments without requiring changes to Lustre.
●   Similarly, Lustre leverages the underlying journaling file
    systems provided by Linux.
    ●   These journaling file systems enable persistent state recovery,
        providing resiliency and recoverability from failed OSTs.
●   Finally, Lustre’s configuration and state information is
    recorded and managed using open standards such as
    XML and LDAP.
    ●   This eases the task of integrating Lustre into existing environments
        or third-party tools.



                         Los Alamos National
                         Laboratory
Lustre
●   Lustre technology is
    designed to scale while
    maintaining resiliency.
    ●   As servers are added to a
        typical cluster environment,
        failures become more likely
        due to the increasing
        number of physical
        components.
    ●   Lustre’s support for
        resilient, redundant
        hardware provides
        protection from inevitable
        hardware failures through
        transparent failover and
        recovery.


                         Los Alamos National
                         Laboratory
Lustre File System Abstractions
●   The Lustre file system provides several
    abstractions designed to improve both
    performance and scalability.
    ●   At the file system level, Lustre treats files
        as objects that are located through
        Metadata Servers (MDSs).
    ●   Metadata Servers support all file system
        namespace operations:
         ●   These operations include file lookups, file
             creation, and file and directory attribute
             manipulation, as well as directing actual
             file I/O requests to Object Storage Targets
             (OSTs), which manage the storage that is
             physically located on underlying Object-
             Based Disks (OBDs).
    ●   Metadata servers maintain a transactional
        record of file system metadata changes
        and cluster status, as well as supporting
        failover operations.


                                Los Alamos National
                                Laboratory
Lustre Inodes, OST’s & OBD’s
●   Like traditional file systems, the Lustre file system has a
    unique inode for every regular file, directory, symbolic
    link, and special file.
     ●   Creating a new file causes the client to contact a metadata
         server, which creates an inode for the file and then contacts the
         OSTs to create objects that will actually hold file data.
          ●   Metadata for the objects is held in the inode as extended attributes
              for the file.
     ●   The objects allocated on OSTs hold the data associated with the
         file and can be striped across several OSTs in a RAID pattern.
     ●   Within the OST, data is actually read and written to underlying
         storage known as Object-Based Disks (OBDs).
     ●   Subsequent I/O to the newly created file is done directly between
         the client and the OST, which interacts with the underlying OBDs
         to read and write data (see the client-side sketch at the end of this slide).
          ●   The metadata server is only updated when additional namespace
              changes associated with the new file are required.
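From the application's point of view, all of this happens behind ordinary POSIX calls. The sketch below restates the sequence above as comments on a plain open/write; the path is hypothetical, and the comments describe the conceptual flow from this slide rather than Lustre's internal protocol.

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* open(..., O_CREAT): the client contacts a metadata server, which
         * creates the inode and has OSTs allocate the objects that will hold
         * the file data; object metadata is kept in the inode as extended
         * attributes. */
        int fd = open("/lustre/scratch/output.dat", O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
            return 1;

        /* write(): data moves directly between the client and the OSTs
         * (which read and write their underlying OBDs); the metadata server
         * is contacted again only for further namespace changes. */
        char buf[4096] = {0};
        if (write(fd, buf, sizeof buf) < 0) {
            close(fd);
            return 1;
        }

        close(fd);
        return 0;
    }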


                             Los Alamos National
                             Laboratory
Lustre Network Independence
●   Lustre can be used over a wide variety of
    networks due to its use of an open Network
    Abstraction Layer. Lustre is currently in use
    over TCP and Quadrics (QSWNet)
    networks.
     ●   Myrinet, Fibre Channel, Stargen and
         InfiniBand support are under development.
     ●   Lustre's network neutrality enables it to
         quickly take advantage of performance
         improvements offered by new network
         hardware and protocols.
●   Lustre provides unique support for
    heterogeneous networks.
     ●   For example, it is possible to connect some
         clients over an Ethernet to the MDS and
         OST servers, and others over a QSW
         network, in a single installation.




                                 Los Alamos National
                                 Laboratory
Lustre
●   One drawback to Lustre
    is that a Lustre client
    cannot run on a server
    that is providing OSTs.
●   Lustre has not been
    ported to support UNIX
    and Windows operating
    systems.
    ●   Lustre clients can and
        probably will be
        implemented on non-Linux
        platforms, but as of this
        date, Lustre is available
        only on Linux.

                        Los Alamos National
                        Laboratory
Lustre Performance
●   Hewlett-Packard (HP) and Pacific Northwest National
    Laboratory (PNNL) have partnered on the design,
    installation, integration and support of one of the top 10
    fastest computing clusters in the world.
●   The HP Linux super cluster, with more than 1,800
    Itanium® 2 processors, is rated at more than 11
    TFLOPS.
●   PNNL has run Lustre for more than a year and currently
    sustains over 3.2 GB/s of bandwidth running production
    loads on a 53-terabyte Lustre-based file share.
    ●   Individual Linux clients are able to write data to the parallel
        Lustre servers at more than 650 MB/s.



                          Los Alamos National
                          Laboratory
Lustre Summary
●   Lustre is a storage architecture and distributed file system that
    provides significant performance, scalability, and flexibility to
    computing clusters.
●   Lustre uses an object storage model for file I/O and storage
    management to provide an efficient division of labor between
    computing and storage resources.
    ●   Replicated, failover Metadata Servers (MDSs) maintain a transactional
        record of high-level file and file system changes.
    ●   Distributed Object Storage Targets (OSTs) are responsible for actual file
        system I/O and for interfacing with local or networked storage devices
        known as Object-Based Disks (OBDs).
●   Lustre leverages open standards such as Linux, XML, LDAP, readily
    available open source libraries, and existing file systems to provide
    a scalable, reliable distributed file system.
●   Lustre uses failover, replication, and recovery techniques to
    minimize downtime and to maximize file system availability, thereby
    maximizing cluster productivity.


                           Los Alamos National
                           Laboratory
Storage Aggregation
●   Rather than providing scalable performance by striping
    data across dedicated storage devices, storage
    aggregation provides scalable capacity by utilizing
    available storage blocks on each compute node.
●   Each compute node runs a server daemon that provides
    access to free space on the local disks.
    ●   Additional software runs on each client node that combines
        those available blocks into a virtual device and provides locking
        and concurrent access to the other compute nodes.
    ●   Each compute node could potentially be both a server of blocks and a
        client. Using storage aggregation on a large (>1000 node)
        cluster, tens of TB of free storage could potentially be made
        available for use as high-performance temporary space.



                         Los Alamos National
                         Laboratory
Parallel Virtual File System
                   (PVFS2)
●   Parallel Virtual File System 2 (PVFS2) is an open source
    project from Clemson University that provides a
    lightweight server daemon giving hundreds to thousands
    of clients simultaneous access to storage devices.
●   Each node in the cluster can be a server, a client, or
    both.
●   Since storage servers can also be clients, PVFS2
    supports striping data across all available storage
    devices in the cluster (e.g., storage aggregation) .
    ●   PVFS2 is best suited for providing large, fast temporary storage.




                         Los Alamos National
                         Laboratory
Parallel Virtual File System
                 (PVFS2)
●   Implicitly maintains consistency by carefully
    structuring the metadata and namespace.
●   Uses relaxed semantics, defining the semantics
    of data access so that they can be achieved
    without locking.




                  Los Alamos National
                  Laboratory
Parallel Virtual File System
                  (PVFS2)
●   PVFS2 shows that it is possible to build a
    parallel file system that implicitly maintains
    consistency by carefully structuring the metadata
    and name space and by defining the semantics
    of data access that can be achieved without
    locking.
●   This design leads to file system behavior that
    some traditional applications do not expect.
    ●   These relaxed semantics are not new in the field of
        parallel I/O. PVFS2 closely implements the
        semantics dictated by MPI-IO.


                      Los Alamos National
                      Laboratory
Parallel Virtual File System
                  (PVFS2)
●   PVFS2 also has native support
    for flexible noncontiguous data
    access patterns.
●   For example, imagine an
    application that reads a column
    of elements out of an array. To
    retrieve this data, the
    application might issue a large
    number of small and scattered
    reads to the file system.
●   However, if it could ask the file
    system for all of the
    noncontiguous elements in a
    single operation, both the file
    system and the application
    could perform more efficiently.



                         Los Alamos National
                         Laboratory
PVFS2 Stateless Architecture
●   PVFS2 is designed around a stateless architecture.
    ●   PVFS2 servers do not keep track of typical file system
        bookkeeping information such as which files have been opened,
        file positions, and so on.
    ●   There is also no shared lock state to manage.
●   The major advantage of a stateless architecture is that
    clients can fail and resume without disturbing the system
    as a whole.
●   It also allows PVFS2 to scale to hundreds of servers and
    thousands of clients without being impacted by the
    overhead and complexity of tracking file state or locking
    information associated with these clients.



                        Los Alamos National
                        Laboratory
PVFS2 Design Choices
●   These design choices enable PVFS2 to perform well in a
    parallel environment, but not so well if treated as a local
    file system.
    ●   Without client-side caching of metadata, status operations
        typically take a long time, as the information is retrieved over the
        network. This can make programs like “ls” take longer to
        complete than might be expected.
●   PVFS2 is better suited for I/O intensive applications,
    rather than for hosting a home directory.
    ●   PVFS2 is optimized for efficient reading and writing of large
        amounts of data, and thus it’s very well suited for scientific
        applications.




                          Los Alamos National
                          Laboratory
PVFS2 Components
●   The basic PVFS2 package consists of three
    components: a server, a client, and a kernel
    module.
    ●   The server runs on nodes that store either file system
        data or metadata.
    ●   The client and the kernel module are used by nodes
        that actively store or retrieve the data (or metadata)
        from the PVFS2 servers.
●   Unlike the original PVFS, each PVFS2 server
    can operate as a data server, a metadata server,
    or both simultaneously.

                      Los Alamos National
                      Laboratory
Accessing PVFS2 File Systems
●   Two methods are provided for accessing PVFS2 file
    systems.
    ●   The first is to mount the PVFS2 file system. This lets the user
        change and list directories, or move files, as well as execute
        binaries from the file system.
         ●   This mechanism introduces some performance overhead but is the
             most convenient way to access the file system interactively.
    ●   Scientific applications use the second method, MPI-IO.
         ●   The MPI-IO interface helps optimize access to single files by many
             processes on different nodes. It also provides “noncontiguous”
             access operations that allow for efficient access to data spread
             throughout the file.
         ●   For a strided pattern such as the column access described earlier, this is
             done by asking for every eighth element starting at offset 0 and ending at
             offset 56, all as one file system operation (see the MPI-IO sketch at the
             end of this slide).
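A minimal MPI-IO sketch of that strided request: every eighth element, starting at element 0, read with a single collective call. The file name and the choice of double as the element type are assumptions made for illustration.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* File view selecting every 8th element: 8 blocks of 1 element with a
         * stride of 8 elements, i.e. elements 0, 8, 16, ..., 56. */
        MPI_Datatype every8th;
        MPI_Type_vector(8, 1, 8, MPI_DOUBLE, &every8th);
        MPI_Type_commit(&every8th);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "/mnt/pvfs2/array.dat",
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_DOUBLE, every8th, "native", MPI_INFO_NULL);

        /* One collective operation retrieves all eight noncontiguous elements. */
        double column[8];
        MPI_File_read_all(fh, column, 8, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&every8th);
        MPI_Finalize();
        return 0;
    }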



                            Los Alamos National
                            Laboratory
PVFS2 Summary
●   There is no single file system that is the perfect
    solution for every I/O workload, and PVFS2 is no
    exception.
●   High-performance applications rely on a different
    set of features to access data than those
    provided by typical networked file systems.
    ●   PVFS2 is best suited for I/O-intensive applications.
    ●   PVFS2 was not intended for home directories, but as
        a separate, fast, scalable file system, it is very
        capable.


                      Los Alamos National
                      Laboratory
Red Hat Global File System
●   Red Hat Global File System (GFS) is an open source,
    POSIX-compliant cluster file system.
●   Red Hat GFS executes on Red Hat Enterprise Linux
    servers attached to a storage area network (SAN).
    ●   GFS runs on all major server and storage platforms supported by
        Red Hat.
●   Allows simultaneous reading and writing of blocks to a
    single shared file system on a SAN.
    ●   GFS can be configured without any single points of failure.
    ●   GFS can scale to hundreds of Red Hat Enterprise Linux servers.
    ●   GFS is compatible with all standard Linux applications.
●   Supports direct I/O by databases
    ●   Improves database performance by avoiding traditional file
        system overhead (see the O_DIRECT sketch at the end of this slide).
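A minimal sketch of direct I/O from the application side: the file is opened with O_DIRECT and read into a suitably aligned buffer, so the kernel's page cache is bypassed and the database manages its own caching. The path and the 4 KiB alignment are assumptions for illustration.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* O_DIRECT bypasses the page cache; buffers, offsets, and transfer
         * sizes must be aligned (commonly to 512 bytes or the file system
         * block size). */
        int fd = open("/mnt/gfs/db/datafile", O_RDWR | O_DIRECT);
        if (fd < 0)
            return 1;

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) {
            close(fd);
            return 1;
        }

        ssize_t n = pread(fd, buf, 4096, 0);  /* aligned, cache-bypassing read */
        (void)n;

        free(buf);
        close(fd);
        return 0;
    }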



                        Los Alamos National
                        Laboratory
Red Hat Global File System
●   Red Hat Enterprise Linux allows organizations to utilize the default
    Linux file system, Ext3 (Third Extended file-system), NFS (Network
    File System) or Red Hat's GFS cluster file system.
    ●   Ext3 is a journaling file system, which uses log files to preserve the
        integrity of the file system in the event of a sudden failure. It is the
        standard file system used by all Red Hat Enterprise Linux systems.
    ●   NFS is the de facto standard approach to accessing files across the
        network.
    ●   GFS (Global File System) allows multiple servers to share access to the
        same files on a SAN while managing that access to avoid conflicts.
         ●   Sistina Software, the original developer of GFS, was acquired by Red Hat at
             the end of 2003. Subsequently, Red Hat contributed GFS to the open source
             community under the GPL license.
         ●   GFS is provided as a fully supported, optional layered product for Red Hat
             Enterprise Linux systems.




                             Los Alamos National
                             Laboratory
GFS Logical Volume Manager
●   Red Hat Enterprise Linux includes the Logical Volume Manager (LVM),
    which provides kernel-level storage virtualization capabilities. LVM supports
    combining physical storage elements into a collective storage pool,
    which can then be allocated and managed according to application
    requirements, without regard for the specifics of the underlying physical disk
    systems.
●   LVM was initially developed by Sistina and is now part of the standard Linux kernel.
●   LVM provides enterprise-level volume management capabilities that are
    consistent with the leading, proprietary enterprise operating systems.
●   LVM capabilities include:
     ●   Storage performance and availability management by allowing for the addition
         and removal of physical devices and through dynamic disk volume resizing.
         Logical volumes can be resized dynamically online.
     ●   Ext3 supports offline file system resizing (requiring unmount, resize, and
         mount operations).
     ●   Disk system management that enables the upgrading of disks, removal of failing
         disks, reorganization of workloads, and adaptation of storage capacity to
         changing system needs.




                             Los Alamos National
                             Laboratory
GFS Multi-Pathing
●   Red Hat GFS works in concert with Red Hat
    Cluster Suite to provide failover of critical
    computing components for high availability.
●   Multi-path access to storage is essential to
    continued availability in the event of path
    failure (such as failure of a Host Bus
    Adapter).
●   Red Hat Enterprise Linux’s multi-path
    device driver (MD driver) recognizes
    multiple paths to the same device,
    eliminating the problem of the system
    assuming each path leads to a different
    disk.
    ●   The MD driver combines the paths to a single
        disk, enabling failover to an alternate path if
        one path is disrupted.




                            Los Alamos National
                            Laboratory
GFS Enterprise Storage Options
●   Although SAN and NAS have emerged as the preferred enterprise
    storage approach, direct attached storage remains widespread
    throughout the enterprise. Red Hat Enterprise Linux supports the full
    set of enterprise storage options:
    ●   Direct attached storage
         ●   SCSI
         ●   ATA
         ●   Serial ATA
         ●   SAS (Serial Attached SCSI)
    ●   Networked storage
         ●   SAN (access to block-level data over Fibre Channel or IP networks)
         ●   NAS (access to data at the file level over IP networks)
    ●   Storage interconnects
         ●   Fibre Channel (FC)
         ●   iSCSI
         ●   GNBD (global network block device)
         ●   NFS




                             Los Alamos National
                             Laboratory
GFS on SAN’s
●   SANs provide direct block-level
    access to storage. When
    deploying a SAN with the Ext3 file
    system, each server mounts and
    accesses disk partitions
    individually. Concurrent access is
    not possible. When a server shuts
    down or fails, the clustering
    software will “failover” its disk
    partitions so that a remaining
    server can mount them and
    resume its tasks.
●   Deploying GFS on SAN-
    connected servers allows full
    sharing of all file system data,
    concurrently. These two
    configuration topologies are
    shown in the diagram.



                          Los Alamos National
                          Laboratory
GFS on NFS
●   In general, an NFS file server,
    usually configured with local
    storage, will serve file-level data
    across a network to remote NFS
    clients. This topology is best
    suited for non-shared data files
    (individual users' directories, for
    example) and is widely used in
    general purpose computing
    environments.
●   NFS configurations generally offer
    lower performance than block-
    based SAN environments, but
    they are configured using standard
    IP networking hardware so offer
    excellent scalability. They are also
    considerably less expensive.




                           Los Alamos National
                           Laboratory
GFS on iSCSI
●   Combining the performance and sharing capabilities of a
    SAN environment with the scalability and cost
    effectiveness of a NAS environment is highly desirable.
●   A topology that achieves this uses SAN technology to
    provide the core (“back end”) physical disk infrastructure,
    and then uses block-level IP technology to distribute
    served data to its eventual consumer across the network.
●   The emerging technology for delivering block-level data
    across a network is iSCSI.
    ●   This has been developing slowly for a number of years, but as
        the necessary standards have stabilized, adoption by industry
        vendors has started to accelerate considerably.
    ●   Red Hat Enterprise Linux currently supports iSCSI.




                        Los Alamos National
                        Laboratory
GFS on GNBD
●   As an alternative to iSCSI, Red Hat Enterprise Linux provides support for
    Red Hat’s Global Network Block Device (GNBD) protocol, which allows
    block-level data to be accessed over TCP/IP networks.
●   The combination of GNBD and GFS provides additional flexibility for sharing
    data on the SAN. This topology allows a GFS cluster to scale to hundreds of
    servers, which can concurrently mount a shared file system without the
    expense of including a Fibre Channel HBA and associated Fibre Channel
    switch port with every machine.
●   GNBD can make SAN data available to many other systems on the network
    without the expense of a Fibre Channel SAN connection.
     ●   Today, GNBD and iSCSI offer similar capabilities; however, GNBD is a mature
         technology while iSCSI is still relatively new.
     ●   Red Hat provides GNBD as part of Red Hat Enterprise Linux so that customers
         can deploy IP network-based SANs today.
     ●   As iSCSI matures it is expected to supplant GNBD, offering better performance
         and a wider range of configuration options. An example configuration is shown in
         the diagram that follows.




                              Los Alamos National
                              Laboratory
GFS Summary
●   Enterprises can now deploy large sets of open
    source, commodity servers in a horizontal
    scalability strategy and achieve the same levels
    of processing power for far less cost.
●   Such horizontal scalability can lead an
    organization toward utility computing, where
    server and storage resources are added as
    needed. Red Hat Enterprise Linux provides
    substantial server and storage flexibility; the
    ability to add and remove servers and storage
    and to redirect and reallocate storage resources
    dynamically.

                  Los Alamos National
                  Laboratory
Summary
●   Panasas is a clustered, asymmetric, parallel, object-based, distributed file system.
     ●   Implements the file system entirely in hardware.
     ●   Claims the highest sustained data rate of the four systems reviewed.
●   Lustre is a clustered, asymmetric, parallel, object-based, distributed file system.
     ●   An open, standards-based system.
     ●   Great modularity and compatibility with interconnects, networking components and storage
         hardware.
     ●   Currently only available for Linux.
●   Parallel Virtual File System 2 (PVFS2) is a clustered, symmetric, parallel,
    aggregation-based, distributed file system.
     ●   Data access is achieved without file or metadata locking
     ●   PVFS2 is best suited for I/O-intensive (i.e., scientific) applications
     ●   PVFS2 could be used for high-performance scratch storage where data is copied and
         simulation results are written from thousands of cycles simultaneously.
●   Red Hat Global File System (GFS) is a clustered, symmetric, parallel, block-based,
    distributed file system.
     ●   An open, standards-based system.
     ●   Great modularity and compatibility with interconnects, networking components and storage
         hardware.
     ●   A relatively low-cost, SAN-based technology.
     ●   Only available on Red Hat Enterprise Linux.




                                 Los Alamos National
                                 Laboratory
Conclusions
●   No single clustered parallel file system can address the
    requirements of every environment.
●   Hardware-based implementations have greater throughput than
    software-based implementations.
●   Standards based implementations exhibit greater modularity and
    flexibility in interoperating with third-party components and appear
    most open to the incorporation of new technology.
●   All implementations appear to scale well into the range of thousands of
    clients, hundreds of servers, and hundreds of TB of storage.
●   All implementations appear to address the issue of hardware and
    software redundancy, component failover, and avoidance of a single
    point of failure.
●   All implementations exhibit the ability to take advantage of low-latency,
    high-bandwidth interconnects, thus avoiding the overhead
    associated with TCP/IP networking.




                        Los Alamos National
                        Laboratory
Questions?




  Los Alamos National Laboratory
References
    Panasas:
   http://www.panasas.com/docs/Object_Storage_Architecture_WP.pdf

    Lustre:
   http://www.lustre.org/docs/whitepaper.pdf

    A Next-Generation Parallel File System for Linux Clusters:
    http://www.pvfs.org/files/linuxworld-JAN2004-PVFS2.ps


    Red Hat Global File System:
    http://www.redhat.com/whitepapers/rha/gfs/GFS_INS0032US.pdf

    Red Hat Enterprise Linux: Creating a Scalable Open Source Storage Infrastructure:
    http://www.redhat.com/whitepapers/rhel/RHEL_creating_a_scalable_os_storage_infrastructure.pdf

    Exploring Clustered Parallel File Systems and Object Storage by Michael Ewan:
   http://www.intel.com/cd/ids/developer/asmona/eng/238284.htm?prn=Y


                            Los Alamos National
                            Laboratory

Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Survey of clustered_parallel_file_systems_004_lanl.ppt

  • 1. A Survey of Clustered Parallel File Systems for High Performance Computing Clusters James W. Barker, Ph. D. Los Alamos National Laboratory Computer, Computational and Statistical Sciences Division Los Alamos National Laboratory
  • 2. Definition of Terms ● Distributed File System - The generic term for a client/server or "network" file system where the data is not locally attached to a host. ● Network File System (NFS) is the most common distributed file system currently in use. ● Storage Area Network (SAN) File System – Provides a means for hosts to share Fiber Channel storage, which is traditionally separated into private physical areas bound to different hosts. A block-level metadata manager manages access to different SAN devices. A SAN File system mounts storage natively on only one node and connects all other nodes to that storage by distributing the block address of that storage to all other nodes. ● Scalability is often an issue due to the significant workload required of the metadata managers and the large network transactions required in order to access data. ● Examples include: IBM’s General Parallel File System (GPFS) and Sistina (now Red Hat) Global File System (GFS) Los Alamos National Laboratory
  • 3. Definition of Terms ● Symmetric File Systems - A symmetric file system is one in which the clients also host the metadata manager code, resulting in all nodes understanding the disk structures. ● A concern with these systems is the burden that metadata management places on the client node, serving both itself and other nodes, which can impact the ability of the client node to perform its intended computational jobs. ● Examples include IBM’s GPFS and Red Hat GFS ● Asymmetric File Systems - An asymmetric file system is a file system in which there are one or more dedicated metadata managers that maintain the file system and its associated disk structures. ● Examples include Panasas ActiveScale, Lustre and traditional NFS file systems. Los Alamos National Laboratory
  • 4. Definition of Terms ● Cluster File System - a distributed file system that is not a single server with a set of clients, but a cluster of servers that all work together to provide high performance storage service to their clients. ● To the clients the cluster file system is transparent, it is simply "the file system", but the file system software manages distributing requests to elements of the storage cluster. ● Examples include: Hewlett-Packard Tru64 cluster and Panasas ActiveScale ● Parallel File System - a parallel file system is one in which data blocks are striped, in parallel, across multiple storage devices on multiple storage servers. Support for parallel applications is provided allowing all nodes access to the same files at the same time, thus providing concurrent read and write capabilities. ● Network Link Aggregation, another parallel file system technique, is the technology used by PVFS2, in which the I/O is spread across several network connections in parallel, each packet taking a different link path from the previous packet. ● Examples of this include: Panasas ActiveScale, Lustre, PVFS2, GPFS and GFS. Los Alamos National Laboratory
  • 5. Definition of Terms ● An important note: all of the above definitions overlap. A SAN file system can be symmetric or asymmetric. Its servers may be clustered or single servers. And it may support parallel applications or it may not. ● For example; the Panasas Storage Cluster and its ActiveScale File System (a.k.a. PanFS) is a clustered (many servers share the work), asymmetric (metadata management does not occur on the clients), parallel (supports concurrent reads and writes), object-based (not block- based) distributed (clients access storage via the network) file system. ● Another example; the Lustre File System is also a clustered, asymmetric, parallel, object-based (referred to as targets by Lustre), distributed file system. ● Another example, the Parallel Virtual File System 2 (PVFS2) is a clustered, symmetric, parallel, aggregation-based, distributed file system. ● And finally; the Red Hat Global File System (GFS) is a clustered, symmetric, parallel, block-based, distributed file system. Los Alamos National Laboratory
  • 6. Object Storage Components ● An Object contains the data and enough additional information to allow the data to be autonomous and self-managing. ● An Object-based Storage Device (OSD) is an intelligent evolution of the disk drive capable of storing and serving objects rather than simply copying data to tracks and sectors. (The term OSD does not exist in Lustre) ● The term OSD in Panasas = The term OST in Lustre ● An Object-based Storage Target (OST) is an abstraction layer above the physical blocks of a physical disk (in Panasas terminology, not in Lustre). ● An Object-Based Disk (OBD) is an abstraction of the physical blocks of the physical disks (in Lustre terminology, OBD’s do not exist in Panasas terminology). ● An Installable File System (IFS) integrates with compute nodes, accepts POSIX file system commands and data from the Operating System, addresses the OSD’s directly and stripes the objects across multiple OSD’s. ● A Metadata Server mediates among multiple compute nodes in the environment, allowing them to share data while maintaining cache consistency on all nodes. ● The Network Fabric ties the compute nodes to the OSD’s and metadata servers. Los Alamos National Laboratory
  • 7. Storage Objects ● Each file or directory can be thought of as an object. As with all objects, storage objects have attributes. ● Each storage object attribute can be assigned a value such as file type, file location, whether the data is striped or not, ownership, and permissions. ● An object storage device (OSD) allows us to specify for each file where to store the blocks allocated to the file, via a metadata server and object storage targets. ● Extending the storage attributes further, a file can also specify how many object storage targets to stripe across and what level of redundancy to employ on those targets. ● Some implementations (Panasas) allow the specification of RAID 0 (striped) or RAID 1 (mirrored) on a per-file basis. Los Alamos National Laboratory
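To make the per-file attribute idea concrete, here is a minimal C sketch of how such layout attributes could be represented. The struct and field names are hypothetical and are not the actual Panasas or OSD interfaces; they only illustrate the kind of information carried per object.

```c
/* Hypothetical per-file storage attributes; not an actual Panasas/OSD API. */
#include <stdint.h>
#include <stdio.h>

enum redundancy { LAYOUT_RAID0_STRIPED, LAYOUT_RAID1_MIRRORED };

struct object_attrs {
    uint64_t object_id;      /* identifies the storage object */
    enum redundancy layout;  /* per-file RAID level */
    uint32_t stripe_targets; /* how many object storage targets to stripe across */
    uint32_t owner_uid;      /* ownership */
    uint32_t mode;           /* permissions */
};

int main(void)
{
    /* A large result file might choose striping for bandwidth; a small,
     * critical file might choose mirroring for redundancy. */
    struct object_attrs results = { 42, LAYOUT_RAID0_STRIPED, 8, 1000, 0644 };
    struct object_attrs journal = { 43, LAYOUT_RAID1_MIRRORED, 2, 1000, 0600 };
    printf("object %llu striped across %u targets\n",
           (unsigned long long)results.object_id, results.stripe_targets);
    printf("object %llu mirrored on %u targets\n",
           (unsigned long long)journal.object_id, journal.stripe_targets);
    return 0;
}
```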
  • 8. Panasas ● Within the storage device, all objects are accessed via a 96-bit object ID. The object is accessed based on the object ID, the beginning of the range of bytes inside the object and the length of the byte range that is of interest (<objectID, offset, length>). ● There are three different types of objects: ● The “Root” object on the storage device identifies the storage device and various attributes of the device, including total capacity and available capacity. ● A “Group” object provides a “directory” to a logical subset of the objects on the storage device. ● A “User” object contains the actual application data to be stored. Los Alamos National Laboratory
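The <objectID, offset, length> access model can be sketched as a small request structure. This is illustrative only, assuming the 96-bit identifier is modeled as a 12-byte array; it is not the actual Panasas or OSD command format.

```c
/* Illustrative sketch of <objectID, offset, length> addressing; hypothetical types. */
#include <stdint.h>
#include <stdio.h>

enum osd_object_kind { OSD_ROOT, OSD_GROUP, OSD_USER };

struct osd_object_id {
    uint8_t id[12];              /* 96-bit object identifier */
};

struct osd_read_req {
    enum osd_object_kind kind;   /* root, group, or user object */
    struct osd_object_id oid;    /* which object to read */
    uint64_t offset;             /* start of the byte range of interest */
    uint64_t length;             /* length of the byte range */
};

int main(void)
{
    /* Read 1 MiB starting 4 MiB into a user (data-bearing) object. */
    struct osd_read_req req = { OSD_USER, { {0} }, 4ull << 20, 1ull << 20 };
    printf("read %llu bytes at offset %llu\n",
           (unsigned long long)req.length, (unsigned long long)req.offset);
    return 0;
}
```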
  • 9. Panasas ● The “User” object is a container for data and two types of attributes: ● Application Data is essentially the equivalent of the data that a file would normally have in a conventional file system. It is accessed with file-like commands such as Open, Close, Read and Write. ● Storage Attributes are used by the storage device to manage the block allocation for the data. This includes the object ID, block pointers, logical length and capacity used. This is similar to the inode-level attributes inside a traditional file system. ● User Attributes are opaque to the storage device and are used by applications and metadata managers to store higher-level information about the object. ● These attributes can include: file system attributes such as ownership and access control lists (ACL’s), Quality of Service requirements that apply to a specific object and how the storage system treats a specific object (i.e., what level of RAID to apply, the size of the user’s quota or the performance characteristics required for that data). Los Alamos National Laboratory
  • 10. Panasas ● The Panasas concept of object storage is implemented entirely in hardware. ● The Panasas ActiveScale File System supports two modes of data access: ● DirectFLOW is an out of band solution enabling Linux Cluster nodes to directly access data on StorageBlades in parallel. ● NFS/CIFS operates in band, utilizing the DirectorBlades as a gateway between NFS/CIFS clients and StorageBlades. Los Alamos National Laboratory
  • 11. Panasas Performance ● Random I/O – on the SPECsfs97_R1.v3 benchmark, as measured by the Standard Performance Evaluation Corporation (www.spec.org), a Panasas ActiveScale storage cluster produced a peak of 305,805 random I/O operations per second. ● Data Throughput – as measured “in-house” by Panasas, a similarly configured cluster delivered a sustained 10.1 GBytes/second on sequential I/O read tests. Los Alamos National Laboratory
  • 12. Lustre ● Lustre is an open, standards-based technology that runs on commodity hardware and uses object-based disks for storage and metadata servers for file system metadata. ● This design provides an efficient division of labor between computing and storage resources. ● Replicated, failover MetaData Servers (MDSs) maintain a transactional record of high-level file and file system changes. ● Distributed Object Storage Targets (OSTs) are responsible for actual file system I/O and for interfacing with storage devices. ● File operations bypass the metadata server completely and utilize the parallel data paths to all OSTs in the cluster. ● Lustre’s approach of separating metadata operations from data operations results in enhanced performance. ● The division of metadata and data operations creates a scalable file system with greater recoverability from failure conditions by providing the advantages of both journaling and distributed file systems. Los Alamos National Laboratory
  • 13. Lustre ● Lustre supports strong file and metadata locking semantics to maintain coherency of the file systems even under a high volume of concurrent access. ● File locking is distributed across the Object Storage Targets (OSTs) that constitute the file system, with each OST managing locks for the objects that it stores. ● Lustre uses an open networking stack composed of three layers: ● At the top of the stack is the Lustre request processing layer. ● Beneath the Lustre request processing layer is the Portals API developed by Sandia National Laboratory. ● At the bottom of the stack is the Network Abstraction Layer (NAL) which is intended to provide out-of-the-box support for multiple types of networks. Los Alamos National Laboratory
  • 14. Lustre ● Lustre provides security in the form of authentication, authorization and privacy by leveraging existing security systems. ● This eases incorporation of Lustre into existing enterprise security environments without requiring changes to Lustre. ● Similarly, Lustre leverages the underlying journaling file systems provided by Linux ● These journaling file systems enable persistent state recovery, providing resiliency and recoverability from failed OST’s. ● Finally, Lustre’s configuration and state information is recorded and managed using open standards such as XML and LDAP ● Easing the task of integrating Lustre into existing environments or third-party tools. Los Alamos National Laboratory
  • 15. Lustre ● Lustre technology is designed to scale while maintaining resiliency. ● As servers are added to a typical cluster environment, failures become more likely due to the increasing number of physical components. ● Lustre’s support for resilient, redundant hardware provides protection from inevitable hardware failures through transparent failover and recovery. Los Alamos National Laboratory
  • 16. Lustre File System Abstractions ● The Lustre file system provides several abstractions designed to improve both performance and scalability. ● At the file system level, Lustre treats files as objects that are located through Metadata Servers (MDSs). ● Metadata Servers support all file system namespace operations: ● These operations include file lookups, file creation, and file and directory attribute manipulation, as well as directing actual file I/O requests to Object Storage Targets (OSTs), which manage the storage that is physically located on underlying Object-Based Disks (OBDs). ● Metadata servers maintain a transactional record of file system metadata changes and cluster status, as well as supporting failover operations. Los Alamos National Laboratory
  • 17. Lustre Inodes, OST’s & OBD’s ● Like traditional file systems, the Lustre file system has a unique inode for every regular file, directory, symbolic link, and special file. ● Creating a new file causes the client to contact a metadata server, which creates an inode for the file and then contacts the OSTs to create objects that will actually hold file data. ● Metadata for the objects is held in the inode as extended attributes for the file. ● The objects allocated on OSTs hold the data associated with the file and can be striped across several OSTs in a RAID pattern. ● Within the OST, data is actually read and written to underlying storage known as Object-Based Disks (OBDs). ● Subsequent I/O to the newly created file is done directly between the client and the OST, which interacts with the underlying OBDs to read and write data. ● The metadata server is only updated when additional namespace changes associated with the new file are required. Los Alamos National Laboratory
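The striping described above implies a simple client-side calculation from a logical file offset to the OST object and object offset that hold it. The sketch below assumes a plain round-robin layout and hypothetical names; it is not Lustre's actual extended-attribute or wire format.

```c
/* Round-robin stripe mapping sketch; illustrative, not Lustre's real layout. */
#include <stdint.h>
#include <stdio.h>

struct layout {
    uint32_t stripe_count;   /* number of OST objects backing the file */
    uint32_t stripe_size;    /* bytes placed on one object before moving on */
    uint64_t ost_object[8];  /* object id on each OST, assigned at create time */
};

/* Map a logical file offset to (stripe index, offset within that object). */
static void map_offset(const struct layout *l, uint64_t file_off,
                       uint32_t *stripe_idx, uint64_t *obj_off)
{
    uint64_t stripe_no = file_off / l->stripe_size;          /* which stripe unit */
    *stripe_idx = (uint32_t)(stripe_no % l->stripe_count);   /* which OST object */
    *obj_off = (stripe_no / l->stripe_count) * l->stripe_size
             + file_off % l->stripe_size;
}

int main(void)
{
    struct layout l = { 4, 1 << 20, { 101, 102, 103, 104 } };
    uint32_t idx; uint64_t off;
    map_offset(&l, 5ull << 20, &idx, &off);   /* byte 5 MiB into the file */
    printf("file offset 5 MiB -> OST object %llu, object offset %llu\n",
           (unsigned long long)l.ost_object[idx], (unsigned long long)off);
    return 0;
}
```

Once the client holds such a layout, reads and writes go straight to the OSTs named in it; the metadata server is contacted again only for namespace changes.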
  • 18. Lustre Network Independence ● Lustre can be used over a wide variety of networks due to its use of an open Network Abstraction Layer. Lustre is currently in use over TCP and Quadrics (QSWNet) networks. ● Myrinet, Fibre Channel, Stargen and InfiniBand support are under development. ● Lustre's network-neutrality enables Lustre to quickly take advantage of performance improvements provided by network hardware and protocol improvements offered by new systems. ● Lustre provides unique support for heterogeneous networks. ● For example, it is possible to connect some clients over an Ethernet to the MDS and OST servers, and others over a QSW network, in a single installation. Los Alamos National Laboratory
  • 19. Lustre ● One drawback to Lustre is that a Lustre client cannot run on a server that is providing OSTs. ● Lustre has not been ported to support UNIX and Windows operating systems. ● Lustre clients can and probably will be implemented on non-Linux platforms, but as of this date, Lustre is available only on Linux. Los Alamos National Laboratory
  • 20. Lustre Performance ● Hewlett-Packard (HP) and Pacific Northwest National Laboratory (PNNL) have partnered on the design, installation, integration and support of one of the top 10 fastest computing clusters in the world. ● The HP Linux super cluster, with more than 1,800 Itanium® 2 processors, is rated at more than 11 TFLOPS. ● PNNL has run Lustre for more than a year and currently sustains over 3.2 GB/s of bandwidth running production loads on a 53-terabyte Lustre-based file share. ● Individual Linux clients are able to write data to the parallel Lustre servers at more than 650 MB/s. Los Alamos National Laboratory
  • 21. Lustre Summary ● Lustre is a storage architecture and distributed file system that provides significant performance, scalability, and flexibility to computing clusters. ● Lustre uses an object storage model for file I/O and storage management to provide an efficient division of labor between computing and storage resources. ● Replicated, failover Metadata Servers (MDSs) maintain a transactional record of high-level file and file system changes. ● Distributed Object Storage Targets (OSTs) are responsible for actual file system I/O and for interfacing with local or networked storage devices known as Object-Based Disks (OBDs). ● Lustre leverages open standards such as Linux, XML, LDAP, readily available open source libraries, and existing file systems to provide a scalable, reliable distributed file system. ● Lustre uses failover, replication, and recovery techniques to minimize downtime and to maximize file system availability, thereby maximizing cluster productivity. Los Alamos National Laboratory
  • 22. Storage Aggregation ● Rather than providing scalable performance by striping data across dedicated storage devices, storage aggregation provides scalable capacity by utilizing available storage blocks on each compute node. ● Each compute node runs a server daemon that provides access to free space on the local disks. ● Additional software runs on each client node that combines those available blocks into a virtual device and provides locking and concurrent access to the other compute nodes. ● Each compute node could potentially be a server of blocks and a client. Using storage aggregation on a large (>1000 node) cluster, tens of TB of free storage could potentially be made available for use as high-performance temporary space. Los Alamos National Laboratory
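As a back-of-the-envelope illustration of both the capacity claim and the virtual-device idea, the sketch below assumes roughly 20 GiB of free space on each of 1,000 nodes and resolves a virtual block number to a (node, local block) pair by concatenating each node's contribution. It is illustrative only, not the mechanism used by any particular aggregation file system.

```c
/* Storage aggregation sketch: concatenate per-node free blocks into one
 * virtual device. Numbers are assumptions for illustration. */
#include <stdint.h>
#include <stdio.h>

#define NODES 1000

int main(void)
{
    uint64_t free_blocks[NODES];   /* free 4 KiB blocks contributed per node */
    uint64_t total = 0;
    for (int n = 0; n < NODES; n++) {
        free_blocks[n] = 5 * 1024 * 1024;   /* ~20 GiB free per node */
        total += free_blocks[n];
    }
    printf("aggregate scratch space: %llu GiB\n",
           (unsigned long long)((total * 4096) >> 30));   /* tens of TB */

    /* Resolve a virtual block to the node and local block that back it. */
    uint64_t vblock = 3000000000ull, cursor = 0;
    for (int n = 0; n < NODES; n++) {
        if (vblock < cursor + free_blocks[n]) {
            printf("virtual block %llu -> node %d, local block %llu\n",
                   (unsigned long long)vblock, n,
                   (unsigned long long)(vblock - cursor));
            break;
        }
        cursor += free_blocks[n];
    }
    return 0;
}
```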
  • 23. Parallel Virtual File System (PVFS2) ● Parallel Virtual File System 2 (PVFS2) is an open source project from Clemson University that provides a lightweight server daemon giving hundreds to thousands of clients simultaneous access to storage devices. ● Each node in the cluster can be a server, a client, or both. ● Since storage servers can also be clients, PVFS2 supports striping data across all available storage devices in the cluster (e.g., storage aggregation). ● PVFS2 is best suited for providing large, fast temporary storage. Los Alamos National Laboratory
  • 24. Parallel Virtual File System (PVFS2) ● Implicitly maintains consistency by carefully structuring metadata and the namespace. ● Uses relaxed semantics ● Defines the semantics of data access such that they can be achieved without locking. Los Alamos National Laboratory
  • 25. Parallel Virtual File System (PVFS2) ● PVFS2 shows that it is possible to build a parallel file system that implicitly maintains consistency by carefully structuring the metadata and name space and by defining the semantics of data access that can be achieved without locking. ● This design leads to file system behavior that some traditional applications do not expect. ● These relaxed semantics are not new in the field of parallel I/O. PVFS2 closely implements the semantics dictated by MPI-IO. Los Alamos National Laboratory
  • 26. Parallel Virtual File System (PVFS2) ● PVFS2 also has native support for flexible noncontiguous data access patterns. ● For example, imagine an application that reads a column of elements out of an array. To retrieve this data, the application might issue a large number of small and scattered reads to the file system. ● However, if it could ask the file system for all of the noncontiguous elements in a single operation, both the file system and the application could perform more efficiently. Los Alamos National Laboratory
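A short MPI-IO sketch makes the single-call column read concrete: a derived datatype describes "every eighth element" of an 8x8 row-major array of doubles, and one collective read fetches the whole column. The file name and array dimensions are assumptions for illustration; the calls used (MPI_Type_vector, MPI_File_set_view, MPI_File_read_all) are standard MPI-IO rather than anything PVFS2-specific, though PVFS2 is designed to serve exactly this kind of request efficiently.

```c
/* Noncontiguous column read via MPI-IO; build with an MPI compiler (mpicc). */
#include <mpi.h>
#include <stdio.h>

#define ROWS 8
#define COLS 8

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int col = 3;                 /* which column to read (illustrative) */
    double column[ROWS];
    MPI_Datatype coltype;
    MPI_File fh;

    /* ROWS blocks of 1 double, separated by a stride of COLS doubles:
     * i.e., every eighth element of the flattened array. */
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &coltype);
    MPI_Type_commit(&coltype);

    MPI_File_open(MPI_COMM_WORLD, "array.dat",       /* hypothetical file */
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    /* Displace to the column's first element, then let the vector type
     * describe the strided file layout. */
    MPI_File_set_view(fh, (MPI_Offset)(col * sizeof(double)),
                      MPI_DOUBLE, coltype, "native", MPI_INFO_NULL);
    MPI_File_read_all(fh, column, ROWS, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    printf("column[0] = %f\n", column[0]);
    MPI_Type_free(&coltype);
    MPI_Finalize();
    return 0;
}
```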
  • 27. PVFS2 Stateless Architecture ● PVFS2 is designed around a stateless architecture. ● PVFS2 servers do not keep track of typical file system bookkeeping information such as which files have been opened, file positions, and so on. ● There is also no shared lock state to manage. ● The major advantage of a stateless architecture is that clients can fail and resume without disturbing the system as a whole. ● It also allows PVFS2 to scale to hundreds of servers and thousands of clients without being impacted by the overhead and complexity of tracking file state or locking information associated with these clients. Los Alamos National Laboratory
  • 28. PVFS2 Design Choices ● These design choices enable PVFS2 to perform well in a parallel environment, but not so well if treated as a local file system. ● Without client-side caching of metadata, status operations typically take a long time, as the information is retrieved over the network. This can make programs like “ls” take longer to complete than might be expected. ● PVFS2 is better suited for I/O intensive applications, rather than for hosting a home directory. ● PVFS2 is optimized for efficient reading and writing of large amounts of data, and thus it’s very well suited for scientific applications. Los Alamos National Laboratory
  • 29. PVFS2 Components ● The basic PVFS2 package consists of three components: a server, a client, and a kernel module. ● The server runs on nodes that store either file system data or metadata. ● The client and the kernel module are used by nodes that actively store or retrieve the data (or metadata) from the PVFS2 servers. ● Unlike the original PVFS, each PVFS2 server can operate as a data server, a metadata server, or both simultaneously. Los Alamos National Laboratory
  • 30. Accessing PVFS2 File Systems ● Two methods are provided for accessing PVFS2 file systems. ● The first is to mount the PVFS2 file system. This lets the user change and list directories, or move files, as well as execute binaries from the file system. ● This mechanism introduces some performance overhead but is the most convenient way to access the file system interactively. ● Scientific applications use the second method, MPI-IO. ● The MPI-IO interface helps optimize access to single files by many processes on different nodes. It also provides “noncontiguous” access operations that allow for efficient access to data spread throughout the file. ● For a strided pattern such as the column read sketched earlier, this is done by asking for every eighth element starting at offset 0 and ending at offset 56, all as one file system operation. Los Alamos National Laboratory
  • 31. PVFS2 Summary ● There is no single file system that is the perfect solution for every I/O workload, and PVFS2 is no exception. ● High-performance applications rely on a different set of features to access data than those provided by typical networked file systems. ● PVFS2 is best suited for I/O-intensive applications. ● PVFS2 was not intended for home directories, but as a separate, fast, scalable file system, it is very capable. Los Alamos National Laboratory
  • 32. Red Hat Global File System ● Red Hat Global File System (GFS) is an open source, POSIX-compliant cluster file system. ● Red Hat GFS executes on Red Hat Enterprise Linux servers attached to a storage area network (SAN). ● GFS runs on all major server and storage platforms supported by Red Hat. ● Allows simultaneous reading and writing of blocks to a single shared file system on a SAN. ● GFS can be configured without any single points of failure. ● GFS can scale to hundreds of Red Hat Enterprise Linux servers. ● GFS is compatible with all standard Linux applications. ● Supports direct I/O by databases ● Improves database performance by avoiding traditional file system overhead. Los Alamos National Laboratory
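To illustrate the direct I/O point, here is a minimal sketch of an O_DIRECT write of one aligned block, the style of access a database engine uses to bypass file system caching. The path is hypothetical and error handling is trimmed; O_DIRECT on Linux requires _GNU_SOURCE and block-aligned buffers and sizes.

```c
/* Minimal O_DIRECT write sketch; hypothetical path, trimmed error handling. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    /* 4 KiB alignment and size satisfy typical block-device constraints. */
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;
    memset(buf, 0, 4096);

    int fd = open("/gfs/db/tablespace.dat",          /* hypothetical GFS path */
                  O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;
    ssize_t n = pwrite(fd, buf, 4096, 0);   /* aligned write, bypasses page cache */
    close(fd);
    free(buf);
    return n == 4096 ? 0 : 1;
}
```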
  • 33. Red Hat Global File System ● Red Hat Enterprise Linux allows organizations to utilize the default Linux file system, Ext3 (Third Extended file-system), NFS (Network File System) or Red Hat's GFS cluster file system. ● Ext3 is a journaling file system, which uses log files to preserve the integrity of the file system in the event of a sudden failure. It is the standard file system used by all Red Hat Enterprise Linux systems. ● NFS is the de facto standard approach to accessing files across the network. ● GFS (Global File System) allows multiple servers to share access to the same files on a SAN while managing that access to avoid conflicts. ● Sistina Software, the original developer of GFS, was acquired by Red Hat at the end of 2003. Subsequently, Red Hat contributed GFS to the open source community under the GPL license. ● GFS is provided as a fully supported, optional layered product for Red Hat Enterprise Linux systems. Los Alamos National Laboratory
  • 34. GFS Logical Volume Manager ● Red Hat Enterprise Linux includes the Logical Volume Manager (LVM), which provides kernel-level storage virtualization capabilities. LVM supports combining physical storage elements into a collective storage pool, which can then be allocated and managed according to application requirements, without regard for the specifics of the underlying physical disk systems. ● Initially developed by Sistina, LVM is now part of the standard Linux kernel. ● LVM provides enterprise-level volume management capabilities that are consistent with the leading, proprietary enterprise operating systems. ● LVM capabilities include: ● Storage performance and availability management by allowing for the addition and removal of physical devices and through dynamic disk volume resizing. Logical volumes can be resized dynamically online. ● Ext3 supports offline file system resizing (requiring unmount, resize, and mount operations). ● Disk system management that enables the upgrading of disks, removal of failing disks, reorganization of workloads, and adaptation of storage capacity to changing system needs. Los Alamos National Laboratory
  • 35. GFS Multi-Pathing ● Red Hat GFS works in concert with Red Hat Cluster Suite to provide failover of critical computing components for high availability. ● Multi-path access to storage is essential to continued availability in the event of path failure (such as failure of a Host Bus Adapter). ● Red Hat Enterprise Linux’s multi-path device driver (MD driver) recognizes multiple paths to the same device, eliminating the problem of the system assuming each path leads to a different disk. ● The MD driver combines the paths to a single disk, enabling failover to an alternate path if one path is disrupted. Los Alamos National Laboratory
  • 36. GFS Enterprise Storage Options ● Although SAN and NAS have emerged as the preferred enterprise storage approach, direct attached storage remains widespread throughout the enterprise. Red Hat Enterprise Linux supports the full set of enterprise storage options: ● Direct attached storage ● SCSI ● ATA ● Serial ATA ● SAS (Serial Attached SCSI) ● Networked storage ● SAN (access to block-level data over Fibre Channel or IP networks) ● NAS (access to data at the file level over IP networks) ● Storage interconnects ● Fibre Channel (FC) ● iSCSI ● GNBD (global network block device) ● NFS Los Alamos National Laboratory
  • 37. GFS on SAN’s ● SANs provide direct block-level access to storage. When deploying a SAN with the Ext3 file system, each server mounts and accesses disk partitions individually. Concurrent access is not possible. When a server shuts down or fails, the clustering software will “failover” its disk partitions so that a remaining server can mount them and resume its tasks. ● Deploying GFS on SAN-connected servers allows full sharing of all file system data, concurrently. These two configuration topologies are shown in the diagram. Los Alamos National Laboratory
  • 38. GFS on NFS ● In general, an NFS file server, usually configured with local storage, will serve file-level data across a network to remote NFS clients. This topology is best suited for non-shared data files (individual users' directories, for example) and is widely used in general purpose computing environments. ● NFS configurations generally offer lower performance than block-based SAN environments, but they are configured using standard IP networking hardware and so offer excellent scalability. They are also considerably less expensive. Los Alamos National Laboratory
  • 39. GFS on iSCSI ● Combining the performance and sharing capabilities of a SAN environment with the scalability and cost effectiveness of a NAS environment is highly desirable. ● A topology that achieves this uses SAN technology to provide the core (“back end”) physical disk infrastructure, and then uses block-level IP technology to distribute served data to its eventual consumer across the network. ● The emerging technology for delivering block-level data across a network is iSCSI. ● This has been developing slowly for a number of years, but as the necessary standards have stabilized, adoption by industry vendors has started to accelerate considerably. ● Red Hat Enterprise Linux currently supports iSCSI. Los Alamos National Laboratory
  • 40. GFS on GNBD ● As an alternative to iSCSI, Red Hat Enterprise Linux provides support for Red Hat’s Global Network Block Device (GNBD) protocol, which allows block-level data to be accessed over TCP/IP networks. ● The combination of GNBD and GFS provides additional flexibility for sharing data on the SAN. This topology allows a GFS cluster to scale to hundreds of servers, which can concurrently mount a shared file system without the expense of including a Fibre Channel HBA and associated Fibre Channel switch port with every machine. ● GNBD can make SAN data available to many other systems on the network without the expense of a Fibre Channel SAN connection. ● Today, GNBD and iSCSI offer similar capabilities; however, GNBD is a mature technology while iSCSI is still relatively new. ● Red Hat provides GNBD as part of Red Hat Enterprise Linux so that customers can deploy IP network-based SANs today. ● As iSCSI matures it is expected to supplant GNBD, offering better performance and a wider range of configuration options. An example configuration is shown in the diagram that follows. Los Alamos National Laboratory
  • 41. GFS Summary ● Enterprises can now deploy large sets of open source, commodity servers in a horizontal scalability strategy and achieve the same levels of processing power for far less cost. ● Such horizontal scalability can lead an organization toward utility computing, where server and storage resources are added as needed. Red Hat Enterprise Linux provides substantial server and storage flexibility; the ability to add and remove servers and storage and to redirect and reallocate storage resources dynamically. Los Alamos National Laboratory
  • 42. Summary ● Panasas is a clustered, asymmetric, parallel, object-based, distributed file system. ● Implements the file system entirely in hardware. ● Claims the highest sustained data rate of the four systems reviewed. ● Lustre is a clustered, asymmetric, parallel, object-based, distributed file system. ● An open, standards-based system. ● Great modularity and compatibility with interconnects, networking components and storage hardware. ● Currently only available for Linux. ● Parallel Virtual File System 2 (PVFS2) is a clustered, symmetric, parallel, aggregation-based, distributed file system. ● Data access is achieved without file or metadata locking. ● PVFS2 is best suited for I/O-intensive (i.e., scientific) applications. ● PVFS2 could be used for high-performance scratch storage where data is copied and simulation results are written from thousands of cycles simultaneously. ● Red Hat Global File System (GFS) is a clustered, symmetric, parallel, block-based, distributed file system. ● An open, standards-based system. ● Great modularity and compatibility with interconnects, networking components and storage hardware. ● A relatively low-cost, SAN-based technology. ● Only available on Red Hat Enterprise Linux. Los Alamos National Laboratory
  • 43. Conclusions ● No single clustered parallel file system can address the requirements of every environment. ● Hardware-based implementations have greater throughput than software-based implementations. ● Standards-based implementations exhibit greater modularity and flexibility in interoperating with third-party components and appear most open to the incorporation of new technology. ● All implementations appear to scale well into the range of thousands of clients, hundreds of servers and hundreds of TB of storage. ● All implementations appear to address the issues of hardware and software redundancy, component failover, and avoidance of a single point of failure. ● All implementations exhibit the ability to take advantage of low-latency, high-bandwidth interconnects, thus avoiding the overhead associated with TCP/IP networking. Los Alamos National Laboratory
  • 44. Questions? Los Alamos National Laboratory
  • 45. References Panasas: http://www.panasas.com/docs/Object_Storage_Architecture_WP.pdf Lustre: http://www.lustre.org/docs/whitepaper.pdf A Next-Generation Parallel File System for Linux Clusters: http://www.pvfs.org/files/linuxworld-JAN2004-PVFS2.ps Red Hat Global File System: http://www.redhat.com/whitepapers/rha/gfs/GFS_INS0032US.pdf Red Hat Enterprise Linux: Creating a Scalable Open Source Storage Infrastructure: http://www.redhat.com/whitepapers/rhel/RHEL_creating_a_scalable_os_storage_infrastructure.pdf Exploring Clustered Parallel File Systems and Object Storage by Michael Ewan: http://www.intel.com/cd/ids/developer/asmona/eng/238284.htm?prn=Y Los Alamos National Laboratory