Storage for Science

Methods for Managing Large and Rapidly Growing Data Stores
in Life Science Research Environments


An Isilon® Systems Whitepaper

August 2008




Prepared by:
Table of Contents


Introduction

Requirements for Science

   “Large” Capacity

   Accelerating Growth

   Variable File Types and Operations

   Shared Read/Write Access

   Ease of Use

Understanding the Alternatives

   Common Feature Trade-offs

   Direct Attached Storage (DAS)

   Storage Area Network (SAN)

   Network Attached Storage (NAS)

   Asymmetric Clustered Storage

   Symmetric Clustered Storage

Isilon Clustered Storage Solution

   OneFS Operating System

   Inherent High Availability & Reliability

   Single Level of Management

   Linear Scalability in Performance & Capacity

Conclusion




Introduction

This document is intended to inform the Life Science researcher with large and rapidly growing
data storage needs. We explore many of the storage requirements common to Life Science
research and explain the evolution of modern storage architectures from local disks through
symmetric clustered storage. Finally, we present Isilon’s IQ clustered storage solution in detail.



Requirements for Science

“Large” Capacity
Many branches of Life Science research involve the generation, accumulation, analysis, and
distribution of “large” amounts of data. What is considered “large” changes rapidly as data
generation increases through advances in scientific methods and instrumentation. These
advances are offset by capacity increases in storage technologies that are undergoing their own
rapid evolution. Presently, Neuro-Imaging and Next-Generation Sequencing are branches of
science churning out massive amounts of data that push the limits of “large”. We will explore
these two specific examples in further detail.

Neuro-Imaging
A common Neuro-Imaging experiment involves fMRI (Functional Magnetic Resonance Imaging)
to determine activated regions of the brain in response to a stimulus. This “brain mapping” is
achieved by observing increased blood flow to the activated areas of the brain using an fMRI
scanner. The scanning of a single human test subject might occur over a 60 to 90 minute period,
with hundreds of discrete scans every few seconds, generating as much as 1GB of data per
subject. A single instrument operating at only 50% capacity can produce many terabytes (1,000s
of GBs) of data per year. The Neuro-Imaging centers interviewed for this paper utilize up to ten
instruments, supporting dozens of scientists, each allocated a baseline of 2TB of disk space for
their ongoing experiments. While this rapid scaling is a significant challenge for many labs, data
growth of 10 to 20 TB per year is not unusual in these environments.
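
To make these figures concrete, the short Python sketch below projects total capacity for such a center over a multi-year horizon. The per-scientist baseline and annual growth rate come from the figures above; the number of scientists and the planning horizon are illustrative assumptions.

    # Capacity projection for a Neuro-Imaging center (a sketch only).
    # Per-scientist baseline and annual growth follow the figures quoted above;
    # the number of scientists and the planning horizon are assumptions.

    BASELINE_PER_SCIENTIST_TB = 2       # baseline allocation per scientist
    NUM_SCIENTISTS = 36                 # "dozens of scientists" (assumed)
    ANNUAL_GROWTH_TB = (10, 20)         # reported growth of 10-20 TB per year
    YEARS = 5                           # planning horizon (assumed)

    baseline_tb = BASELINE_PER_SCIENTIST_TB * NUM_SCIENTISTS
    low_tb = baseline_tb + ANNUAL_GROWTH_TB[0] * YEARS
    high_tb = baseline_tb + ANNUAL_GROWTH_TB[1] * YEARS

    print(f"Baseline allocation: {baseline_tb} TB")
    print(f"Projected need after {YEARS} years: {low_tb}-{high_tb} TB")

Even with conservative assumptions, the baseline allocation alone reaches tens of terabytes, and the projected need roughly doubles again over the planning horizon.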

“Next-Generation” DNA Sequencing
DNA sequencing has undergone a revolution in recent years. Driven by novel sequencing
chemistries, micro-fluidic systems, and reaction detection methods, “Next-Generation”
sequencing instruments from 454, Illumina, ABI, and Helicos offer 100- to 1,000-fold higher
throughput, combined with a 100- to 1,000-fold lower cost per nucleotide, when compared with
conventional Sanger sequencing. This change has put high-throughput genome
sequencing, once achievable by only a few major sequencing centers, within reach of many
smaller research groups and individual research labs. The result for such labs is a dramatic
increase in storage requirements from gigabytes to petabytes (1 million GB) in only the course of
a couple of years.

Each Next-Generation sequencing platform is unique in terms of the nature and volume of the
data it generates. Typically, anywhere from 600GB (gigabytes) to 6TB (terabytes) of primary
image data is written over a period of one to three days. By today’s standards, a terabyte is not
large. However, for a single laboratory, accumulating and moving terabytes of data per day
without loss can be a significant challenge, especially for small sequencing labs that have not yet
adopted a highly scalable storage solution.
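
A quick transfer-time estimate, sketched below in Python, illustrates why moving even a single run's output is non-trivial. The run sizes follow the 600 GB to 6 TB range above; the network speeds and the protocol-efficiency factor are assumptions.

    # Hours needed to move one sequencing run over a lab network (a sketch).
    # Run sizes follow the 600 GB - 6 TB range above; link speeds and the
    # protocol-efficiency factor are assumptions.

    def transfer_hours(data_tb, link_gbps, efficiency=0.6):
        """Hours to move data_tb terabytes over a link_gbps link."""
        bits = data_tb * 1e12 * 8                     # terabytes -> bits
        seconds = bits / (link_gbps * 1e9 * efficiency)
        return seconds / 3600

    for run_tb in (0.6, 6.0):
        for link_gbps in (1, 10):                     # 1 GbE vs. 10 GbE
            hours = transfer_hours(run_tb, link_gbps)
            print(f"{run_tb:>4} TB over {link_gbps:>2} GbE: {hours:5.1f} hours")

At gigabit speeds, a full 6 TB run occupies the link for the better part of a day before any retries or verification, which is why instrument staging storage and the central repository must be planned together.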




Accelerating Growth
Storage capacity planning for Life Science research is particularly difficult in that requirements
change rapidly and at irregular rates. Planning for growth according to the number of users or
number of instruments is often insufficient when, for instance, a new grant can double capacity
needs. Similarly, a revolutionary new instrument might increase data production by an order of
magnitude or more. To be responsive to the requirements of Life Science research, an ideal storage
architecture must be scalable in both small and large increments without requiring a system
redesign or replacement. Ideally, a storage solution should have “pay-as-you-grow”
characteristics that allow for growth as-needed.

Variable File Types and Operations
Life Science data is highly variable, both in composition and in the way that it is accessed.
Therefore, an ideal storage system for Life Science organizations must have good I/O
performance across these varied use cases:

    -   Many small files or fewer big files
    -   Text files and binary files
    -   Sequential and random access
    -   Highly concurrent access

This variability is common to both neuro-imaging and next-generation sequencing. Massive
simultaneous computations are performed on many large primary image files, each gigabytes in
size and requiring highly parallel streaming I/O, and they produce fewer, smaller text files. The
resulting data might be kept within directories containing thousands to hundreds of thousands of
files, totaling many terabytes.
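
One practical way to see which of these cases dominates an existing data set is to profile it directly. The Python sketch below walks a directory tree and reports how many files fall under a small-file threshold; the root path and the 1 MB threshold are placeholder assumptions.

    # Profile file count and size distribution under a directory tree (a sketch).
    # The root path and the 1 MB "small file" threshold are placeholders.
    import os

    ROOT = "/data/experiments"           # hypothetical path
    SMALL_FILE_BYTES = 1 << 20           # count files under 1 MB as "small"

    file_count = small_count = total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(ROOT):
        for name in filenames:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue                 # skip files removed mid-scan
            file_count += 1
            total_bytes += size
            if size < SMALL_FILE_BYTES:
                small_count += 1

    print(f"{file_count} files, {total_bytes / 1e12:.2f} TB total")
    if file_count:
        print(f"{100 * small_count / file_count:.0f}% of files are under 1 MB")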

Shared Read/Write Access
Storage systems for Life Science data must be simultaneously accessible to many instruments,
users, analysis computers, and data servers. These storage systems cannot reside in isolated
silos with limited accessibility. They must, instead, permit concurrent, integrated, file-level
read/write access across the entire organization with I/O bandwidth that scales to accommodate
concurrent demand.

A typical Neuro-Imaging or Next-Generation sequencing workflow involves the following steps:

    -   Multiple instruments generate primary image data.
    -   Large memory SMP machines and compute clusters distill the primary data into a derived
        form.
    -   Researchers evaluate and annotate the data to answer scientific questions.
    -   Researchers iterate on the above process, adding more primary data and refining their
        analyses.
    -   Finally, results are served to a wider audience via internet repositories, usually accessed
        via FTP or HTTP.

The requirements of the workflow above are the sum of requirements from instruments,
researchers, computing systems, and customers. A sustainable storage plan for even a small
research organization requires a system with shared, file-level read/write access to a common,
large, scalable storage repository and should allow access by these common protocols:

    -   NFS (Network File System) – The common network file system for UNIX instruments and
        analysis computers
    -   SMB/CIFS (Server Message Block/Common Internet File System) – The common
        network file system for Windows-based instruments and user desktops



    -   HTTP (Hypertext Transfer Protocol) – The file transfer protocol used in the World Wide
        Web
    -   FTP (File Transfer Protocol) – A common internet file transfer protocol for disseminating
        data
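
As a small illustration of the last two protocols in the list above, the Python sketch below retrieves a published result set from an internet repository over HTTP and FTP using only the standard library. The host names and file paths are hypothetical placeholders.

    # Retrieve published results over HTTP and FTP (a sketch).
    # Host names and paths are hypothetical placeholders.
    import urllib.request
    from ftplib import FTP

    # HTTP download of a summary file
    url = "http://data.example.org/results/run42/summary.txt"
    with urllib.request.urlopen(url) as response, open("summary.txt", "wb") as out:
        out.write(response.read())

    # Anonymous FTP download of an archived result set
    with FTP("ftp.example.org") as ftp:
        ftp.login()                      # anonymous login
        with open("run42.tar.gz", "wb") as out:
            ftp.retrbinary("RETR /results/run42.tar.gz", out.write)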

Ease of Use
At many levels, ease of use is the most significant storage requirement for Life Science research,
even though it is generally the most difficult to quantify.

Management
The human resources required to maintain a large storage system range from just above zero
to many FTEs (full-time equivalents). The management of an ideal storage system should not
require the hiring of additional, dedicated IT staff.

Scaling
Scaling a storage system’s capacity and/or performance, whether by fractional amounts or by
orders of magnitude, should not require man-months of planning meetings or man-days of
specialized IT expertise to implement. Scaling an ideal storage system should take minutes,
regardless of the size of the increment.

User
Ideally, the researcher is focused on science, not computers or disks. The researcher shouldn’t
be concerned with or aware of volumes, capacity, formats, or how to access their data. Upon
scaling storage a user might notice that capacity suddenly increased, but never experience an
interruption in service.



Understanding the Alternatives

Common Feature Trade-offs
Like most products, storage solutions compete based on their features. An ideal storage solution
would excel at all features: have high I/O performance rates, never become inaccessible, never
lose data, have the ability to become infinitely large, be scalable in both large and small
increments, have a low purchase price, require little human effort to manage, and be easy to use.

In the real world, decisions are based on which of these requirements are most important within a
given budget.

Storage decisions typically reduce to four factors:

    -   Will this provide me sufficient performance and capacity for my present needs?
    -   Will I experience any significant down-time or data loss?
    -   Do I have the human resources needed to manage the system?
    -   How long will it be before I need to upgrade this system and at what cost?

When designing storage systems in a scientific research environment, many variables come into
play. Present capacity needs may be the easiest to quantify, but are only a starting point.
Performance requirements aren’t generally known until after the storage has been deployed and
workflows are executed against data. Data loss is known to be a very bad thing, but quantifying
the cost of loss is difficult when the core value to the lab might be a publication or a discovery.
Labor costs may be very indirect; the use of graduate students as part-time systems
administrators is a prime example. Students come and go, which can impose high additional
costs if storage systems are difficult to learn or require specialized training. Particularly in primary
research, very early in a product pipeline, it can be difficult to set dollar values on these factors.
However, such variables must be considered in order to make sensible storage infrastructure
decisions.

All storage systems have maximum scaling limits. With some, however, once this limit is reached
one must either (a) deploy a second, standalone system or (b) perform a “forklift upgrade” in which
major components are retired and replaced with bigger, newer ones.

In addition to the decision parameters above, some storage architectures can be tuned to
optimize certain facets of their performance. This means that they can be configured to excel at
one feature or another, but not all at the same time:

    -   I/O Performance – The speed of writing data to disk and reading it back to the CPU and
        user. This might be further broken down into transactional, sequential, and random
        access patterns.
    -   Availability – The cost associated with maintaining uninterrupted access to the data
    -   Reliability – The cost associated with mitigating the risk of data loss
    -   Maximum Scalability – The largest data volume the storage system can ever hold
    -   Dynamic Configuration – The cost in time and effort to make a change to the system
    -   Resolution of Scale – The smallest increment by which the storage can be made larger
    -   Purchase Price – The cost to purchase the system
    -   Total Cost of Ownership – The cost to buy and operate the system over its useful life
    -   Ease of Use – The cost in time and effort to get it working and keep it working


Direct Attached Storage (DAS)




Discrete Disk
The local disk method is the most familiar means of providing storage, either directly attached to
the motherboard of a system, or attached through USB, Firewire, SCSI, or Fibre Channel cables.
While it may seem odd, when discussing critical, enterprise-class storage systems, to consider
using discrete local disks, the explosion of data and the ubiquity of simple DAS-type storage mean
that labs with shelves full of Firewire drives serving as long-term archives are not uncommon.
These discrete disks can be purchased for as little as $200-$300 per TB and are fairly easy to
use. In the very short term, this may seem to be a reasonable
purchasing decision, coupled with the reagents necessary for an instrument run. However,
combining these disks into a larger storage pool is impossible. Accessing the needed data means
locating the correct disk and plugging it into the correct computer. This is not practical at scale,
even for very small environments. We describe this only as a baseline, and it should not be
considered a reasonable enterprise-class storage solution.




Redundant Array of Independent Disks (RAID)
RAID is a technology that combines two or more disks to achieve greater volume capacity,
performance, or reliability. A RAID might be directly attached to one computer as in DAS, or
indirectly by a SAN (Storage Area Network), or provide the underlying storage for a NAS
(Network Attached Storage). With RAID, data is striped across all of the disks to create a “RAID
set,” and depending on how data and parity are distributed across the individual disks, a RAID is
highly tunable:

        RAID 0: Maximum capacity, maximum risk
        RAID 1: Maximum read performance, minimum risk
        RAID 5: Balance capacity, performance, and risk
        RAID 6: Capacity and performance, with less risk than RAID 5
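
A simplified comparison of these levels by usable capacity and disk-failure tolerance is sketched below. It assumes a set of identical drives, treats RAID 1 as a straightforward two-way mirror, and ignores hot spares and controller overhead.

    # Usable capacity and failure tolerance of common RAID levels (simplified).
    # Assumes identical drives; RAID 1 is modeled as a two-way mirror.

    def raid_summary(level, n_disks, disk_tb):
        if level == 0:
            return n_disks * disk_tb, 0
        if level == 1:
            return n_disks * disk_tb / 2, 1
        if level == 5:
            return (n_disks - 1) * disk_tb, 1
        if level == 6:
            return (n_disks - 2) * disk_tb, 2
        raise ValueError("unsupported RAID level")

    for level in (0, 1, 5, 6):
        usable_tb, tolerated = raid_summary(level, n_disks=8, disk_tb=1.0)
        print(f"RAID {level}: {usable_tb:.1f} TB usable, "
              f"survives {tolerated} disk failure(s)")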


Storage Area Network (SAN)




A SAN is an indirect means of attaching disks to a computer. Rather than connecting directly to
the motherboard, a computer connects indirectly through a SAN (typically via Fibre Channel,
iSCSI, or a proprietary low-latency network). A SAN provides a convenient means to centralize the
management of many disks into a common storage pool and then permit the allocation of logical
portions of this pool out to one or more computers. This common storage pool can be expanded
by attaching additional disks to the SAN. SAN clients have block-level access (which appears as a
local disk) to “their” portion of the SAN; however, they have no access to portions allocated
to other computers.




Network Attached Storage (NAS)




DAS and SAN are methods for attaching disks to a computer, and RAID is a technology for
configuring many disks to achieve different I/O characteristics. A NAS is neither of these. A NAS
is a computer (file server) that contains (or is attached to) one or more local disks or RAIDs and
operates a file-sharing program that provides concurrent, file-level access to many computers
over a TCP/IP network. Therefore, the performance of a NAS is often limited by the number of
clients competing for access to the file server.

NAS/SAN Hybrid
Naturally, one can provide the files served by a NAS through disks provided over a SAN. A
NAS/SAN hybrid is a common approach to overcoming the performance limitations of a NAS by
distributing file server load over multiple file servers attached to a common SAN. Theoretically,
this provides a means to increase the number of file servers in response to the demand of
competing clients. This approach often fails when one of the file servers is serving a SAN volume
with more “popular” data on it. Responding to this failure requires careful monitoring of the
demand and moving/copying/synchronizing data across various SAN volumes. As capacity
grows, the management of a NAS/SAN Hybrid becomes more complex.


Asymmetric Clustered Storage




Asymmetric clustered storage is implemented by a software layer operating on top of a NAS/SAN
hybrid architecture. Storage from many SAN volumes is “clustered” and allocated across separate
computers, but the architecture is asymmetric in that nodes have specialized roles in providing the
service. Each NAS file server has concurrent access to all of the SAN volumes, so the manual
re-distribution of NAS servers and SAN volumes is no longer required. This concurrency is managed
through the addition of another type of computer, frequently called the “Meta Data Controller.”
This controller ensures that file servers take proper turns accessing the SAN volumes and files.



Symmetric Clustered Storage




Symmetric clustered storage provides a balanced architecture for scalable storage and
concurrent read/write access to a common network file system. It pools both the storage and the
file serving capabilities of many anonymous and identical systems. While individual “nodes” within
the cluster manage their directly attached disks, a software layer allows the nodes to work
together to present an integrated whole. This approach maximizes the ability to “scale-out”
storage by adding nodes. This is the approach offered by Isilon Systems, and it is described in
greater detail below.



Isilon Clustered Storage Solution

Life Science research presents an ever-changing, highly automated laboratory data environment.
This can be a major challenge for computing and storage vendors. Isilon IQ represents a flexible,
scalable system in which capacity can be added in parallel with new equipment or laboratory
capacity. Once the system is configured, scientists can reasonably expect to spend most of their
time doing science rather than constantly worrying about whether a new sequencing machine will
require a complete overhaul of the storage system. Isilon IQ’s biggest strengths are the ability to
scale capacity and performance linearly, together or independently; symmetric data access; and
ease of use.
Several customers shared that the staffing requirements for their data storage environment
dropped to near zero once they installed the Isilon system.

The Isilon storage product is a fully symmetric, file-based, clustered architecture system. The
hardware consists of industry standard x86 servers that arrive pre-configured with Isilon’s
patented OneFS operating system. These “nodes” connect to each other using a low-latency,
high-bandwidth InfiniBand network.

Filesystem clients access data over a pair of Ethernet ports on each node. Clients connect to a
single “virtual” network address and OneFS dynamically distributes the actual connections to the
nodes. This means that input/output bandwidth, caching, and latency are shared across the entire
system. Performance scales smoothly as nodes are added, in contrast to gateway-based
master-slave architectures where the gateway inevitably becomes a bottleneck. Because all connections
go to the same virtual address, adding or removing nodes from the system requires no client-side
reconfiguration. Users simply continue with the same data volume, but with more capacity.
OneFS manages all aspects of the system including detecting and optimizing new nodes. This
makes configuration and expansion incredibly simple. All the usual complexity associated with
adding new storage is an “under the hood” activity managed by the operating system.

Administrators add new nodes to an existing cluster by connecting the network and power cables
and then powering on. The nodes detect an existing cluster and make themselves available to it.
OneFS then smoothly re-balances data across all the nodes. During this process, all filesystem
mount points remain available. Compared with the usual headache of taking down an entire data
storage system for days at a time (to migrate existing data off to tapes, create a new larger
system, and re-import the data), this is a fairly incredible benefit.

Isilon nodes of different ages and sizes can be intermixed. This eliminates one of the major risks
of investing in an integrated storage system. Computing hardware doubles in capacity and speed
on a regular basis, meaning that any system intended for use over more than a year or two must
take into account that expansions may consist of substantially different hardware than the original
system. Storage architectures without the capability to smoothly support heterogeneous hardware
frequently require a “forklift” upgrade, in which the old system is decommissioned to make space
for a new one. Organizations are then in a constant state of flux, migrating data, with the
associated chaos in both user access and automated pipelines. With an Isilon system, by
contrast, older nodes may remain in service as long as their hardware is functional. Since data
management and re-balancing are managed by OneFS, when it is time to retire a storage node,
administrators simply instruct the system to migrate the data off of that node, wait for the
migration to complete, and then turn off and remove it from the cluster. During migration all data
is available for user access.

OneFS Operating System
To address the scalable distributed file system requirement of clustered storage, Isilon built,
patented, and delivered the revolutionary OneFS operating system, which combines three layers
of traditional storage architectures – the file system, volume manager and RAID controller – into a
unified software layer. This creates a single intelligent, fully symmetrical file system which spans
all nodes within a cluster.

File striping in the cluster takes place across multiple storage nodes versus the traditional method
of striping across individual disks within a volume/RAID array. This provides a very specific
benefit: no one node is 100% responsible for any particular file. An Isilon IQ system can
withstand the loss of multiple disks or entire nodes without losing access to any data. OneFS
provides each node with knowledge of the entire file system layout. Accessing any independent
node gives a user access to all content in one unified namespace, meaning that there are no
volumes or shares, no inflexible volume size limits, no downtime for reconfiguration or expansion
of storage and no multiple network drives to manage. This seamless integration of a fully
symmetric clustered architecture makes Isilon unique in the storage landscape.

Inherent High Availability & Reliability
Non-symmetric data storage architectures contain intrinsic dependencies and create points of
failure and bottlenecks. One way to ensure data integrity and eliminate single points of failure is
to make all nodes in a cluster peers. Each node in an Isilon cluster can handle a request from any
client, and can provide any content. If any particular node fails, the other nodes dynamically fill in.
This is true for all the tasks of the filesystem. Because there is no dedicated “head” to the system,
all aspects of system performance can be balanced across the entire system. This means that
I/O, availability, storage capacity, and resilience to hardware failure are all distributed.

OneFS has further increased the availability of Isilon’s clustered storage solution by providing
multi-failure support of n+4. Simply stated, this means an Isilon cluster can withstand the
simultaneous loss of an unprecedented four disks or four entire nodes without losing access to
any content – and without requiring dedicated parity drives. Additionally, self-healing capabilities
and high levels of hardware redundancy greatly reduce the chances of a production node failing
in the first place.
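
The trade-off between protection level and usable space can be illustrated with a simplified model in which each stripe carries four error-correcting units spread across the nodes. The stripe widths below are illustrative, and the actual OneFS data layout differs in detail.

    # Simplified "+4" protection model: each stripe holds one unit per node,
    # four of which are error-correcting. Stripe widths are illustrative;
    # the actual OneFS data layout differs in detail.

    def usable_fraction(total_units, protection_units=4):
        return (total_units - protection_units) / total_units

    for nodes in (6, 9, 12, 16):
        frac = usable_fraction(nodes)
        print(f"{nodes:2d} nodes: {frac:.0%} usable, "
              f"tolerates 4 simultaneous disk/node failures")

As the sketch shows, the relative overhead of the four protection units shrinks as the cluster grows, which is one reason distributed protection scales better than dedicated parity drives.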

In the event of a failure, OneFS automatically re-builds files across all of the existing distributed
free space in the cluster in parallel, eliminating the need for dedicated “hot” spare drives. OneFS
leverages all available free space across all nodes in the cluster to rebuild data. This minimizes
the window of vulnerability during the rebuild process.

Isilon IQ constantly monitors the health of all files and disks and maintains records of the SMART
statistics (e.g., recoverable read errors) available on each drive in order to anticipate when that
drive will fail. This is simply an automation of a “best practice” that many system administrators
wish they had time to do. When Isilon IQ identifies an at-risk component, it preemptively migrates
data off of the at-risk disk to available free space on the cluster in a manner that is both automatic
and transparent to the user. Once the data is migrated, the administrator is notified to service the
suspect drive in advance of actual failure. This feature provides customers with confidence that
data written today will be stored 100 percent reliably, and available whenever it is needed. No
other solution today provides this level of data protection and reliability.
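
On ordinary servers, the monitoring practice described above can be approximated with the smartmontools package. The sketch below shells out to smartctl and flags drives that report reallocated sectors; the device list and alert threshold are assumptions, and this is not a description of Isilon's implementation.

    # Poll SMART attributes with smartctl (smartmontools) and flag suspect
    # drives. Device names and the alert threshold are assumptions; this is
    # only an approximation of the practice described above.
    import subprocess

    DEVICES = ["/dev/sda", "/dev/sdb"]       # hypothetical device names
    REALLOCATED_THRESHOLD = 1                # any reallocated sector is suspect

    for dev in DEVICES:
        result = subprocess.run(["smartctl", "-A", dev],
                                capture_output=True, text=True)
        for line in result.stdout.splitlines():
            if "Reallocated_Sector_Ct" in line:
                raw_value = int(line.split()[-1])
                if raw_value >= REALLOCATED_THRESHOLD:
                    print(f"WARNING: {dev} reports {raw_value} reallocated sectors")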

Single Level of Management
An Isilon IQ cluster creates a single, shared pool of all content, providing one point of access for
users and one point of management for administrators, with a single file system or volume of up
to 1.6 petabytes. Users can connect to any storage node and securely access all of the content
within the entire cluster. This means there is a single point for all applications to connect to and that
every application has visibility and access to every file in the entire file system (per security and
permission policies, of course).

Linear Scalability in Performance & Capacity
One of the key benefits of an Isilon IQ cluster is the ease with which it allows administrators to
add both performance and capacity without downtime or application changes. System
administrators simply insert a new Isilon IQ storage node, connect the network cables and power
up. The cluster automatically detects the newly added storage node and begins to configure it as
a member of the cluster. In less than 60 seconds, an administrator can grow a single file system
by 2 to 12 terabytes and increase throughput by an additional 2 gigabits per second.
Isilon’s modular approach offers a building block (or “pay-as-you-grow”) solution so customers
aren’t forced to buy more storage capacity than is needed up front. The modular design of an
Isilon cluster also enables customers to incorporate new technologies in the same cluster, such
as adding a node with higher-density disk drives, more CPU horsepower or more total
performance.
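
As a rough planning aid, the sketch below estimates how many nodes a target capacity implies and the aggregate throughput that comes with them, using the per-node figures quoted above. The target capacity and the per-node size chosen here are assumptions.

    # Nodes required for a target capacity, and the resulting aggregate
    # throughput, using the per-node figures above (2-12 TB, ~2 Gb/s each).
    # The target capacity and chosen per-node size are assumptions.
    import math

    TARGET_TB = 200                    # desired usable capacity (assumed)
    TB_PER_NODE = 6                    # mid-range node size (assumed)
    GBPS_PER_NODE = 2                  # added throughput per node

    nodes = math.ceil(TARGET_TB / TB_PER_NODE)
    print(f"{nodes} nodes -> {nodes * TB_PER_NODE} TB capacity, "
          f"~{nodes * GBPS_PER_NODE} Gb/s aggregate throughput")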



Conclusion

We have described the data storage needs of Life Science researchers, summarized the major
data storage architectures currently in use, and presented the Isilon IQ product as a strong and
flexible solution to those needs. Data storage is a critical component of modern scientific
research. As smaller labs and individual researchers become responsible for terabytes and
petabytes of data, understanding the options and trade-offs will become ever more critical.



Weitere ähnliche Inhalte

Was ist angesagt?

HPC lab projects
HPC lab projectsHPC lab projects
HPC lab projectsJason Riedy
 
Where is the opportunity for libraries in the collaborative data infrastructure?
Where is the opportunity for libraries in the collaborative data infrastructure?Where is the opportunity for libraries in the collaborative data infrastructure?
Where is the opportunity for libraries in the collaborative data infrastructure?LIBER Europe
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsYasin Memari
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...Bonnie Hurwitz
 
Software Defined storage
Software Defined storageSoftware Defined storage
Software Defined storageKirillos Akram
 
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...Ceph Community
 
Introducing Lattus Object Storage
Introducing Lattus Object StorageIntroducing Lattus Object Storage
Introducing Lattus Object StorageQuantum
 
Drive new initiatives with a powerful Dell EMC, Nutanix, and Toshiba solution...
Drive new initiatives with a powerful Dell EMC, Nutanix, and Toshiba solution...Drive new initiatives with a powerful Dell EMC, Nutanix, and Toshiba solution...
Drive new initiatives with a powerful Dell EMC, Nutanix, and Toshiba solution...Principled Technologies
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiUnmesh Baile
 
Distributed File Systems
Distributed File SystemsDistributed File Systems
Distributed File SystemsManish Chopra
 

Was ist angesagt? (12)

HPC lab projects
HPC lab projectsHPC lab projects
HPC lab projects
 
Where is the opportunity for libraries in the collaborative data infrastructure?
Where is the opportunity for libraries in the collaborative data infrastructure?Where is the opportunity for libraries in the collaborative data infrastructure?
Where is the opportunity for libraries in the collaborative data infrastructure?
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data Genomics
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 
Software Defined storage
Software Defined storageSoftware Defined storage
Software Defined storage
 
Whither Small Data?
Whither Small Data?Whither Small Data?
Whither Small Data?
 
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...
 
Introducing Lattus Object Storage
Introducing Lattus Object StorageIntroducing Lattus Object Storage
Introducing Lattus Object Storage
 
Drive new initiatives with a powerful Dell EMC, Nutanix, and Toshiba solution...
Drive new initiatives with a powerful Dell EMC, Nutanix, and Toshiba solution...Drive new initiatives with a powerful Dell EMC, Nutanix, and Toshiba solution...
Drive new initiatives with a powerful Dell EMC, Nutanix, and Toshiba solution...
 
635 642
635 642635 642
635 642
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbai
 
Distributed File Systems
Distributed File SystemsDistributed File Systems
Distributed File Systems
 

Andere mochten auch

Internet i els drets fonamentals
Internet i els drets fonamentalsInternet i els drets fonamentals
Internet i els drets fonamentalsGrup8
 
Photo Album Important Events
Photo Album Important EventsPhoto Album Important Events
Photo Album Important EventsBhadresh Shah
 
Addison Whitney Healthcare Group
Addison Whitney Healthcare GroupAddison Whitney Healthcare Group
Addison Whitney Healthcare GroupJonathan Hall
 
Katekistak 2012 GARIZUMA
Katekistak 2012 GARIZUMA Katekistak 2012 GARIZUMA
Katekistak 2012 GARIZUMA pablokueto
 
共融觀點探討銀髮與孩童互動問題-資料蒐集
共融觀點探討銀髮與孩童互動問題-資料蒐集共融觀點探討銀髮與孩童互動問題-資料蒐集
共融觀點探討銀髮與孩童互動問題-資料蒐集tangerineqq
 
Media And Entertainment Whitepaper 090308
Media And Entertainment Whitepaper 090308Media And Entertainment Whitepaper 090308
Media And Entertainment Whitepaper 090308sydcarr
 
Dsmith soc5025
Dsmith soc5025Dsmith soc5025
Dsmith soc5025desmythe
 
American And Prov Cert
American And Prov CertAmerican And Prov Cert
American And Prov CertBhadresh Shah
 
Cuaresma 2012 - catequistas
Cuaresma 2012 - catequistasCuaresma 2012 - catequistas
Cuaresma 2012 - catequistaspablokueto
 

Andere mochten auch (10)

Internet i els drets fonamentals
Internet i els drets fonamentalsInternet i els drets fonamentals
Internet i els drets fonamentals
 
Bioloid Simulator
Bioloid SimulatorBioloid Simulator
Bioloid Simulator
 
Photo Album Important Events
Photo Album Important EventsPhoto Album Important Events
Photo Album Important Events
 
Addison Whitney Healthcare Group
Addison Whitney Healthcare GroupAddison Whitney Healthcare Group
Addison Whitney Healthcare Group
 
Katekistak 2012 GARIZUMA
Katekistak 2012 GARIZUMA Katekistak 2012 GARIZUMA
Katekistak 2012 GARIZUMA
 
共融觀點探討銀髮與孩童互動問題-資料蒐集
共融觀點探討銀髮與孩童互動問題-資料蒐集共融觀點探討銀髮與孩童互動問題-資料蒐集
共融觀點探討銀髮與孩童互動問題-資料蒐集
 
Media And Entertainment Whitepaper 090308
Media And Entertainment Whitepaper 090308Media And Entertainment Whitepaper 090308
Media And Entertainment Whitepaper 090308
 
Dsmith soc5025
Dsmith soc5025Dsmith soc5025
Dsmith soc5025
 
American And Prov Cert
American And Prov CertAmerican And Prov Cert
American And Prov Cert
 
Cuaresma 2012 - catequistas
Cuaresma 2012 - catequistasCuaresma 2012 - catequistas
Cuaresma 2012 - catequistas
 

Ähnlich wie Managing Rapidly Growing Science Data

Research and technology explosion in scale-out storage
Research and technology explosion in scale-out storageResearch and technology explosion in scale-out storage
Research and technology explosion in scale-out storageJeff Spencer
 
Vargas polyglot-persistence-cloud-edbt
Vargas polyglot-persistence-cloud-edbtVargas polyglot-persistence-cloud-edbt
Vargas polyglot-persistence-cloud-edbtGenoveva Vargas-Solar
 
Hitachi high-performance-accelerates-life-sciences-research
Hitachi high-performance-accelerates-life-sciences-researchHitachi high-performance-accelerates-life-sciences-research
Hitachi high-performance-accelerates-life-sciences-researchHitachi Vantara
 
Accelerate Discovery
Accelerate DiscoveryAccelerate Discovery
Accelerate DiscoveryPanasas
 
spectrum Storage Whitepaper
spectrum Storage Whitepaperspectrum Storage Whitepaper
spectrum Storage WhitepaperCarina Kordan
 
Chip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochureChip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochureMarco van der Hart
 
Intel life sciences_personalizedmedicine_stanford biomed 052214 dist
Intel life sciences_personalizedmedicine_stanford biomed 052214 distIntel life sciences_personalizedmedicine_stanford biomed 052214 dist
Intel life sciences_personalizedmedicine_stanford biomed 052214 distKetan Paranjape
 
Platform Overview Brochure
Platform Overview BrochurePlatform Overview Brochure
Platform Overview Brochuredstam1
 
National Institutes of Health Maximize Computing Resources with Panasas
National Institutes of Health Maximize Computing Resources with PanasasNational Institutes of Health Maximize Computing Resources with Panasas
National Institutes of Health Maximize Computing Resources with PanasasPanasas
 
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothAdaryl "Bob" Wakefield, MBA
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...dbpublications
 
Data-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and CloudData-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and CloudOla Spjuth
 
Workload Centric Scale-Out Storage for Next Generation Datacenter
Workload Centric Scale-Out Storage for Next Generation DatacenterWorkload Centric Scale-Out Storage for Next Generation Datacenter
Workload Centric Scale-Out Storage for Next Generation DatacenterCloudian
 
Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016Alluxio, Inc.
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencingGuy Coates
 
Net App Unified Storage Architecture
Net App Unified Storage ArchitectureNet App Unified Storage Architecture
Net App Unified Storage ArchitectureSamantha_Roehl
 
Net App Unified Storage Architecture
Net App Unified Storage ArchitectureNet App Unified Storage Architecture
Net App Unified Storage Architecturenburgett
 
Genomics Center Compares 100s of Computations Simultaneously with Panasas
Genomics Center Compares 100s of Computations Simultaneously with PanasasGenomics Center Compares 100s of Computations Simultaneously with Panasas
Genomics Center Compares 100s of Computations Simultaneously with PanasasPanasas
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational ScienceChelle Gentemann
 
Lecture 24
Lecture 24Lecture 24
Lecture 24Shani729
 

Ähnlich wie Managing Rapidly Growing Science Data (20)

Research and technology explosion in scale-out storage
Research and technology explosion in scale-out storageResearch and technology explosion in scale-out storage
Research and technology explosion in scale-out storage
 
Vargas polyglot-persistence-cloud-edbt
Vargas polyglot-persistence-cloud-edbtVargas polyglot-persistence-cloud-edbt
Vargas polyglot-persistence-cloud-edbt
 
Hitachi high-performance-accelerates-life-sciences-research
Hitachi high-performance-accelerates-life-sciences-researchHitachi high-performance-accelerates-life-sciences-research
Hitachi high-performance-accelerates-life-sciences-research
 
Accelerate Discovery
Accelerate DiscoveryAccelerate Discovery
Accelerate Discovery
 
spectrum Storage Whitepaper
spectrum Storage Whitepaperspectrum Storage Whitepaper
spectrum Storage Whitepaper
 
Chip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochureChip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochure
 
Intel life sciences_personalizedmedicine_stanford biomed 052214 dist
Intel life sciences_personalizedmedicine_stanford biomed 052214 distIntel life sciences_personalizedmedicine_stanford biomed 052214 dist
Intel life sciences_personalizedmedicine_stanford biomed 052214 dist
 
Platform Overview Brochure
Platform Overview BrochurePlatform Overview Brochure
Platform Overview Brochure
 
National Institutes of Health Maximize Computing Resources with Panasas
National Institutes of Health Maximize Computing Resources with PanasasNational Institutes of Health Maximize Computing Resources with Panasas
National Institutes of Health Maximize Computing Resources with Panasas
 
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
 
Data-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and CloudData-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and Cloud
 
Workload Centric Scale-Out Storage for Next Generation Datacenter
Workload Centric Scale-Out Storage for Next Generation DatacenterWorkload Centric Scale-Out Storage for Next Generation Datacenter
Workload Centric Scale-Out Storage for Next Generation Datacenter
 
Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencing
 
Net App Unified Storage Architecture
Net App Unified Storage ArchitectureNet App Unified Storage Architecture
Net App Unified Storage Architecture
 
Net App Unified Storage Architecture
Net App Unified Storage ArchitectureNet App Unified Storage Architecture
Net App Unified Storage Architecture
 
Genomics Center Compares 100s of Computations Simultaneously with Panasas
Genomics Center Compares 100s of Computations Simultaneously with PanasasGenomics Center Compares 100s of Computations Simultaneously with Panasas
Genomics Center Compares 100s of Computations Simultaneously with Panasas
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
 
Lecture 24
Lecture 24Lecture 24
Lecture 24
 

Managing Rapidly Growing Science Data

  • 1. Storage for Science Methods for Managing Large and Rapidly Growing Data Stores in Life Science Research Environments An Isilon® Systems Whitepaper August 2008 Prepared by:
  • 2. Table of Contents Introduction 3 Requirements for Science 3 “Large” Capacity 3 Accelerating Growth 4 Variable File Types and Operations 4 Shared Read/Write Access 4 Ease of Use 5 Understanding the Alternatives 5 Common Feature Trade-offs 5 Direct Attached Storage (DAS) 6 Storage Area Network (SAN) 7 Network Attached Storage (NAS) 8 Asymmetric Clustered Storage 8 Symmetric Clustered Storage 9 Isilon Clustered Storage Solution 9 OneFS Operating System 10 Inherent High Availability & Reliability 10 Single Level of Management 11 Linear Scalability in Performance & Capacity 11 Conclusion 11 ISILON SYSTEMS 2
  • 3. Introduction This document is intended to inform the Life Science researcher with large and rapidly growing data storage needs. We explore many of the storage requirements common to Life Science research and explain the evolution of modern storage architectures from local disks through symmetric clustered storage. Finally, we present Isilon’s IQ clustered storage solution in detail. Requirements for Science “Large” Capacity Many branches of Life Science research involve the generation, accumulation, analysis, and distribution of “large” amounts of data. What is considered “large” changes rapidly as data generation increases through advances in scientific methods and instrumentation. These advances are offset by capacity increases in storage technologies that are undergoing their own rapid evolution. Presently, Neuro-Imaging and Next-Generation Sequencing are branches of science churning out massive amounts of data that push the limits of “large”. We will explore these two specific examples in further detail. Neuro-Imaging A common Neuro-Imaging experiment involves fMRI (Functional Magnetic Resonance Imaging) to determine activated regions of the brain in response to a stimulus. This “brain mapping” is achieved by observing increased blood flow to the activated areas of the brain using an fMRI scanner. The scanning of a single human test subject might occur over a 60 to 90 minute period, with hundreds of discrete scans every few seconds, generating as much as 1GB of data per subject. A single instrument operating at only 50% capacity can produce many terabytes (1,000s of GBs) of data per year. The Neuro-Imaging centers interviewed for this paper utilize up to ten instruments, supporting dozens of scientists, each allocated a baseline of 2TB of disk space for their ongoing experiments. While this rapid scaling is a significant challenge for many labs, data growth of 10 to 20 TB per year is not unusual in these environments. “Next-Generation” DNA Sequencing DNA sequencing has undergone a revolution in recent years. Driven by novel sequencing chemistries, micro-fluidic systems, and reaction detection methods, “Next-Generation” sequencing instruments from 454, Illumina, ABI, and Helicos offer 100 to 1000-fold increased throughput, combined with an additional 100 to 1000-fold decreased cost per nucleotide when compared with conventional Sanger sequencing. This change has put high-throughput genome sequencing, once achievable by only a few major sequencing centers, within reach of many smaller research groups and individual research labs. The result for such labs is a dramatic increase in storage requirements from gigabytes to petabytes (1 million GB) in only the course of a couple of years. Each Next-Generation sequencing platform is unique in terms of the nature and volume of the data it generates. Typically, anywhere from 600GB (gigabytes) to 6TB (terabytes) of primary image data is written over a period of one to three days. By today’s standards, a terabyte is not large. However, for a single laboratory, accumulating and moving terabytes of data per day without loss can be a significant challenge, especially for small sequencing labs that have not yet adopted a highly scalable storage solution. ISILON SYSTEMS 3
  • 4. Accelerating Growth Storage capacity planning for Life Science research is particularly difficult in that requirements change rapidly and at irregular rates. Planning for growth according to the number of users or number of instruments is often insufficient when, for instance, a new grant can double capacity needs. Similarly a revolutionary new instrument might increase data production by an exponential amount. To be responsive to the requirements of Life Science research, an ideal storage architecture must be scalable in both small and large increments without requiring a system redesign or replacement. Ideally, a storage solution should have “pay-as-you-grow” characteristics that allow for growth as-needed. Variable File Types and Operations Life Science data is highly variable, both in composition and in the way that it is accessed. Therefore, an ideal storage system for Life Science organizations must have good I/O performance across these varied use cases: - Many small files or fewer big files - Text files and binary files - Sequential and random access - Highly concurrent access This variability is common to both neuro-imaging and next-generation sequencing. Massive simultaneous computations are performed upon many, large primary image files ranging in the gigabytes and requiring highly parallel streaming (I/O), resulting in fewer, smaller text files. The resulting data might be kept within directories containing thousands to hundreds-of-thousands of files, totaling many terabytes. Shared Read/Write Access Storage systems for Life Science data must be simultaneously accessible to many instruments, users, analysis computers, and data servers. These storage systems cannot reside in isolated silos with limited accessibility. They must, instead, permit concurrent, integrated, file-level read/write access across the entire organization with I/O bandwidth that scales to accommodate concurrent demand. A typical Neuro-Imaging or Next-Generation sequencing workflow involves the following steps: - Multiple instruments generate primary image data. - Large memory SMP machines and compute clusters distill the primary data into a derived form. - Researchers evaluate and annotate the data to answer scientific questions. - Researchers iterate on the above process, adding more primary data and refining their analyses. - Finally, results are served to a wider audience via internet repositories, usually accessed via FTP or HTTP. The requirements of the workflow above are the sum of requirements from instruments, researchers, computing systems, and customers. A sustainable storage plan for even a small research organization requires a system with shared, file-level read/write access to a common, large, scalable storage repository and should allow access by these common protocols: - NFS (Network File System) – The common network file system for UNIX instruments and analysis computers - SMB/CIFS (Server Message Block/Common Internet File System) – The common network file system for Windows-based instruments and user desktops ISILON SYSTEMS 4
  • 5. - HTTP (Hypertext Transfer Protocol) – The file transfer protocol used in the World Wide Web - FTP (File Transfer Protocol) – A common internet file transfer protocol for disseminating data Ease of Use At many levels, ease of use is the most significant storage requirement for Life Science research, even though it is generally the most difficult to quantify. Management The human resources involved in maintaining a large storage system range from just above zero to many FTEs (full time equivalents). The management of an ideal storage system should not require the hiring of additional, dedicated IT staff. Scaling Scaling a storage system’s capacity and/or performance, whether by fractional amounts or by orders of magnitude, should not require multiple man-months of meetings to plan, or even several man-days of IT technical expertise to implement. Scaling an ideal storage system should be able to be performed in minutes, independent of scale. User Ideally, the researcher is focused on science, not computers or disks. The researcher shouldn’t be concerned with or aware of volumes, capacity, formats, or how to access their data. Upon scaling storage a user might notice that capacity suddenly increased, but never experience an interruption in service. Understanding the Alternatives Common Feature Trade-offs Like most products, storage solutions compete based on their features. An ideal storage solution would excel at all features: have high I/O performance rates, never become inaccessible, never lose data, have the ability to become infinitely large, be scalable in both large and small increments, have a low purchase price, require little human effort to manage, and be easy to use. In the real world, decisions are based on which of these requirements are most important within a given budget. Storage decisions typically reduce to four factors: - Will this provide me sufficient performance and capacity for my present needs? - Will I experience any significant down-time or data loss? - Do I have the human resources needed to manage the system? - How long will it be before I need to upgrade this system and at what cost? When designing storage systems in a scientific research environment, many variables come into play. Present capacity needs may be the easiest to quantify, but are only a starting point. Performance requirements aren’t generally known until after the storage has been deployed and workflows are executed against data. Data loss is known to be a very bad thing, but quantifying the cost of loss is difficult when the core value to the lab might be a publication or a discovery. Labor costs may be very indirect; the use of graduate students as part-time systems ISILON SYSTEMS 5
  • 6. administrators is a prime example. Students come and go, which can impose high additional costs if storage systems are difficult to learn or require specialized training. Particularly in primary research, very early in a product pipeline, it can be difficult to set dollar values on these factors. However, such variables must be considered in order to make sensible storage infrastructure decisions. All storage systems have maximum scaling limits. However with some, once this limit is reached one must either (a) deploy a 2nd standalone system or (b) perform a “forklift upgrade” in which major components are retired and replaced with bigger/newer ones. In addition to the decision parameters above, some storage architectures can be tuned to optimize certain facets of their performance. This means that they can be configured to excel at one feature or another, but not all at the same time: - I/O Performance – The speed of writing data to disk and reading it back to the CPU and user. This might be further broken down into transactional, sequential, and random access patterns. - Availability – The cost associated with maintaining uninterrupted access to the data - Reliability – The cost associated with mitigating the risk of data loss - Maximum Scalability – The largest data volume the storage system can ever hold - Dynamic Configuration – The cost in time and effort to make a change to the system - Resolution of Scale – The smallest increment by which the storage can be made larger - Purchase Price – The cost to purchase the system - Total Cost of Ownership – The cost to buy and operate the system over its useful life - Ease of Use – The cost in time and effort to get it working and keep it working Direct Attached Storage (DAS) Discrete Disk The local disk method is the most familiar means of providing storage, either directly attached to the motherboard of a system, or attached through USB, Firewire, SCSI, or Fibre Channel cables. While it may seem odd (when discussing critical, enterprise-class storage systems) to consider using discrete local disks, due to the explosion of data and the ubiquity of simple DAS type storage, labs containing shelves filled with Firewire drives being used as long-term archive solutions are not that uncommon. These discrete disks can be purchased for as little as $200- $300 per TB and are fairly easy to use. In the very short term, this may seem to be a reasonable purchasing decision, coupled with the reagents necessary for an instrument run. However, combining these disks into a larger storage pool is impossible. Accessing the needed data means locating the correct disk and plugging it into the correct computer. This is not practical at scale, even for very small environments. We describe this only as a baseline, and it should not be considered a reasonable enterprise-class storage solution. ISILON SYSTEMS 6
  • 7. Redundant Array of Independent Disks (RAID) RAID is a technology that combines two or more disks to achieve greater volume capacity, performance, or reliability. A RAID might be directly attached to one computer as in DAS, or indirectly by a SAN (Storage Area Network), or provide the underlying storage for a NAS (Network Attached Storage). With RAID, data is striped across all of the disks to create a “RAID Set” and depending on how the data is sorted to the individual disks, a RAID is highly tunable: RAID 0: Maximum capacity, maximum risk RAID 1: Maximum read performance, minimum risk RAID 5: Balance capacity, performance, and risk RAID 6: Capacity and performance, with less risk than RAID 5 Storage Area Network (SAN) A SAN is an indirect means of attaching disks to a computer. Rather than connecting directly to the motherboard, a computer connects indirectly through a SAN (typically by Fibre Channel, iSCSI, or proprietary low-latency network). A SAN provides a convenient means to centralize the management of many disks into a common storage pool and then permit the allocation of logical portions of this pool out to one or more computers. This common storage pool can be expanded by attaching additional disk to the SAN. SAN clients have block-level access (which appears as a local disk) to “their” portion of the SAN, however they have no access to other portions allocated to other computers. ISILON SYSTEMS 7
Network Attached Storage (NAS)
DAS and SAN are methods for attaching disks to a computer, and RAID is a technology for configuring many disks to achieve different I/O characteristics. A NAS is neither of these. A NAS is a computer (a file server) that contains, or is attached to, one or more local disks or RAIDs and runs a file-sharing service that provides concurrent, file-level access to many computers over a TCP/IP network. The performance of a NAS is therefore often limited by the number of clients competing for access to the single file server.

NAS/SAN Hybrid
Naturally, the files served by a NAS can be stored on disks provided over a SAN. A NAS/SAN hybrid is a common approach to overcoming the performance limitations of a NAS by distributing the file-serving load over multiple file servers attached to a common SAN. In theory, this provides a means to increase the number of file servers in response to the demand of competing clients. In practice, the approach often breaks down when one of the file servers is serving a SAN volume holding more "popular" data than the others (a toy model of this failure mode appears after the discussion of asymmetric clustered storage below). Responding to such a hot spot requires careful monitoring of demand and moving, copying, or synchronizing data across the various SAN volumes. As capacity grows, the management of a NAS/SAN hybrid becomes increasingly complex.

Asymmetric Clustered Storage
Asymmetric clustered storage is implemented by a software layer operating on top of a NAS/SAN hybrid architecture. The storage of many SAN volumes is "clustered" and allocated to separate computers, but the architecture is asymmetric in that nodes have specialized roles in providing the service. Each NAS file server has concurrent access to all of the SAN volumes, so the manual redistribution of NAS servers and SAN volumes is no longer required. This concurrency is managed through the addition of another type of computer, frequently called the "Metadata Controller," which ensures that the file servers take proper turns accessing the SAN volumes and files.
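The toy model below illustrates the NAS bottleneck and the hybrid hot-spot problem described above. The throughput figures are hypothetical and the model ignores caching, protocol overhead, and mixed workloads; it is meant only to show the shape of the problem.

    # Toy model: per-client throughput behind one or more NAS file servers.
    # All numbers are hypothetical placeholders.

    def per_client_mb_s(clients_on_server, server_limit_mb_s, client_link_mb_s=100):
        """Clients split their server's bandwidth evenly, capped by their own network link."""
        return min(server_limit_mb_s / max(clients_on_server, 1), client_link_mb_s)

    server_limit = 1000   # assumed MB/s a single NAS head can deliver
    clients = 80

    # A single NAS: every client competes for the same file server.
    print("single NAS         :", per_client_mb_s(clients, server_limit), "MB/s per client")

    # NAS/SAN hybrid with four heads and perfectly even demand.
    print("4 heads, even load :", per_client_mb_s(clients // 4, server_limit), "MB/s per client")

    # The same hybrid when "popular" data pulls 60 of the 80 clients onto one head.
    print("4 heads, hot spot  :", round(per_client_mb_s(60, server_limit), 1), "MB/s per client")

The hot-spot case is the failure mode noted above: adding file servers only helps when demand spreads evenly across the SAN volumes behind them.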
Symmetric Clustered Storage
Symmetric clustered storage provides a balanced architecture for scalable storage and concurrent read/write access to a common network file system. It pools both the storage and the file-serving capabilities of many anonymous, identical systems. While the individual "nodes" within the cluster manage their directly attached disks, a software layer allows the nodes to work together to present an integrated whole. This approach maximizes the ability to "scale out" storage by adding nodes. It is the approach offered by Isilon Systems and is described in greater detail below.

Isilon Clustered Storage Solution
Life Science research presents an ever-changing, highly automated laboratory data environment, which can be a major challenge for computing and storage vendors. Isilon IQ is a flexible, scalable system in which capacity can be added in parallel with new equipment or laboratory capacity. Once the system is configured, scientists can reasonably expect to spend most of their time doing science rather than worrying about whether a new sequencing machine will require a complete overhaul of the storage system. Isilon IQ's biggest strengths are its ability to scale capacity and/or performance linearly or independently, symmetric data access, and ease of use. Several customers shared that the staffing requirements for their data storage environment dropped to near zero once they installed the Isilon system.

The Isilon storage product is a fully symmetric, file-based, clustered architecture. The hardware consists of industry-standard x86 servers that arrive pre-configured with Isilon's patented OneFS operating system. These "nodes" connect to each other over a low-latency, high-bandwidth InfiniBand network. File system clients access data over a pair of Ethernet ports on each node. Clients connect to a single "virtual" network address, and OneFS dynamically distributes the actual connections among the nodes. This means that input/output bandwidth, caching, and latency are shared across the entire system.
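To illustrate the symmetric model described above in the abstract, the toy sketch below spreads a file's blocks across identical nodes and gives every node the same layout map, so a request arriving at any node can be satisfied. This is a conceptual illustration only, not a description of OneFS internals, and all names are hypothetical.

    # Conceptual toy of a symmetric cluster: identical nodes, and every node knows
    # the full file layout, so a request arriving at any node can be satisfied.
    # Illustrative only; this is not a description of OneFS internals.

    class ToyCluster:
        def __init__(self, node_count):
            self.nodes = {n: {} for n in range(node_count)}  # node id -> {(file, idx): block}
            self.layout = {}                                 # (file, idx) -> node id, known to all nodes

        def write(self, name, data, block_size=4):
            blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
            for idx, block in enumerate(blocks):
                node = idx % len(self.nodes)                 # naive round-robin placement
                self.nodes[node][(name, idx)] = block
                self.layout[(name, idx)] = node

        def read(self, name, via_node):
            # via_node is only the entry point; the shared layout lets it fetch every block.
            chunks, idx = [], 0
            while (name, idx) in self.layout:
                chunks.append(self.nodes[self.layout[(name, idx)]][(name, idx)])
                idx += 1
            return b"".join(chunks)

    cluster = ToyCluster(node_count=3)
    cluster.write("subject_042.nii", b"0123456789abcdef")
    assert cluster.read("subject_042.nii", via_node=0) == cluster.read("subject_042.nii", via_node=2)
    print("any node returns the same file")

Because placement knowledge lives in every node rather than in a dedicated metadata controller, there is no single server to saturate or lose.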
Performance scales smoothly as nodes are added, in contrast to gateway-based master-slave architectures, where the gateway inevitably becomes a bottleneck. Because all connections go to the same virtual address, adding or removing nodes requires no client-side reconfiguration. Users simply continue working with the same data volume, but with more capacity behind it.

OneFS manages all aspects of the system, including detecting and optimizing new nodes, which makes configuration and expansion remarkably simple. All the complexity usually associated with adding new storage is an "under the hood" activity handled by the operating system. Administrators add new nodes to an existing cluster by connecting the network and power cables and powering on. The nodes detect the existing cluster and make themselves available to it, and OneFS then smoothly re-balances data across all the nodes. During this process, all file system mount points remain available (a conceptual sketch of this kind of re-balancing appears at the end of this section). Compared with the usual headache of taking an entire data storage system down for days at a time to migrate existing data off to tape, build a new, larger system, and re-import the data, this is a remarkable benefit.

Isilon nodes of different ages and sizes can be intermixed, which eliminates one of the major risks of investing in an integrated storage system. Computing hardware doubles in capacity and speed on a regular basis, so any system intended for use over more than a year or two must assume that expansions may consist of substantially different hardware than the original system. Storage architectures that cannot smoothly support heterogeneous hardware frequently require a "forklift" upgrade, in which the old system is decommissioned to make space for a new one. Such organizations are then in a constant state of flux, migrating data, with the associated disruption to both user access and automated pipelines. With an Isilon system, by contrast, older nodes may remain in service as long as their hardware is functional. Because data management and re-balancing are handled by OneFS, when it is time to retire a storage node administrators simply instruct the system to migrate the data off of that node, wait for the migration to complete, and then power it off and remove it from the cluster. During the migration, all data remains available for user access.

OneFS Operating System
To address the scalable distributed file system requirement of clustered storage, Isilon built, patented, and delivered the revolutionary OneFS operating system, which combines three layers of traditional storage architectures (the file system, volume manager, and RAID controller) into a unified software layer. This creates a single intelligent, fully symmetric file system that spans all nodes within a cluster. File striping takes place across multiple storage nodes rather than across individual disks within a volume or RAID array. This provides a very specific benefit: no single node is 100% responsible for any particular file. An Isilon IQ system can therefore withstand the loss of multiple disks or entire nodes without losing access to any data. OneFS provides each node with knowledge of the entire file system layout, so accessing any node gives a user access to all content in one unified namespace: there are no volumes or shares, no inflexible volume size limits, no downtime for reconfiguration or expansion of storage, and no multiple network drives to manage. This seamless integration of a fully symmetric clustered architecture makes Isilon unique in the storage landscape.
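As referenced earlier in this section, the sketch below shows, in purely conceptual terms, what re-balancing after a node joins a pool looks like: block placement changes, but the namespace that clients see does not. This is illustrative only and is not OneFS's actual placement or re-balancing algorithm; all names and counts are hypothetical.

    # Conceptual sketch of re-balancing when a new node joins an existing pool.
    # Illustrative only; this is not OneFS's actual placement or re-balancing algorithm.

    def rebalance(nodes):
        """Move blocks from the fullest node to the emptiest until counts are nearly even."""
        moved = 0
        while True:
            fullest = max(nodes, key=lambda n: len(nodes[n]))
            emptiest = min(nodes, key=lambda n: len(nodes[n]))
            if len(nodes[fullest]) - len(nodes[emptiest]) <= 1:
                return moved
            nodes[emptiest].append(nodes[fullest].pop())
            moved += 1

    # Three existing nodes, each holding twelve (hypothetical) data blocks.
    nodes = {n: [f"block-{n}-{i}" for i in range(12)] for n in range(3)}

    nodes[3] = []                    # a fourth node is cabled up and joins the pool
    moved = rebalance(nodes)

    print("blocks moved during re-balance:", moved)
    print("blocks per node:", {n: len(b) for n, b in nodes.items()})
    # Only block placement changes; the namespace that clients mount is untouched,
    # so no client-side reconfiguration or downtime is needed.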
Inherent High Availability & Reliability
Non-symmetric data storage architectures contain intrinsic dependencies that create points of failure and bottlenecks. One way to ensure data integrity and eliminate single points of failure is to make all nodes in a cluster peers. Each node in an Isilon cluster can handle a request from any client and can provide any content; if any particular node fails, the other nodes dynamically fill in. This is true for all of the file system's tasks. Because there is no dedicated "head" to the system, all aspects of system performance can be balanced across the entire cluster: I/O, availability, storage capacity, and resilience to hardware failure are all distributed.
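As a purely illustrative sketch, and not Isilon's client or failover protocol, the snippet below shows the basic pattern that a peer architecture enables: because any surviving node can serve any request, riding through a node failure is a matter of retrying against another peer. All class and path names are hypothetical.

    # Illustrative peer-failover pattern: any surviving peer can serve any request.
    # Hypothetical names; this is not Isilon's client or failover protocol.

    class Node:
        def __init__(self, name, healthy=True):
            self.name, self.healthy = name, healthy

        def serve(self, path):
            if not self.healthy:
                raise ConnectionError(f"{self.name} is down")
            return f"{path} served by {self.name}"   # every peer can return every file

    def read_with_failover(nodes, path):
        for node in nodes:
            try:
                return node.serve(path)              # another peer simply fills in
            except ConnectionError:
                continue
        raise RuntimeError("no nodes available")

    cluster = [Node("node-1", healthy=False), Node("node-2"), Node("node-3")]
    print(read_with_failover(cluster, "/projects/fmri/subject_042.nii"))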
OneFS further increases the availability of Isilon's clustered storage solution by providing multi-failure support of n+4. Simply stated, this means an Isilon cluster can withstand the simultaneous loss of four disks or four entire nodes without losing access to any content, and without requiring dedicated parity drives. Additionally, self-healing capabilities and high levels of hardware redundancy greatly reduce the chances of a production node failing in the first place.

In the event of a failure, OneFS automatically rebuilds files in parallel across the distributed free space in the cluster, eliminating the need for dedicated "hot" spare drives. Because OneFS leverages all available free space across all nodes to rebuild data, the window of vulnerability during the rebuild process is minimized. Isilon IQ also constantly monitors the health of all files and disks, maintaining records of the SMART statistics (e.g. recoverable read errors) available on each drive in order to anticipate when that drive will fail. This is simply an automation of a "best practice" that many system administrators wish they had time to perform. When Isilon IQ identifies an at-risk component, it preemptively migrates data off of the at-risk disk to available free space on the cluster, automatically and transparently to the user. Once the data is migrated, the administrator is notified to service the suspect drive in advance of an actual failure. This gives customers confidence that data written today will be stored reliably and be available whenever it is needed. No other solution today provides this level of data protection and reliability.

Single Level of Management
An Isilon IQ cluster creates a single, shared pool of all content, providing one point of access for users and one point of management for administrators, with a single file system or volume of up to 1.6 petabytes. Users can connect to any storage node and securely access all of the content within the entire cluster. This means there is a single point for all applications to connect to, and every application has visibility and access to every file in the entire file system (subject, of course, to security and permission policies).

Linear Scalability in Performance & Capacity
One of the key benefits of an Isilon IQ cluster is the ease with which administrators can add both performance and capacity without downtime or application changes. System administrators simply insert a new Isilon IQ storage node, connect the network cables, and power it up. The cluster automatically detects the newly added node and begins to configure it as a member of the cluster. In less than 60 seconds, an administrator can grow a single file system by 2 to 12 terabytes and increase throughput by an additional 2 gigabits per second. Isilon's modular approach offers a building-block ("pay-as-you-grow") solution, so customers are not forced to buy more storage capacity than they need up front. The modular design also enables customers to incorporate new technologies into the same cluster, such as adding a node with higher-density disk drives, more CPU horsepower, or higher total performance.
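As a rough illustration of the building-block model, the sketch below tabulates how raw capacity and aggregate throughput grow as nodes are added, using the per-node ranges cited above (2 to 12 terabytes and 2 gigabits per second). Actual figures vary by node model, and protection overhead (roughly m/(n+m) of raw space for an n+m protection level) is ignored, so treat the output as indicative only.

    # Rough "pay-as-you-grow" arithmetic using the per-node figures cited in this paper.
    # Actual numbers vary by node model; protection overhead is ignored for simplicity.

    def cluster_totals(node_count, tb_per_node, gbps_per_node=2):
        return node_count * tb_per_node, node_count * gbps_per_node

    print(f"{'nodes':>5} {'raw TB':>7} {'Gb/s':>5}")
    for node_count in (3, 6, 12, 24):
        tb, gbps = cluster_totals(node_count, tb_per_node=12)  # assuming the largest node option
        print(f"{node_count:>5} {tb:>7} {gbps:>5}")

    # Each added node contributes disks *and* network/CPU resources, which is why capacity
    # and throughput grow together instead of saturating a single gateway or filer head.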
Conclusion
We have described the data storage needs of Life Science researchers, summarized the major data storage architectures currently in use, and presented the Isilon IQ product as a strong and flexible solution to those needs.

Data storage is a critical component of modern scientific research. As smaller labs and individual researchers become responsible for terabytes and petabytes of data, understanding the options and trade-offs will become ever more important.