Storage for Science

Methods for Managing Large and Rapidly Growing Data Stores
in Life Science Research Environments


An Isilon® Systems Whitepaper

August 2008




Prepared by:
Table of Contents


Introduction

Requirements for Science

   “Large” Capacity

   Accelerating Growth

   Variable File Types and Operations

   Shared Read/Write Access

   Ease of Use

Understanding the Alternatives

   Common Feature Trade-offs

   Direct Attached Storage (DAS)

   Storage Area Network (SAN)

   Network Attached Storage (NAS)

   Asymmetric Clustered Storage

   Symmetric Clustered Storage

Isilon Clustered Storage Solution

   OneFS Operating System

   Inherent High Availability & Reliability

   Single Level of Management

   Linear Scalability in Performance & Capacity

Conclusion




Introduction

This document is intended to inform the Life Science researcher with large and rapidly growing
data storage needs. We explore many of the storage requirements common to Life Science
research and explain the evolution of modern storage architectures from local disks through
symmetric clustered storage. Finally, we present Isilon’s IQ clustered storage solution in detail.



Requirements for Science

“Large” Capacity
Many branches of Life Science research involve the generation, accumulation, analysis, and
distribution of “large” amounts of data. What is considered “large” changes rapidly as data
generation increases through advances in scientific methods and instrumentation. These
advances are offset by capacity increases in storage technologies that are undergoing their own
rapid evolution. Presently, Neuro-Imaging and Next-Generation Sequencing are branches of
science churning out massive amounts of data that push the limits of “large”. We will explore
these two specific examples in further detail.

Neuro-Imaging
A common Neuro-Imaging experiment involves fMRI (Functional Magnetic Resonance Imaging)
to determine activated regions of the brain in response to a stimulus. This “brain mapping” is
achieved by observing increased blood flow to the activated areas of the brain using an fMRI
scanner. The scanning of a single human test subject might occur over a 60 to 90 minute period,
with hundreds of discrete scans every few seconds, generating as much as 1GB of data per
subject. A single instrument operating at only 50% capacity can produce many terabytes (1,000s
of GBs) of data per year. The Neuro-Imaging centers interviewed for this paper utilize up to ten
instruments, supporting dozens of scientists, each allocated a baseline of 2TB of disk space for
their ongoing experiments. While this rapid scaling is a significant challenge for many labs, data
growth of 10 to 20 TB per year is not unusual in these environments.
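
To make these figures concrete, the short Python sketch below projects total capacity for such a center over a multi-year horizon. The per-scientist baseline and annual growth rate come from the figures above; the number of scientists and the planning horizon are illustrative assumptions.

    # Capacity projection for a Neuro-Imaging center (a sketch only).
    # Per-scientist baseline and annual growth follow the figures quoted above;
    # the number of scientists and the planning horizon are assumptions.

    BASELINE_PER_SCIENTIST_TB = 2       # baseline allocation per scientist
    NUM_SCIENTISTS = 36                 # "dozens of scientists" (assumed)
    ANNUAL_GROWTH_TB = (10, 20)         # reported growth of 10-20 TB per year
    YEARS = 5                           # planning horizon (assumed)

    baseline_tb = BASELINE_PER_SCIENTIST_TB * NUM_SCIENTISTS
    low_tb = baseline_tb + ANNUAL_GROWTH_TB[0] * YEARS
    high_tb = baseline_tb + ANNUAL_GROWTH_TB[1] * YEARS

    print(f"Baseline allocation: {baseline_tb} TB")
    print(f"Projected need after {YEARS} years: {low_tb}-{high_tb} TB")

Even with conservative assumptions, the baseline allocation alone reaches tens of terabytes, and the projected need roughly doubles again over the planning horizon.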

“Next-Generation” DNA Sequencing
DNA sequencing has undergone a revolution in recent years. Driven by novel sequencing
chemistries, micro-fluidic systems, and reaction detection methods, “Next-Generation”
sequencing instruments from 454, Illumina, ABI, and Helicos offer 100- to 1,000-fold higher
throughput, combined with a 100- to 1,000-fold lower cost per nucleotide, when compared with
conventional Sanger sequencing. This change has put high-throughput genome
sequencing, once achievable by only a few major sequencing centers, within reach of many
smaller research groups and individual research labs. The result for such labs is a dramatic
increase in storage requirements from gigabytes to petabytes (1 million GB) in only the course of
a couple of years.

Each Next-Generation sequencing platform is unique in terms of the nature and volume of the
data it generates. Typically, anywhere from 600GB (gigabytes) to 6TB (terabytes) of primary
image data is written over a period of one to three days. By today’s standards, a terabyte is not
large. However, for a single laboratory, accumulating and moving terabytes of data per day
without loss can be a significant challenge, especially for small sequencing labs that have not yet
adopted a highly scalable storage solution.
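
A quick transfer-time estimate, sketched below in Python, illustrates why moving even a single run's output is non-trivial. The run sizes follow the 600 GB to 6 TB range above; the network speeds and the protocol-efficiency factor are assumptions.

    # Hours needed to move one sequencing run over a lab network (a sketch).
    # Run sizes follow the 600 GB - 6 TB range above; link speeds and the
    # protocol-efficiency factor are assumptions.

    def transfer_hours(data_tb, link_gbps, efficiency=0.6):
        """Hours to move data_tb terabytes over a link_gbps link."""
        bits = data_tb * 1e12 * 8                     # terabytes -> bits
        seconds = bits / (link_gbps * 1e9 * efficiency)
        return seconds / 3600

    for run_tb in (0.6, 6.0):
        for link_gbps in (1, 10):                     # 1 GbE vs. 10 GbE
            hours = transfer_hours(run_tb, link_gbps)
            print(f"{run_tb:>4} TB over {link_gbps:>2} GbE: {hours:5.1f} hours")

At gigabit speeds, a full 6 TB run occupies the link for the better part of a day before any retries or verification, which is why instrument staging storage and the central repository must be planned together.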




Accelerating Growth
Storage capacity planning for Life Science research is particularly difficult in that requirements
change rapidly and at irregular rates. Planning for growth according to the number of users or
number of instruments is often insufficient when, for instance, a new grant can double capacity
needs. Similarly, a revolutionary new instrument might increase data production by an order of
magnitude or more. To be responsive to the requirements of Life Science research, an ideal storage
architecture must be scalable in both small and large increments without requiring a system
redesign or replacement. Ideally, a storage solution should have “pay-as-you-grow”
characteristics that allow for growth as-needed.

Variable File Types and Operations
Life Science data is highly variable, both in composition and in the way that it is accessed.
Therefore, an ideal storage system for Life Science organizations must have good I/O
performance across these varied use cases:

    -   Many small files or fewer big files
    -   Text files and binary files
    -   Sequential and random access
    -   Highly concurrent access

This variability is common to both neuro-imaging and next-generation sequencing. Massive
simultaneous computations are performed on many large primary image files, each gigabytes in
size and requiring highly parallel streaming I/O, and they produce fewer, smaller text files. The
resulting data might be kept within directories containing thousands to hundreds of thousands of
files, totaling many terabytes.
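
One practical way to see which of these cases dominates an existing data set is to profile it directly. The Python sketch below walks a directory tree and reports how many files fall under a small-file threshold; the root path and the 1 MB threshold are placeholder assumptions.

    # Profile file count and size distribution under a directory tree (a sketch).
    # The root path and the 1 MB "small file" threshold are placeholders.
    import os

    ROOT = "/data/experiments"           # hypothetical path
    SMALL_FILE_BYTES = 1 << 20           # count files under 1 MB as "small"

    file_count = small_count = total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(ROOT):
        for name in filenames:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue                 # skip files removed mid-scan
            file_count += 1
            total_bytes += size
            if size < SMALL_FILE_BYTES:
                small_count += 1

    print(f"{file_count} files, {total_bytes / 1e12:.2f} TB total")
    if file_count:
        print(f"{100 * small_count / file_count:.0f}% of files are under 1 MB")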

Shared Read/Write Access
Storage systems for Life Science data must be simultaneously accessible to many instruments,
users, analysis computers, and data servers. These storage systems cannot reside in isolated
silos with limited accessibility. They must, instead, permit concurrent, integrated, file-level
read/write access across the entire organization with I/O bandwidth that scales to accommodate
concurrent demand.

A typical Neuro-Imaging or Next-Generation sequencing workflow involves the following steps:

    -   Multiple instruments generate primary image data.
    -   Large memory SMP machines and compute clusters distill the primary data into a derived
        form.
    -   Researchers evaluate and annotate the data to answer scientific questions.
    -   Researchers iterate on the above process, adding more primary data and refining their
        analyses.
    -   Finally, results are served to a wider audience via internet repositories, usually accessed
        via FTP or HTTP.

The requirements of the workflow above are the sum of requirements from instruments,
researchers, computing systems, and customers. A sustainable storage plan for even a small
research organization requires a system with shared, file-level read/write access to a common,
large, scalable storage repository and should allow access by these common protocols:

    -   NFS (Network File System) – The common network file system for UNIX instruments and
        analysis computers
    -   SMB/CIFS (Server Message Block/Common Internet File System) – The common
        network file system for Windows-based instruments and user desktops



    -   HTTP (Hypertext Transfer Protocol) – The file transfer protocol used in the World Wide
        Web
    -   FTP (File Transfer Protocol) – A common internet file transfer protocol for disseminating
        data
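
As a small illustration of the last two protocols in the list above, the Python sketch below retrieves a published result set from an internet repository over HTTP and FTP using only the standard library. The host names and file paths are hypothetical placeholders.

    # Retrieve published results over HTTP and FTP (a sketch).
    # Host names and paths are hypothetical placeholders.
    import urllib.request
    from ftplib import FTP

    # HTTP download of a summary file
    url = "http://data.example.org/results/run42/summary.txt"
    with urllib.request.urlopen(url) as response, open("summary.txt", "wb") as out:
        out.write(response.read())

    # Anonymous FTP download of an archived result set
    with FTP("ftp.example.org") as ftp:
        ftp.login()                      # anonymous login
        with open("run42.tar.gz", "wb") as out:
            ftp.retrbinary("RETR /results/run42.tar.gz", out.write)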

Ease of Use
At many levels, ease of use is the most significant storage requirement for Life Science research,
even though it is generally the most difficult to quantify.

Management
The human resources required to maintain a large storage system range from just above zero
to many FTEs (full-time equivalents). The management of an ideal storage system should not
require the hiring of additional, dedicated IT staff.

Scaling
Scaling a storage system’s capacity and/or performance, whether by fractional amounts or by
orders of magnitude, should not require man-months of planning meetings or man-days of
specialized IT expertise to implement. Scaling an ideal storage system should take minutes,
regardless of the size of the increment.

User
Ideally, the researcher is focused on science, not computers or disks. The researcher shouldn’t
be concerned with or aware of volumes, capacity, formats, or how to access their data. Upon
scaling storage a user might notice that capacity suddenly increased, but never experience an
interruption in service.



Understanding the Alternatives

Common Feature Trade-offs
Like most products, storage solutions compete based on their features. An ideal storage solution
would excel at all features: have high I/O performance rates, never become inaccessible, never
lose data, have the ability to become infinitely large, be scalable in both large and small
increments, have a low purchase price, require little human effort to manage, and be easy to use.

In the real world, decisions are based on which of these requirements are most important within a
given budget.

Storage decisions typically reduce to four factors:

    -   Will this provide me sufficient performance and capacity for my present needs?
    -   Will I experience any significant down-time or data loss?
    -   Do I have the human resources needed to manage the system?
    -   How long will it be before I need to upgrade this system and at what cost?

When designing storage systems in a scientific research environment, many variables come into
play. Present capacity needs may be the easiest to quantify, but are only a starting point.
Performance requirements aren’t generally known until after the storage has been deployed and
workflows are executed against data. Data loss is known to be a very bad thing, but quantifying
the cost of loss is difficult when the core value to the lab might be a publication or a discovery.
Labor costs may be very indirect; the use of graduate students as part-time systems
administrators is a prime example. Students come and go, which can impose high additional
costs if storage systems are difficult to learn or require specialized training. Particularly in primary
research, very early in a product pipeline, it can be difficult to set dollar values on these factors.
However, such variables must be considered in order to make sensible storage infrastructure
decisions.

All storage systems have maximum scaling limits. With some, however, once this limit is reached
one must either (a) deploy a second, standalone system or (b) perform a “forklift upgrade” in which
major components are retired and replaced with bigger, newer ones.

In addition to the decision parameters above, some storage architectures can be tuned to
optimize certain facets of their performance. This means that they can be configured to excel at
one feature or another, but not all at the same time:

    -   I/O Performance – The speed of writing data to disk and reading it back to the CPU and
        user. This might be further broken down into transactional, sequential, and random
        access patterns.
    -   Availability – The cost associated with maintaining uninterrupted access to the data
    -   Reliability – The cost associated with mitigating the risk of data loss
    -   Maximum Scalability – The largest data volume the storage system can ever hold
    -   Dynamic Configuration – The cost in time and effort to make a change to the system
    -   Resolution of Scale – The smallest increment by which the storage can be made larger
    -   Purchase Price – The cost to purchase the system
    -   Total Cost of Ownership – The cost to buy and operate the system over its useful life
    -   Ease of Use – The cost in time and effort to get it working and keep it working


Direct Attached Storage (DAS)




Discrete Disk
The local disk method is the most familiar means of providing storage, either directly attached to
the motherboard of a system, or attached through USB, Firewire, SCSI, or Fibre Channel cables.
While it may seem odd, when discussing critical, enterprise-class storage systems, to consider
using discrete local disks, the explosion of data and the ubiquity of simple DAS-type storage mean
that labs with shelves full of Firewire drives serving as long-term archives are not uncommon.
These discrete disks can be purchased for as little as $200-$300 per TB and are fairly easy to
use. In the very short term, this may seem to be a reasonable
purchasing decision, coupled with the reagents necessary for an instrument run. However,
combining these disks into a larger storage pool is impossible. Accessing the needed data means
locating the correct disk and plugging it into the correct computer. This is not practical at scale,
even for very small environments. We describe this only as a baseline, and it should not be
considered a reasonable enterprise-class storage solution.




Redundant Array of Independent Disks (RAID)
RAID is a technology that combines two or more disks to achieve greater volume capacity,
performance, or reliability. A RAID might be directly attached to one computer as in DAS, or
indirectly by a SAN (Storage Area Network), or provide the underlying storage for a NAS
(Network Attached Storage). With RAID, data is striped across all of the disks to create a “RAID
set,” and depending on how data and parity are distributed across the individual disks, a RAID is
highly tunable:

        RAID 0: Maximum capacity, maximum risk
        RAID 1: Maximum read performance, minimum risk
        RAID 5: Balance capacity, performance, and risk
        RAID 6: Capacity and performance, with less risk than RAID 5
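
A simplified comparison of these levels by usable capacity and disk-failure tolerance is sketched below. It assumes a set of identical drives, treats RAID 1 as a straightforward two-way mirror, and ignores hot spares and controller overhead.

    # Usable capacity and failure tolerance of common RAID levels (simplified).
    # Assumes identical drives; RAID 1 is modeled as a two-way mirror.

    def raid_summary(level, n_disks, disk_tb):
        if level == 0:
            return n_disks * disk_tb, 0
        if level == 1:
            return n_disks * disk_tb / 2, 1
        if level == 5:
            return (n_disks - 1) * disk_tb, 1
        if level == 6:
            return (n_disks - 2) * disk_tb, 2
        raise ValueError("unsupported RAID level")

    for level in (0, 1, 5, 6):
        usable_tb, tolerated = raid_summary(level, n_disks=8, disk_tb=1.0)
        print(f"RAID {level}: {usable_tb:.1f} TB usable, "
              f"survives {tolerated} disk failure(s)")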


Storage Area Network (SAN)




A SAN is an indirect means of attaching disks to a computer. Rather than connecting directly to
the motherboard, a computer connects indirectly through a SAN (typically via Fibre Channel,
iSCSI, or a proprietary low-latency network). A SAN provides a convenient means to centralize the
management of many disks into a common storage pool and then permit the allocation of logical
portions of this pool out to one or more computers. This common storage pool can be expanded
by attaching additional disks to the SAN. SAN clients have block-level access (which appears as a
local disk) to “their” portion of the SAN; however, they have no access to portions allocated
to other computers.




Network Attached Storage (NAS)




DAS and SAN are methods for attaching disks to a computer, and RAID is a technology for
configuring many disks to achieve different I/O characteristics. A NAS is neither of these. A NAS
is a computer (file server) that contains (or is attached to) one or more local disks or RAIDs and
operates a file-sharing program that provides concurrent, file-level access to many computers
over a TCP/IP network. Therefore, the performance of a NAS is often limited by the number of
clients competing for access to the file server.

NAS/SAN Hybrid
Naturally, one can provide the files served by a NAS through disks provided over a SAN. A
NAS/SAN hybrid is a common approach to overcoming the performance limitations of a NAS by
distributing file server load over multiple file servers attached to a common SAN. Theoretically,
this provides a means to increase the number of file servers in response to the demand of
competing clients. This approach often fails when one of the file servers is serving a SAN volume
with more “popular” data on it. Responding to this failure requires careful monitoring of the
demand and moving/copying/synchronizing data across various SAN volumes. As capacity
grows, the management of a NAS/SAN Hybrid becomes more complex.


Asymmetric Clustered Storage




Asymmetric clustered storage is implemented by a software layer operating on top of a NAS/SAN
hybrid architecture. Storage from many SAN volumes is “clustered” and allocated across separate
computers, but the architecture is asymmetric in that nodes have specialized roles in providing the
service. Each NAS file server has concurrent access to all of the SAN volumes, so the manual
re-distribution of NAS servers and SAN volumes is no longer required. This concurrency is managed
through the addition of another type of computer, frequently called the “Meta Data Controller.”
This controller ensures that file servers take proper turns accessing the SAN volumes and files.



Symmetric Clustered Storage




Symmetric clustered storage provides a balanced architecture for scalable storage and
concurrent read/write access to a common network file system. It pools both the storage and the
file serving capabilities of many anonymous and identical systems. While individual “nodes” within
the cluster manage their directly attached disks, a software layer allows the nodes to work
together to present an integrated whole. This approach maximizes the ability to “scale-out”
storage by adding nodes. This is the approach offered by Isilon Systems, and it is described in
greater detail below.



Isilon Clustered Storage Solution

Life Science research presents an ever-changing, highly automated laboratory data environment.
This can be a major challenge for computing and storage vendors. Isilon IQ represents a flexible,
scalable system in which capacity can be added in parallel with new equipment or laboratory
capacity. Once the system is configured, scientists can reasonably expect to spend most of their
time doing science rather than constantly worrying about whether a new sequencing machine will
require a complete overhaul of the storage system. Isilon IQ’s biggest strengths are the ability to
scale capacity and performance linearly, together or independently; symmetric data access; and
ease of use.
Several customers shared that the staffing requirements for their data storage environment
dropped to near zero once they installed the Isilon system.

The Isilon storage product is a fully symmetric, file-based, clustered architecture system. The
hardware consists of industry standard x86 servers that arrive pre-configured with Isilon’s
patented OneFS operating system. These “nodes” connect to each other using a low-latency,
high-bandwidth InfiniBand network.

Filesystem clients access data over a pair of Ethernet ports on each node. Clients connect to a
single “virtual” network address and OneFS dynamically distributes the actual connections to the
nodes. This means that input/output bandwidth, caching, and latency are shared across the entire
system. Performance scales smoothly as nodes are added, in contrast to gateway-based
master-slave architectures where the gateway inevitably becomes a bottleneck. Because all connections
go to the same virtual address, adding or removing nodes from the system requires no client-side
reconfiguration. Users simply continue with the same data volume, but with more capacity.
OneFS manages all aspects of the system including detecting and optimizing new nodes. This
makes configuration and expansion incredibly simple. All the usual complexity associated with
adding new storage is an “under the hood” activity managed by the operating system.

Administrators add new nodes to an existing cluster by connecting the network and power cables
and then powering on. The nodes detect an existing cluster and make themselves available to it.
OneFS then smoothly re-balances data across all the nodes. During this process, all filesystem
mount points remain available. Compared with the usual headache of taking down an entire data
storage system for days at a time (to migrate existing data off to tapes, create a new larger
system, and re-import the data), this is a fairly incredible benefit.

Isilon nodes of different ages and sizes can be intermixed. This eliminates one of the major risks
of investing in an integrated storage system. Computing hardware doubles in capacity and speed
on a regular basis, meaning that any system intended for use over more than a year or two must
take into account that expansions may consist of substantially different hardware than the original
system. Storage architectures without the capability to smoothly support heterogeneous hardware
frequently require a “forklift” upgrade, in which the old system is decommissioned to make space
for a new one. Organizations are then in a constant state of flux, migrating data, with the
associated chaos in both user access and automated pipelines. With an Isilon system, by
contrast, older nodes may remain in service as long as their hardware is functional. Since data
management and re-balancing are managed by OneFS, when it is time to retire a storage node,
administrators simply instruct the system to migrate the data off of that node, wait for the
migration to complete, and then turn off and remove it from the cluster. During migration all data
is available for user access.

OneFS Operating System
To address the scalable distributed file system requirement of clustered storage, Isilon built,
patented, and delivered the revolutionary OneFS operating system, which combines three layers
of traditional storage architectures – the file system, volume manager and RAID controller – into a
unified software layer. This creates a single intelligent, fully symmetrical file system which spans
all nodes within a cluster.

File striping in the cluster takes place across multiple storage nodes versus the traditional method
of striping across individual disks within a volume/RAID array. This provides a very specific
benefit: no one node is 100% responsible for any particular file. An Isilon IQ system can
withstand the loss of multiple disks or entire nodes without losing access to any data. OneFS
provides each node with knowledge of the entire file system layout. Accessing any independent
node gives a user access to all content in one unified namespace, meaning that there are no
volumes or shares, no inflexible volume size limits, no downtime for reconfiguration or expansion
of storage and no multiple network drives to manage. This seamless integration of a fully
symmetric clustered architecture makes Isilon unique in the storage landscape.

Inherent High Availability & Reliability
Non-symmetric data storage architectures contain intrinsic dependencies and create points of
failure and bottlenecks. One way to ensure data integrity and eliminate single points of failure is
to make all nodes in a cluster peers. Each node in an Isilon cluster can handle a request from any
client, and can provide any content. If any particular node fails, the other nodes dynamically fill in.
This is true for all the tasks of the filesystem. Because there is no dedicated “head” to the system,
all aspects of system performance can be balanced across the entire system. This means that
I/O, availability, storage capacity, and resilience to hardware failure are all distributed.

OneFS has further increased the availability of Isilon’s clustered storage solution by providing
multi-failure support of n+4. Simply stated, this means an Isilon cluster can withstand the
simultaneous loss of an unprecedented four disks or four entire nodes without losing access to
any content – and without requiring dedicated parity drives. Additionally, self-healing capabilities
and high levels of hardware redundancy greatly reduce the chances of a production node failing
in the first place.
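
The trade-off between protection level and usable space can be illustrated with a simplified model in which each stripe carries four error-correcting units spread across the nodes. The stripe widths below are illustrative, and the actual OneFS data layout differs in detail.

    # Simplified "+4" protection model: each stripe holds one unit per node,
    # four of which are error-correcting. Stripe widths are illustrative;
    # the actual OneFS data layout differs in detail.

    def usable_fraction(total_units, protection_units=4):
        return (total_units - protection_units) / total_units

    for nodes in (6, 9, 12, 16):
        frac = usable_fraction(nodes)
        print(f"{nodes:2d} nodes: {frac:.0%} usable, "
              f"tolerates 4 simultaneous disk/node failures")

As the sketch shows, the relative overhead of the four protection units shrinks as the cluster grows, which is one reason distributed protection scales better than dedicated parity drives.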

In the event of a failure, OneFS automatically re-builds files across all of the existing distributed
free space in the cluster in parallel, eliminating the need for dedicated “hot” spare drives. OneFS
leverages all available free space across all nodes in the cluster to rebuild data. This minimizes
the window of vulnerability during the rebuild process.

Isilon IQ constantly monitors the health of all files and disks and maintains records of the SMART
statistics (e.g., recoverable read errors) available on each drive in order to anticipate when that
drive will fail. This is simply an automation of a “best practice” that many system administrators
wish they had time to do. When Isilon IQ identifies an at-risk component, it preemptively migrates
data off of the at-risk disk to available free space on the cluster in a manner that is both automatic
and transparent to the user. Once the data is migrated, the administrator is notified to service the
suspect drive in advance of actual failure. This feature provides customers with confidence that
data written today will be stored 100 percent reliably, and available whenever it is needed. No
other solution today provides this level of data protection and reliability.
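
On ordinary servers, the monitoring practice described above can be approximated with the smartmontools package. The sketch below shells out to smartctl and flags drives that report reallocated sectors; the device list and alert threshold are assumptions, and this is not a description of Isilon's implementation.

    # Poll SMART attributes with smartctl (smartmontools) and flag suspect
    # drives. Device names and the alert threshold are assumptions; this is
    # only an approximation of the practice described above.
    import subprocess

    DEVICES = ["/dev/sda", "/dev/sdb"]       # hypothetical device names
    REALLOCATED_THRESHOLD = 1                # any reallocated sector is suspect

    for dev in DEVICES:
        result = subprocess.run(["smartctl", "-A", dev],
                                capture_output=True, text=True)
        for line in result.stdout.splitlines():
            if "Reallocated_Sector_Ct" in line:
                raw_value = int(line.split()[-1])
                if raw_value >= REALLOCATED_THRESHOLD:
                    print(f"WARNING: {dev} reports {raw_value} reallocated sectors")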

Single Level of Management
An Isilon IQ cluster creates a single, shared pool of all content, providing one point of access for
users and one point of management for administrators, with a single file system or volume of up
to 1.6 petabytes. Users can connect to any storage node and securely access all of the content
within the entire cluster. This means there is a single point for all applications to connect to and that
every application has visibility and access to every file in the entire file system (per security and
permission policies, of course).

Linear Scalability in Performance & Capacity
One of the key benefits of an Isilon IQ cluster is the ease with which it allows administrators to
add both performance and capacity without downtime or application changes. System
administrators simply insert a new Isilon IQ storage node, connect the network cables and power
up. The cluster automatically detects the newly added storage node and begins to configure it as
a member of the cluster. In less than 60 seconds, an administrator can grow a single file system
by 2 to 12 terabytes and increase throughput by an additional 2 gigabits per second.
Isilon’s modular approach offers a building block (or “pay-as-you-grow”) solution so customers
aren’t forced to buy more storage capacity than is needed up front. The modular design of an
Isilon cluster also enables customers to incorporate new technologies in the same cluster, such
as adding a node with higher-density disk drives, more CPU horsepower or more total
performance.
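
As a rough planning aid, the sketch below estimates how many nodes a target capacity implies and the aggregate throughput that comes with them, using the per-node figures quoted above. The target capacity and the per-node size chosen here are assumptions.

    # Nodes required for a target capacity, and the resulting aggregate
    # throughput, using the per-node figures above (2-12 TB, ~2 Gb/s each).
    # The target capacity and chosen per-node size are assumptions.
    import math

    TARGET_TB = 200                    # desired usable capacity (assumed)
    TB_PER_NODE = 6                    # mid-range node size (assumed)
    GBPS_PER_NODE = 2                  # added throughput per node

    nodes = math.ceil(TARGET_TB / TB_PER_NODE)
    print(f"{nodes} nodes -> {nodes * TB_PER_NODE} TB capacity, "
          f"~{nodes * GBPS_PER_NODE} Gb/s aggregate throughput")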



Conclusion

We have described the data storage needs of Life Science researchers, summarized the major
data storage architectures currently in use, and presented the Isilon IQ product as a strong and
flexible solution to those needs. Data storage is a critical component of modern scientific
research. As smaller labs and individual researchers become responsible for terabytes and
petabytes of data, understanding the options and trade-offs will become ever more critical.



Weitere ähnliche Inhalte

Was ist angesagt?

HPC lab projects
HPC lab projectsHPC lab projects
HPC lab projectsJason Riedy
 
Where is the opportunity for libraries in the collaborative data infrastructure?
Where is the opportunity for libraries in the collaborative data infrastructure?Where is the opportunity for libraries in the collaborative data infrastructure?
Where is the opportunity for libraries in the collaborative data infrastructure?LIBER Europe
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsYasin Memari
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...Bonnie Hurwitz
 
Software Defined storage
Software Defined storageSoftware Defined storage
Software Defined storageKirillos Akram
 
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...Ceph Community
 
Introducing Lattus Object Storage
Introducing Lattus Object StorageIntroducing Lattus Object Storage
Introducing Lattus Object StorageQuantum
 
Drive new initiatives with a powerful Dell EMC, Nutanix, and Toshiba solution...
Drive new initiatives with a powerful Dell EMC, Nutanix, and Toshiba solution...Drive new initiatives with a powerful Dell EMC, Nutanix, and Toshiba solution...
Drive new initiatives with a powerful Dell EMC, Nutanix, and Toshiba solution...Principled Technologies
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiUnmesh Baile
 
Distributed File Systems
Distributed File SystemsDistributed File Systems
Distributed File SystemsManish Chopra
 

Was ist angesagt? (12)

HPC lab projects
HPC lab projectsHPC lab projects
HPC lab projects
 
Where is the opportunity for libraries in the collaborative data infrastructure?
Where is the opportunity for libraries in the collaborative data infrastructure?Where is the opportunity for libraries in the collaborative data infrastructure?
Where is the opportunity for libraries in the collaborative data infrastructure?
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data Genomics
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 
Software Defined storage
Software Defined storageSoftware Defined storage
Software Defined storage
 
Whither Small Data?
Whither Small Data?Whither Small Data?
Whither Small Data?
 
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...
Best Practices with Ceph as Distributed, Intelligent, Unified Cloud Storage -...
 
Introducing Lattus Object Storage
Introducing Lattus Object StorageIntroducing Lattus Object Storage
Introducing Lattus Object Storage
 
Drive new initiatives with a powerful Dell EMC, Nutanix, and Toshiba solution...
Drive new initiatives with a powerful Dell EMC, Nutanix, and Toshiba solution...Drive new initiatives with a powerful Dell EMC, Nutanix, and Toshiba solution...
Drive new initiatives with a powerful Dell EMC, Nutanix, and Toshiba solution...
 
635 642
635 642635 642
635 642
 
Hadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbaiHadoop-professional-software-development-course-in-mumbai
Hadoop-professional-software-development-course-in-mumbai
 
Distributed File Systems
Distributed File SystemsDistributed File Systems
Distributed File Systems
 

Andere mochten auch

Internet i els drets fonamentals
Internet i els drets fonamentalsInternet i els drets fonamentals
Internet i els drets fonamentalsGrup8
 
Photo Album Important Events
Photo Album Important EventsPhoto Album Important Events
Photo Album Important EventsBhadresh Shah
 
Addison Whitney Healthcare Group
Addison Whitney Healthcare GroupAddison Whitney Healthcare Group
Addison Whitney Healthcare GroupJonathan Hall
 
Katekistak 2012 GARIZUMA
Katekistak 2012 GARIZUMA Katekistak 2012 GARIZUMA
Katekistak 2012 GARIZUMA pablokueto
 
共融觀點探討銀髮與孩童互動問題-資料蒐集
共融觀點探討銀髮與孩童互動問題-資料蒐集共融觀點探討銀髮與孩童互動問題-資料蒐集
共融觀點探討銀髮與孩童互動問題-資料蒐集tangerineqq
 
Media And Entertainment Whitepaper 090308
Media And Entertainment Whitepaper 090308Media And Entertainment Whitepaper 090308
Media And Entertainment Whitepaper 090308sydcarr
 
Dsmith soc5025
Dsmith soc5025Dsmith soc5025
Dsmith soc5025desmythe
 
American And Prov Cert
American And Prov CertAmerican And Prov Cert
American And Prov CertBhadresh Shah
 
Cuaresma 2012 - catequistas
Cuaresma 2012 - catequistasCuaresma 2012 - catequistas
Cuaresma 2012 - catequistaspablokueto
 

Andere mochten auch (10)

Internet i els drets fonamentals
Internet i els drets fonamentalsInternet i els drets fonamentals
Internet i els drets fonamentals
 
Bioloid Simulator
Bioloid SimulatorBioloid Simulator
Bioloid Simulator
 
Photo Album Important Events
Photo Album Important EventsPhoto Album Important Events
Photo Album Important Events
 
Addison Whitney Healthcare Group
Addison Whitney Healthcare GroupAddison Whitney Healthcare Group
Addison Whitney Healthcare Group
 
Katekistak 2012 GARIZUMA
Katekistak 2012 GARIZUMA Katekistak 2012 GARIZUMA
Katekistak 2012 GARIZUMA
 
共融觀點探討銀髮與孩童互動問題-資料蒐集
共融觀點探討銀髮與孩童互動問題-資料蒐集共融觀點探討銀髮與孩童互動問題-資料蒐集
共融觀點探討銀髮與孩童互動問題-資料蒐集
 
Media And Entertainment Whitepaper 090308
Media And Entertainment Whitepaper 090308Media And Entertainment Whitepaper 090308
Media And Entertainment Whitepaper 090308
 
Dsmith soc5025
Dsmith soc5025Dsmith soc5025
Dsmith soc5025
 
American And Prov Cert
American And Prov CertAmerican And Prov Cert
American And Prov Cert
 
Cuaresma 2012 - catequistas
Cuaresma 2012 - catequistasCuaresma 2012 - catequistas
Cuaresma 2012 - catequistas
 

Ähnlich wie Managing Rapidly Growing Science Data

Research and technology explosion in scale-out storage
Research and technology explosion in scale-out storageResearch and technology explosion in scale-out storage
Research and technology explosion in scale-out storageJeff Spencer
 
Vargas polyglot-persistence-cloud-edbt
Vargas polyglot-persistence-cloud-edbtVargas polyglot-persistence-cloud-edbt
Vargas polyglot-persistence-cloud-edbtGenoveva Vargas-Solar
 
Hitachi high-performance-accelerates-life-sciences-research
Hitachi high-performance-accelerates-life-sciences-researchHitachi high-performance-accelerates-life-sciences-research
Hitachi high-performance-accelerates-life-sciences-researchHitachi Vantara
 
Accelerate Discovery
Accelerate DiscoveryAccelerate Discovery
Accelerate DiscoveryPanasas
 
spectrum Storage Whitepaper
spectrum Storage Whitepaperspectrum Storage Whitepaper
spectrum Storage WhitepaperCarina Kordan
 
Chip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochureChip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochureMarco van der Hart
 
Intel life sciences_personalizedmedicine_stanford biomed 052214 dist
Intel life sciences_personalizedmedicine_stanford biomed 052214 distIntel life sciences_personalizedmedicine_stanford biomed 052214 dist
Intel life sciences_personalizedmedicine_stanford biomed 052214 distKetan Paranjape
 
Platform Overview Brochure
Platform Overview BrochurePlatform Overview Brochure
Platform Overview Brochuredstam1
 
National Institutes of Health Maximize Computing Resources with Panasas
National Institutes of Health Maximize Computing Resources with PanasasNational Institutes of Health Maximize Computing Resources with Panasas
National Institutes of Health Maximize Computing Resources with PanasasPanasas
 
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothAdaryl "Bob" Wakefield, MBA
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...dbpublications
 
Data-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and CloudData-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and CloudOla Spjuth
 
Workload Centric Scale-Out Storage for Next Generation Datacenter
Workload Centric Scale-Out Storage for Next Generation DatacenterWorkload Centric Scale-Out Storage for Next Generation Datacenter
Workload Centric Scale-Out Storage for Next Generation DatacenterCloudian
 
Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016Alluxio, Inc.
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencingGuy Coates
 
Net App Unified Storage Architecture
Net App Unified Storage ArchitectureNet App Unified Storage Architecture
Net App Unified Storage ArchitectureSamantha_Roehl
 
Net App Unified Storage Architecture
Net App Unified Storage ArchitectureNet App Unified Storage Architecture
Net App Unified Storage Architecturenburgett
 
Genomics Center Compares 100s of Computations Simultaneously with Panasas
Genomics Center Compares 100s of Computations Simultaneously with PanasasGenomics Center Compares 100s of Computations Simultaneously with Panasas
Genomics Center Compares 100s of Computations Simultaneously with PanasasPanasas
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational ScienceChelle Gentemann
 
Lecture 24
Lecture 24Lecture 24
Lecture 24Shani729
 

Ähnlich wie Managing Rapidly Growing Science Data (20)

Research and technology explosion in scale-out storage
Research and technology explosion in scale-out storageResearch and technology explosion in scale-out storage
Research and technology explosion in scale-out storage
 
Vargas polyglot-persistence-cloud-edbt
Vargas polyglot-persistence-cloud-edbtVargas polyglot-persistence-cloud-edbt
Vargas polyglot-persistence-cloud-edbt
 
Hitachi high-performance-accelerates-life-sciences-research
Hitachi high-performance-accelerates-life-sciences-researchHitachi high-performance-accelerates-life-sciences-research
Hitachi high-performance-accelerates-life-sciences-research
 
Accelerate Discovery
Accelerate DiscoveryAccelerate Discovery
Accelerate Discovery
 
spectrum Storage Whitepaper
spectrum Storage Whitepaperspectrum Storage Whitepaper
spectrum Storage Whitepaper
 
Chip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochureChip ICT | Hgst storage brochure
Chip ICT | Hgst storage brochure
 
Intel life sciences_personalizedmedicine_stanford biomed 052214 dist
Intel life sciences_personalizedmedicine_stanford biomed 052214 distIntel life sciences_personalizedmedicine_stanford biomed 052214 dist
Intel life sciences_personalizedmedicine_stanford biomed 052214 dist
 
Platform Overview Brochure
Platform Overview BrochurePlatform Overview Brochure
Platform Overview Brochure
 
National Institutes of Health Maximize Computing Resources with Panasas
National Institutes of Health Maximize Computing Resources with PanasasNational Institutes of Health Maximize Computing Resources with Panasas
National Institutes of Health Maximize Computing Resources with Panasas
 
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
 
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
Secure and Efficient Client and Server Side Data Deduplication to Reduce Stor...
 
Data-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and CloudData-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and Cloud
 
Workload Centric Scale-Out Storage for Next Generation Datacenter
Workload Centric Scale-Out Storage for Next Generation DatacenterWorkload Centric Scale-Out Storage for Next Generation Datacenter
Workload Centric Scale-Out Storage for Next Generation Datacenter
 
Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencing
 
Net App Unified Storage Architecture
Net App Unified Storage ArchitectureNet App Unified Storage Architecture
Net App Unified Storage Architecture
 
Net App Unified Storage Architecture
Net App Unified Storage ArchitectureNet App Unified Storage Architecture
Net App Unified Storage Architecture
 
Genomics Center Compares 100s of Computations Simultaneously with Panasas
Genomics Center Compares 100s of Computations Simultaneously with PanasasGenomics Center Compares 100s of Computations Simultaneously with Panasas
Genomics Center Compares 100s of Computations Simultaneously with Panasas
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
 
Lecture 24
Lecture 24Lecture 24
Lecture 24
 

Managing Rapidly Growing Science Data

  • 1. Storage for Science Methods for Managing Large and Rapidly Growing Data Stores in Life Science Research Environments An Isilon® Systems Whitepaper August 2008 Prepared by:
  • 2. Table of Contents Introduction 3 Requirements for Science 3 “Large” Capacity 3 Accelerating Growth 4 Variable File Types and Operations 4 Shared Read/Write Access 4 Ease of Use 5 Understanding the Alternatives 5 Common Feature Trade-offs 5 Direct Attached Storage (DAS) 6 Storage Area Network (SAN) 7 Network Attached Storage (NAS) 8 Asymmetric Clustered Storage 8 Symmetric Clustered Storage 9 Isilon Clustered Storage Solution 9 OneFS Operating System 10 Inherent High Availability & Reliability 10 Single Level of Management 11 Linear Scalability in Performance & Capacity 11 Conclusion 11 ISILON SYSTEMS 2
  • 3. Introduction This document is intended to inform the Life Science researcher with large and rapidly growing data storage needs. We explore many of the storage requirements common to Life Science research and explain the evolution of modern storage architectures from local disks through symmetric clustered storage. Finally, we present Isilon’s IQ clustered storage solution in detail. Requirements for Science “Large” Capacity Many branches of Life Science research involve the generation, accumulation, analysis, and distribution of “large” amounts of data. What is considered “large” changes rapidly as data generation increases through advances in scientific methods and instrumentation. These advances are offset by capacity increases in storage technologies that are undergoing their own rapid evolution. Presently, Neuro-Imaging and Next-Generation Sequencing are branches of science churning out massive amounts of data that push the limits of “large”. We will explore these two specific examples in further detail. Neuro-Imaging A common Neuro-Imaging experiment involves fMRI (Functional Magnetic Resonance Imaging) to determine activated regions of the brain in response to a stimulus. This “brain mapping” is achieved by observing increased blood flow to the activated areas of the brain using an fMRI scanner. The scanning of a single human test subject might occur over a 60 to 90 minute period, with hundreds of discrete scans every few seconds, generating as much as 1GB of data per subject. A single instrument operating at only 50% capacity can produce many terabytes (1,000s of GBs) of data per year. The Neuro-Imaging centers interviewed for this paper utilize up to ten instruments, supporting dozens of scientists, each allocated a baseline of 2TB of disk space for their ongoing experiments. While this rapid scaling is a significant challenge for many labs, data growth of 10 to 20 TB per year is not unusual in these environments. “Next-Generation” DNA Sequencing DNA sequencing has undergone a revolution in recent years. Driven by novel sequencing chemistries, micro-fluidic systems, and reaction detection methods, “Next-Generation” sequencing instruments from 454, Illumina, ABI, and Helicos offer 100 to 1000-fold increased throughput, combined with an additional 100 to 1000-fold decreased cost per nucleotide when compared with conventional Sanger sequencing. This change has put high-throughput genome sequencing, once achievable by only a few major sequencing centers, within reach of many smaller research groups and individual research labs. The result for such labs is a dramatic increase in storage requirements from gigabytes to petabytes (1 million GB) in only the course of a couple of years. Each Next-Generation sequencing platform is unique in terms of the nature and volume of the data it generates. Typically, anywhere from 600GB (gigabytes) to 6TB (terabytes) of primary image data is written over a period of one to three days. By today’s standards, a terabyte is not large. However, for a single laboratory, accumulating and moving terabytes of data per day without loss can be a significant challenge, especially for small sequencing labs that have not yet adopted a highly scalable storage solution. ISILON SYSTEMS 3
  • 4. Accelerating Growth Storage capacity planning for Life Science research is particularly difficult in that requirements change rapidly and at irregular rates. Planning for growth according to the number of users or number of instruments is often insufficient when, for instance, a new grant can double capacity needs. Similarly a revolutionary new instrument might increase data production by an exponential amount. To be responsive to the requirements of Life Science research, an ideal storage architecture must be scalable in both small and large increments without requiring a system redesign or replacement. Ideally, a storage solution should have “pay-as-you-grow” characteristics that allow for growth as-needed. Variable File Types and Operations Life Science data is highly variable, both in composition and in the way that it is accessed. Therefore, an ideal storage system for Life Science organizations must have good I/O performance across these varied use cases: - Many small files or fewer big files - Text files and binary files - Sequential and random access - Highly concurrent access This variability is common to both neuro-imaging and next-generation sequencing. Massive simultaneous computations are performed upon many, large primary image files ranging in the gigabytes and requiring highly parallel streaming (I/O), resulting in fewer, smaller text files. The resulting data might be kept within directories containing thousands to hundreds-of-thousands of files, totaling many terabytes. Shared Read/Write Access Storage systems for Life Science data must be simultaneously accessible to many instruments, users, analysis computers, and data servers. These storage systems cannot reside in isolated silos with limited accessibility. They must, instead, permit concurrent, integrated, file-level read/write access across the entire organization with I/O bandwidth that scales to accommodate concurrent demand. A typical Neuro-Imaging or Next-Generation sequencing workflow involves the following steps: - Multiple instruments generate primary image data. - Large memory SMP machines and compute clusters distill the primary data into a derived form. - Researchers evaluate and annotate the data to answer scientific questions. - Researchers iterate on the above process, adding more primary data and refining their analyses. - Finally, results are served to a wider audience via internet repositories, usually accessed via FTP or HTTP. The requirements of the workflow above are the sum of requirements from instruments, researchers, computing systems, and customers. A sustainable storage plan for even a small research organization requires a system with shared, file-level read/write access to a common, large, scalable storage repository and should allow access by these common protocols: - NFS (Network File System) – The common network file system for UNIX instruments and analysis computers - SMB/CIFS (Server Message Block/Common Internet File System) – The common network file system for Windows-based instruments and user desktops ISILON SYSTEMS 4
  • 5. - HTTP (Hypertext Transfer Protocol) – The file transfer protocol used in the World Wide Web - FTP (File Transfer Protocol) – A common internet file transfer protocol for disseminating data Ease of Use At many levels, ease of use is the most significant storage requirement for Life Science research, even though it is generally the most difficult to quantify. Management The human resources involved in maintaining a large storage system range from just above zero to many FTEs (full time equivalents). The management of an ideal storage system should not require the hiring of additional, dedicated IT staff. Scaling Scaling a storage system’s capacity and/or performance, whether by fractional amounts or by orders of magnitude, should not require multiple man-months of meetings to plan, or even several man-days of IT technical expertise to implement. Scaling an ideal storage system should be able to be performed in minutes, independent of scale. User Ideally, the researcher is focused on science, not computers or disks. The researcher shouldn’t be concerned with or aware of volumes, capacity, formats, or how to access their data. Upon scaling storage a user might notice that capacity suddenly increased, but never experience an interruption in service. Understanding the Alternatives Common Feature Trade-offs Like most products, storage solutions compete based on their features. An ideal storage solution would excel at all features: have high I/O performance rates, never become inaccessible, never lose data, have the ability to become infinitely large, be scalable in both large and small increments, have a low purchase price, require little human effort to manage, and be easy to use. In the real world, decisions are based on which of these requirements are most important within a given budget. Storage decisions typically reduce to four factors: - Will this provide me sufficient performance and capacity for my present needs? - Will I experience any significant down-time or data loss? - Do I have the human resources needed to manage the system? - How long will it be before I need to upgrade this system and at what cost? When designing storage systems in a scientific research environment, many variables come into play. Present capacity needs may be the easiest to quantify, but are only a starting point. Performance requirements aren’t generally known until after the storage has been deployed and workflows are executed against data. Data loss is known to be a very bad thing, but quantifying the cost of loss is difficult when the core value to the lab might be a publication or a discovery. Labor costs may be very indirect; the use of graduate students as part-time systems ISILON SYSTEMS 5
  • 6. administrators is a prime example. Students come and go, which can impose high additional costs if storage systems are difficult to learn or require specialized training. Particularly in primary research, very early in a product pipeline, it can be difficult to set dollar values on these factors. However, such variables must be considered in order to make sensible storage infrastructure decisions. All storage systems have maximum scaling limits. However with some, once this limit is reached one must either (a) deploy a 2nd standalone system or (b) perform a “forklift upgrade” in which major components are retired and replaced with bigger/newer ones. In addition to the decision parameters above, some storage architectures can be tuned to optimize certain facets of their performance. This means that they can be configured to excel at one feature or another, but not all at the same time: - I/O Performance – The speed of writing data to disk and reading it back to the CPU and user. This might be further broken down into transactional, sequential, and random access patterns. - Availability – The cost associated with maintaining uninterrupted access to the data - Reliability – The cost associated with mitigating the risk of data loss - Maximum Scalability – The largest data volume the storage system can ever hold - Dynamic Configuration – The cost in time and effort to make a change to the system - Resolution of Scale – The smallest increment by which the storage can be made larger - Purchase Price – The cost to purchase the system - Total Cost of Ownership – The cost to buy and operate the system over its useful life - Ease of Use – The cost in time and effort to get it working and keep it working Direct Attached Storage (DAS) Discrete Disk The local disk method is the most familiar means of providing storage, either directly attached to the motherboard of a system, or attached through USB, Firewire, SCSI, or Fibre Channel cables. While it may seem odd (when discussing critical, enterprise-class storage systems) to consider using discrete local disks, due to the explosion of data and the ubiquity of simple DAS type storage, labs containing shelves filled with Firewire drives being used as long-term archive solutions are not that uncommon. These discrete disks can be purchased for as little as $200- $300 per TB and are fairly easy to use. In the very short term, this may seem to be a reasonable purchasing decision, coupled with the reagents necessary for an instrument run. However, combining these disks into a larger storage pool is impossible. Accessing the needed data means locating the correct disk and plugging it into the correct computer. This is not practical at scale, even for very small environments. We describe this only as a baseline, and it should not be considered a reasonable enterprise-class storage solution. ISILON SYSTEMS 6
  • 7. Redundant Array of Independent Disks (RAID) RAID is a technology that combines two or more disks to achieve greater volume capacity, performance, or reliability. A RAID might be directly attached to one computer as in DAS, or indirectly by a SAN (Storage Area Network), or provide the underlying storage for a NAS (Network Attached Storage). With RAID, data is striped across all of the disks to create a “RAID Set” and depending on how the data is sorted to the individual disks, a RAID is highly tunable: RAID 0: Maximum capacity, maximum risk RAID 1: Maximum read performance, minimum risk RAID 5: Balance capacity, performance, and risk RAID 6: Capacity and performance, with less risk than RAID 5 Storage Area Network (SAN) A SAN is an indirect means of attaching disks to a computer. Rather than connecting directly to the motherboard, a computer connects indirectly through a SAN (typically by Fibre Channel, iSCSI, or proprietary low-latency network). A SAN provides a convenient means to centralize the management of many disks into a common storage pool and then permit the allocation of logical portions of this pool out to one or more computers. This common storage pool can be expanded by attaching additional disk to the SAN. SAN clients have block-level access (which appears as a local disk) to “their” portion of the SAN, however they have no access to other portions allocated to other computers. ISILON SYSTEMS 7
Network Attached Storage (NAS)
DAS and SAN are methods for attaching disks to a computer, and RAID is a technology for configuring many disks to achieve different I/O characteristics. A NAS is neither of these. A NAS is a computer (a file server) that contains, or is attached to, one or more local disks or RAIDs and runs a file-sharing service that provides concurrent, file-level access to many computers over a TCP/IP network. The performance of a NAS is therefore often limited by the number of clients competing for access to the single file server.

NAS/SAN Hybrid
Naturally, the files served by a NAS can be stored on disks provided over a SAN. A NAS/SAN hybrid is a common approach to overcoming the performance limitations of a NAS by distributing the file-serving load over multiple file servers attached to a common SAN. In theory, this provides a means to increase the number of file servers in response to the demand of competing clients. In practice, the approach often breaks down when one of the file servers is serving a SAN volume holding more "popular" data than the others (a toy model of this failure mode appears after the discussion of asymmetric clustered storage below). Responding to such a hot spot requires careful monitoring of demand and moving, copying, or synchronizing data across the various SAN volumes. As capacity grows, the management of a NAS/SAN hybrid becomes increasingly complex.

Asymmetric Clustered Storage
Asymmetric clustered storage is implemented by a software layer operating on top of a NAS/SAN hybrid architecture. The storage of many SAN volumes is "clustered" and allocated to separate computers, but the architecture is asymmetric in that nodes have specialized roles in providing the service. Each NAS file server has concurrent access to all of the SAN volumes, so the manual redistribution of NAS servers and SAN volumes is no longer required. This concurrency is managed through the addition of another type of computer, frequently called the "Metadata Controller," which ensures that the file servers take proper turns accessing the SAN volumes and files.
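The toy model below illustrates the NAS bottleneck and the hybrid hot-spot problem described above. The throughput figures are hypothetical and the model ignores caching, protocol overhead, and mixed workloads; it is meant only to show the shape of the problem.

    # Toy model: per-client throughput behind one or more NAS file servers.
    # All numbers are hypothetical placeholders.

    def per_client_mb_s(clients_on_server, server_limit_mb_s, client_link_mb_s=100):
        """Clients split their server's bandwidth evenly, capped by their own network link."""
        return min(server_limit_mb_s / max(clients_on_server, 1), client_link_mb_s)

    server_limit = 1000   # assumed MB/s a single NAS head can deliver
    clients = 80

    # A single NAS: every client competes for the same file server.
    print("single NAS         :", per_client_mb_s(clients, server_limit), "MB/s per client")

    # NAS/SAN hybrid with four heads and perfectly even demand.
    print("4 heads, even load :", per_client_mb_s(clients // 4, server_limit), "MB/s per client")

    # The same hybrid when "popular" data pulls 60 of the 80 clients onto one head.
    print("4 heads, hot spot  :", round(per_client_mb_s(60, server_limit), 1), "MB/s per client")

The hot-spot case is the failure mode noted above: adding file servers only helps when demand spreads evenly across the SAN volumes behind them.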
Symmetric Clustered Storage
Symmetric clustered storage provides a balanced architecture for scalable storage and concurrent read/write access to a common network file system. It pools both the storage and the file-serving capabilities of many anonymous, identical systems. While the individual "nodes" within the cluster manage their directly attached disks, a software layer allows the nodes to work together to present an integrated whole. This approach maximizes the ability to "scale out" storage by adding nodes. It is the approach offered by Isilon Systems and is described in greater detail below.

Isilon Clustered Storage Solution
Life Science research presents an ever-changing, highly automated laboratory data environment, which can be a major challenge for computing and storage vendors. Isilon IQ is a flexible, scalable system in which capacity can be added in parallel with new equipment or laboratory capacity. Once the system is configured, scientists can reasonably expect to spend most of their time doing science rather than worrying about whether a new sequencing machine will require a complete overhaul of the storage system. Isilon IQ's biggest strengths are its ability to scale capacity and/or performance linearly or independently, symmetric data access, and ease of use. Several customers shared that the staffing requirements for their data storage environment dropped to near zero once they installed the Isilon system.

The Isilon storage product is a fully symmetric, file-based, clustered architecture. The hardware consists of industry-standard x86 servers that arrive pre-configured with Isilon's patented OneFS operating system. These "nodes" connect to each other over a low-latency, high-bandwidth InfiniBand network. File system clients access data over a pair of Ethernet ports on each node. Clients connect to a single "virtual" network address, and OneFS dynamically distributes the actual connections among the nodes. This means that input/output bandwidth, caching, and latency are shared across the entire system.
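To illustrate the symmetric model described above in the abstract, the toy sketch below spreads a file's blocks across identical nodes and gives every node the same layout map, so a request arriving at any node can be satisfied. This is a conceptual illustration only, not a description of OneFS internals, and all names are hypothetical.

    # Conceptual toy of a symmetric cluster: identical nodes, and every node knows
    # the full file layout, so a request arriving at any node can be satisfied.
    # Illustrative only; this is not a description of OneFS internals.

    class ToyCluster:
        def __init__(self, node_count):
            self.nodes = {n: {} for n in range(node_count)}  # node id -> {(file, idx): block}
            self.layout = {}                                 # (file, idx) -> node id, known to all nodes

        def write(self, name, data, block_size=4):
            blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
            for idx, block in enumerate(blocks):
                node = idx % len(self.nodes)                 # naive round-robin placement
                self.nodes[node][(name, idx)] = block
                self.layout[(name, idx)] = node

        def read(self, name, via_node):
            # via_node is only the entry point; the shared layout lets it fetch every block.
            chunks, idx = [], 0
            while (name, idx) in self.layout:
                chunks.append(self.nodes[self.layout[(name, idx)]][(name, idx)])
                idx += 1
            return b"".join(chunks)

    cluster = ToyCluster(node_count=3)
    cluster.write("subject_042.nii", b"0123456789abcdef")
    assert cluster.read("subject_042.nii", via_node=0) == cluster.read("subject_042.nii", via_node=2)
    print("any node returns the same file")

Because placement knowledge lives in every node rather than in a dedicated metadata controller, there is no single server to saturate or lose.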
Performance scales smoothly as nodes are added, in contrast to gateway-based master-slave architectures, where the gateway inevitably becomes a bottleneck. Because all connections go to the same virtual address, adding or removing nodes requires no client-side reconfiguration. Users simply continue working with the same data volume, but with more capacity behind it.

OneFS manages all aspects of the system, including detecting and optimizing new nodes, which makes configuration and expansion remarkably simple. All the complexity usually associated with adding new storage is an "under the hood" activity handled by the operating system. Administrators add new nodes to an existing cluster by connecting the network and power cables and powering on. The nodes detect the existing cluster and make themselves available to it, and OneFS then smoothly re-balances data across all the nodes. During this process, all file system mount points remain available (a conceptual sketch of this kind of re-balancing appears at the end of this section). Compared with the usual headache of taking an entire data storage system down for days at a time to migrate existing data off to tape, build a new, larger system, and re-import the data, this is a remarkable benefit.

Isilon nodes of different ages and sizes can be intermixed, which eliminates one of the major risks of investing in an integrated storage system. Computing hardware doubles in capacity and speed on a regular basis, so any system intended for use over more than a year or two must assume that expansions may consist of substantially different hardware than the original system. Storage architectures that cannot smoothly support heterogeneous hardware frequently require a "forklift" upgrade, in which the old system is decommissioned to make space for a new one. Such organizations are then in a constant state of flux, migrating data, with the associated disruption to both user access and automated pipelines. With an Isilon system, by contrast, older nodes may remain in service as long as their hardware is functional. Because data management and re-balancing are handled by OneFS, when it is time to retire a storage node administrators simply instruct the system to migrate the data off of that node, wait for the migration to complete, and then power it off and remove it from the cluster. During the migration, all data remains available for user access.

OneFS Operating System
To address the scalable distributed file system requirement of clustered storage, Isilon built, patented, and delivered the revolutionary OneFS operating system, which combines three layers of traditional storage architectures (the file system, volume manager, and RAID controller) into a unified software layer. This creates a single intelligent, fully symmetric file system that spans all nodes within a cluster. File striping takes place across multiple storage nodes rather than across individual disks within a volume or RAID array. This provides a very specific benefit: no single node is 100% responsible for any particular file. An Isilon IQ system can therefore withstand the loss of multiple disks or entire nodes without losing access to any data. OneFS provides each node with knowledge of the entire file system layout, so accessing any node gives a user access to all content in one unified namespace: there are no volumes or shares, no inflexible volume size limits, no downtime for reconfiguration or expansion of storage, and no multiple network drives to manage. This seamless integration of a fully symmetric clustered architecture makes Isilon unique in the storage landscape.
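As referenced earlier in this section, the sketch below shows, in purely conceptual terms, what re-balancing after a node joins a pool looks like: block placement changes, but the namespace that clients see does not. This is illustrative only and is not OneFS's actual placement or re-balancing algorithm; all names and counts are hypothetical.

    # Conceptual sketch of re-balancing when a new node joins an existing pool.
    # Illustrative only; this is not OneFS's actual placement or re-balancing algorithm.

    def rebalance(nodes):
        """Move blocks from the fullest node to the emptiest until counts are nearly even."""
        moved = 0
        while True:
            fullest = max(nodes, key=lambda n: len(nodes[n]))
            emptiest = min(nodes, key=lambda n: len(nodes[n]))
            if len(nodes[fullest]) - len(nodes[emptiest]) <= 1:
                return moved
            nodes[emptiest].append(nodes[fullest].pop())
            moved += 1

    # Three existing nodes, each holding twelve (hypothetical) data blocks.
    nodes = {n: [f"block-{n}-{i}" for i in range(12)] for n in range(3)}

    nodes[3] = []                    # a fourth node is cabled up and joins the pool
    moved = rebalance(nodes)

    print("blocks moved during re-balance:", moved)
    print("blocks per node:", {n: len(b) for n, b in nodes.items()})
    # Only block placement changes; the namespace that clients mount is untouched,
    # so no client-side reconfiguration or downtime is needed.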
Inherent High Availability & Reliability
Non-symmetric data storage architectures contain intrinsic dependencies that create points of failure and bottlenecks. One way to ensure data integrity and eliminate single points of failure is to make all nodes in a cluster peers. Each node in an Isilon cluster can handle a request from any client and can provide any content; if any particular node fails, the other nodes dynamically fill in. This is true for all of the file system's tasks. Because there is no dedicated "head" to the system, all aspects of system performance can be balanced across the entire cluster: I/O, availability, storage capacity, and resilience to hardware failure are all distributed.
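As a purely illustrative sketch, and not Isilon's client or failover protocol, the snippet below shows the basic pattern that a peer architecture enables: because any surviving node can serve any request, riding through a node failure is a matter of retrying against another peer. All class and path names are hypothetical.

    # Illustrative peer-failover pattern: any surviving peer can serve any request.
    # Hypothetical names; this is not Isilon's client or failover protocol.

    class Node:
        def __init__(self, name, healthy=True):
            self.name, self.healthy = name, healthy

        def serve(self, path):
            if not self.healthy:
                raise ConnectionError(f"{self.name} is down")
            return f"{path} served by {self.name}"   # every peer can return every file

    def read_with_failover(nodes, path):
        for node in nodes:
            try:
                return node.serve(path)              # another peer simply fills in
            except ConnectionError:
                continue
        raise RuntimeError("no nodes available")

    cluster = [Node("node-1", healthy=False), Node("node-2"), Node("node-3")]
    print(read_with_failover(cluster, "/projects/fmri/subject_042.nii"))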
OneFS further increases the availability of Isilon's clustered storage solution by providing multi-failure support of n+4. Simply stated, this means an Isilon cluster can withstand the simultaneous loss of four disks or four entire nodes without losing access to any content, and without requiring dedicated parity drives. Additionally, self-healing capabilities and high levels of hardware redundancy greatly reduce the chances of a production node failing in the first place.

In the event of a failure, OneFS automatically rebuilds files in parallel across the distributed free space in the cluster, eliminating the need for dedicated "hot" spare drives. Because OneFS leverages all available free space across all nodes to rebuild data, the window of vulnerability during the rebuild process is minimized. Isilon IQ also constantly monitors the health of all files and disks, maintaining records of the SMART statistics (e.g. recoverable read errors) available on each drive in order to anticipate when that drive will fail. This is simply an automation of a "best practice" that many system administrators wish they had time to perform. When Isilon IQ identifies an at-risk component, it preemptively migrates data off of the at-risk disk to available free space on the cluster, automatically and transparently to the user. Once the data is migrated, the administrator is notified to service the suspect drive in advance of an actual failure. This gives customers confidence that data written today will be stored reliably and be available whenever it is needed. No other solution today provides this level of data protection and reliability.

Single Level of Management
An Isilon IQ cluster creates a single, shared pool of all content, providing one point of access for users and one point of management for administrators, with a single file system or volume of up to 1.6 petabytes. Users can connect to any storage node and securely access all of the content within the entire cluster. This means there is a single point for all applications to connect to, and every application has visibility and access to every file in the entire file system (subject, of course, to security and permission policies).

Linear Scalability in Performance & Capacity
One of the key benefits of an Isilon IQ cluster is the ease with which administrators can add both performance and capacity without downtime or application changes. System administrators simply insert a new Isilon IQ storage node, connect the network cables, and power it up. The cluster automatically detects the newly added node and begins to configure it as a member of the cluster. In less than 60 seconds, an administrator can grow a single file system by 2 to 12 terabytes and increase throughput by an additional 2 gigabits per second. Isilon's modular approach offers a building-block ("pay-as-you-grow") solution, so customers are not forced to buy more storage capacity than they need up front. The modular design also enables customers to incorporate new technologies into the same cluster, such as adding a node with higher-density disk drives, more CPU horsepower, or higher total performance.
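As a rough illustration of the building-block model, the sketch below tabulates how raw capacity and aggregate throughput grow as nodes are added, using the per-node ranges cited above (2 to 12 terabytes and 2 gigabits per second). Actual figures vary by node model, and protection overhead (roughly m/(n+m) of raw space for an n+m protection level) is ignored, so treat the output as indicative only.

    # Rough "pay-as-you-grow" arithmetic using the per-node figures cited in this paper.
    # Actual numbers vary by node model; protection overhead is ignored for simplicity.

    def cluster_totals(node_count, tb_per_node, gbps_per_node=2):
        return node_count * tb_per_node, node_count * gbps_per_node

    print(f"{'nodes':>5} {'raw TB':>7} {'Gb/s':>5}")
    for node_count in (3, 6, 12, 24):
        tb, gbps = cluster_totals(node_count, tb_per_node=12)  # assuming the largest node option
        print(f"{node_count:>5} {tb:>7} {gbps:>5}")

    # Each added node contributes disks *and* network/CPU resources, which is why capacity
    # and throughput grow together instead of saturating a single gateway or filer head.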
Conclusion
We have described the data storage needs of Life Science researchers, summarized the major data storage architectures currently in use, and presented the Isilon IQ product as a strong and flexible solution to those needs.

Data storage is a critical component of modern scientific research. As smaller labs and individual researchers become responsible for terabytes and petabytes of data, understanding the options and trade-offs will become ever more important.