Exploring EMC Isilon scale-out storage solutions

Hadoop’s Rise
in Life Sciences
By John Russell, Contributing Editor, Bio•IT World




Produced by Cambridge Healthtech Media Group
By now the ‘Big Data’ challenge is familiar to the entire life sciences
community. Modern high-throughput experimental technologies generate
vast data sets that can only be tackled with high performance computing
(HPC). Genomics, of course, is the leading example. At the end of 2011,
global annual sequencing capacity was estimated at 13 quadrillion
bases and growing rapidly1. It’s worth noting that a single base pair typically
represents about 100 bytes of data (raw, analyzed, and interpreted); at that
rate, worldwide sequencing alone implies on the order of an exabyte of new
data each year.

The need to manage and analyze these massive data sets, not just in life
sciences but throughout all of science and industry, has spurred many new
approaches to HPC infrastructure and led to many important IT advances,
particularly in distributed computing. While there isn’t a single right
answer, one approach – the Hadoop storage and compute framework – is
emerging as a compelling contender for use in life sciences to cope with the
deluge of data.

Created in 2004 by Doug Cutting (who famously named it after his son’s
stuffed elephant) and elevated to a top-level Apache Foundation project
in 2008, Hadoop is intended to run large-scale distributed data analysis
on commodity clusters. Cutting was initially inspired by a paper2 from
Google Labs describing Google’s BigTable infrastructure and MapReduce
application layers. (For a detailed perspective see Ronald Taylor’s An
overview of the Hadoop/MapReduce/HBase framework and its current
applications in bioinformatics.3)

Broadly, Hadoop uses a file system (the Hadoop Distributed File System,
or HDFS) and framework software (MapReduce) to break extremely large
data sets into chunks, to distribute/store (Map) those chunks to nodes in
a cluster, and to gather (Reduce) results following computation. Hadoop’s
distinguishing feature is that it automatically stores the chunks of data on the
same nodes on which they will be processed. This strategy of co-locating
data and processing power (proximity computing) significantly
accelerates performance. In April 2008 a Hadoop program, running
on a 910-node cluster, broke a world record by sorting a terabyte of data in
less than 3.5 minutes.4
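
To make the Map and Reduce roles concrete, here is a minimal sketch of a toy Hadoop
job written against the standard org.apache.hadoop.mapreduce Java API: the mapper
emits a (base, 1) pair for every nucleotide in a line of sequence text, and the reducer
sums the counts per base. The class and method names are illustrative only and are not
drawn from any published pipeline.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class BaseCount {

        // Map: for each line of sequence text, emit (base, 1) for every nucleotide.
        public static class BaseMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text base = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String s = line.toString().trim();
                if (s.isEmpty() || s.startsWith(">")) return;   // skip FASTA headers
                for (char c : s.toUpperCase().toCharArray()) {
                    base.set(String.valueOf(c));
                    context.write(base, ONE);                   // key:value pair sent to the shuffle
                }
            }
        }

        // Reduce: gather all the counts for one base and sum them.
        public static class BaseReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text base, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                context.write(base, new IntWritable(sum));      // (base, total count)
            }
        }
    }

Because HDFS splits the input into blocks, Hadoop can schedule each map task on a node
that already holds its block whenever possible, which is the proximity computing
described above.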




1	 “DNA Sequencing Caught in Deluge of Data”, New York Times, Nov. 30, 2011, http://www.nytimes.com/2011/12/01/business/dna-sequencing-caught-in-deluge-of-data.html?_r=1&ref=science

2	 Dean J, Ghemawat S, “MapReduce: Simplified Data Processing on Large Clusters”, OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004, http://research.google.com/archive/mapreduce.html

3	 An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3040523/

4	 “Hadoop wins Terabyte sort benchmark”, Apr. 2008, http://sortbenchmark.org/YahooHadoop.pdf, last accessed Dec. 2011



Part of the improved performance stems from MapReduce’s key:value
programming model, which speeds up and scales up parallelized
“job” execution better than many alternatives such as the GridEngine
architecture for high performance computing (HPC). (One of the earliest
use cases of the Sun GridEngine5 HPC was the DNA sequence comparison
BLAST search.) The MapReduce layer is a batch query processor with a
dynamic data schema and linear scaling for unstructured or semi-structured
data. Its data is not “normalized” (decomposed into smaller structured
relationships). As a result, higher-level interpreted programming languages
like Ruby and Python, as well as a compiled language like C++, provide
easy access to MapReduce for expressing programs as MapReduce “jobs”.

Standard Hadoop interfaces are available via Java, C, FUSE and WebDAV.
The Hadoop R (statistical language) interface, RHIPE, is also popular in the
life sciences community.
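
As a sketch of how such a program is packaged and submitted as a MapReduce “job”
through the standard Java API, a minimal driver might look like the following. It
assumes the hypothetical BaseCount mapper and reducer sketched earlier; the input and
output paths are placeholders supplied on the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BaseCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "base count");
            job.setJarByClass(BaseCountDriver.class);
            job.setMapperClass(BaseCount.BaseMapper.class);
            job.setCombinerClass(BaseCount.BaseReducer.class);  // local pre-aggregation before the shuffle
            job.setReducerClass(BaseCount.BaseReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS directory of reads
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }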

It turns out that Hadoop – a fault-tolerant, share-nothing architecture
in which tasks must have no dependence on each other – is an
excellent choice for many life sciences applications. This is largely
because so much life sciences data is semi-structured or unstructured,
file-based data that is ideally suited for ‘embarrassingly parallel’ computation.
Moreover, the use of commodity hardware (e.g. a Linux cluster) keeps
costs down, and little or no hardware modification is required6.

Not surprisingly, life sciences organizations were among Hadoop’s
earliest adopters. The first large-scale MapReduce project was
initiated by the Broad Institute (in 2008) and resulted in the
comprehensive Genome Analysis Toolkit (GATK)7. The Hadoop
“Crossbow” project from Johns Hopkins University came soon after8.




5	 Altschul SF, et al, “Basic local alignment search tool”. J Mol Biol 215 (3): 403–410, October 1990.
6	 An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3040523/

7	 McKenna A, et al, “The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data”,
   Genome Research, 20:1297–1303, July 2010.

8	 http://bowtie-bio.sourceforge.net/crossbow/index.shtml


Here are a few current Hadoop-based bioinformatics applications9:
   •	 Crossbow. Whole genome resequencing analysis; SNP
       genotyping from short reads.

   •	 Contrail. De novo assembly from short sequencing reads.

   •	 Myrna. Ultrafast short read alignment and differential gene
       expression from large RNA-seq data sets.

   •	 PeakRanger. Cloud-enabled peak caller for ChIP-seq data.

   •	 Quake. Quality-aware detection and sequencing error
       correction tool.

   •	 BlastReduce. High-performance short read mapping.

   •	 CloudBLAST. Hadoop implementation of NCBI’s Blast.

   •	 MrsRF. Algorithm for analyzing large evolutionary trees.

(For a more detailed example of Hadoop in operation see the sidebar,
Genomics Example: Calling SNPs with Crossbow.)


   Genomics Example: Calling SNPs with Crossbow
   Next Generation Sequencers (NGS) like the Illumina HiSeq can produce on the
   order of 200 billion base pairs (200 Gbp) in a single one-week run for 60x human
   genome coverage, which means that each base is covered by an average of
   60 reads. The larger the coverage, the more statistically significant the result.
   Sequence reads are much shorter than traditional “Sanger” sequencing reads, so
   the data require specialized software algorithms called “short read aligners”.
   Crossbow is a combination of several algorithms that provide SNP calling and
   short read alignment, which are common tasks in NGS. Figure 1 shows the
   steps necessary to process genome data to look for SNPs. The
   Map-Sort-Reduce process is ideally suited to a Hadoop framework. The cluster
   as shown is a traditional N-node Hadoop cluster, and all of the Hadoop features,
   such as HDFS, program management and fault tolerance, are available.
   The Map step is the short read alignment algorithm, called Bowtie (named
   after the Burrows-Wheeler Transform, BWT). Multiple instances of Bowtie are
   run in parallel in Hadoop. The input tuples (ordered lists of elements) are the
   sequence reads and the output tuples are the alignments of the short reads.
   The Sort step apportions the alignments according to a primary key (the
   genome partition) and sorts them on a secondary key (the offset within
   that partition). The data here are the sorted alignments.
   The Reduce step calls SNPs for each reference genome partition. Many
   parallel instances of the algorithm SOAPsnp (Short Oligonucleotide Analysis
   Package for SNP) run in the cluster. Input tuples are the sorted alignments for a
   partition and the output tuples are SNP calls. Results are stored via HDFS, and
   then archived in SOAPsnp format.
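   The skeleton below is a conceptual sketch of this Map-Sort-Reduce flow using the
   Hadoop Java API, not Crossbow’s actual implementation: Crossbow drives Bowtie and
   SOAPsnp as external programs, the align() and callSnps() helpers here are hypothetical
   stand-ins, and the secondary sort on offset is omitted for brevity.

      import java.io.IOException;
      import java.util.Collections;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;

      public class SnpCallingSketch {

          // Map: align one short read and key the alignment by its genome partition.
          public static class AlignMapper extends Mapper<LongWritable, Text, Text, Text> {
              @Override
              protected void map(LongWritable offset, Text read, Context context)
                      throws IOException, InterruptedException {
                  Alignment aln = align(read.toString());       // Bowtie's role in Crossbow
                  if (aln == null) return;                      // read did not align
                  // The shuffle/sort groups alignments so each reducer sees one partition.
                  context.write(new Text(aln.partition), new Text(aln.record));
              }
          }

          // Reduce: call SNPs across the sorted alignments of one reference partition.
          public static class SnpReducer extends Reducer<Text, Text, Text, Text> {
              @Override
              protected void reduce(Text partition, Iterable<Text> alignments, Context context)
                      throws IOException, InterruptedException {
                  for (String snp : callSnps(partition.toString(), alignments)) {  // SOAPsnp's role
                      context.write(partition, new Text(snp));  // (partition, SNP call) written to HDFS
                  }
              }
          }

          // Hypothetical stand-ins for the external alignment and SNP-calling tools.
          static class Alignment { String partition; String record; }
          static Alignment align(String read) { return null; }
          static Iterable<String> callSnps(String partition, Iterable<Text> alignments) {
              return Collections.<String>emptyList();
          }
      }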



9	 Got Hadoop?, Sept. 2011, Genome Technology, http://www.genomeweb.com/informatics/got-hadoop


After several years of steady development in academic environments,
Hadoop is now poised for rapid commercialization and broader
uptake in biopharma and healthcare. Early adoption has been
strongest among next generation sequencing (NGS) centers, where
NGS workflows can generate 2 terabytes (TB) of data per run per
week per sequencer – and that’s not including the raw images. For these
organizations, the need for scale-out storage that integrates with
HPC is a line-item requirement.

EMC® Isilon®, long a leader in scale-out NAS storage solutions,
understands these challenges and today provides scale-out storage
for nearly all the workflows of every DNA sequencer instrument
manufacturer on the market, at more than 150 customers.
Since 2008, the EMC Isilon OneFS® storage platform has grown to an
overall installed base of more than 65 petabytes (PB). Recently, EMC
introduced the industry’s first scale-out NAS system with native
Hadoop support (via HDFS).

The EMC Isilon OneFS file system now provides connectivity to
the Hadoop Distributed File System (HDFS) just like any other shared
file system protocol: NFS, CIFS or SMB10. This allows storage to be
co-located with its compute nodes while the standard higher-level Java
application programming interface (API) is used to build
MapReduce “jobs”. EMC has gone one step further by combining its
OneFS-based NAS solution with EMC Greenplum® HD, a powerful
analytics platform, to create a Hadoop appliance. Together, the two
offerings relieve users of the burden of cobbling together various open
source Hadoop components, which sometimes proves problematic.
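
Because HDFS behaves as just another file system protocol in this model, a client
application can reach the data through the ordinary Hadoop FileSystem API. The short
sketch below simply lists a directory over HDFS; the endpoint URI and path are
illustrative placeholders, not Isilon-specific values.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsList {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point at the cluster's HDFS endpoint; hostname and port are illustrative.
            conf.set("fs.defaultFS", "hdfs://namenode.example.org:8020");
            try (FileSystem fs = FileSystem.get(conf)) {
                for (FileStatus status : fs.listStatus(new Path("/data/sequencing"))) {
                    System.out.printf("%12d  %s%n", status.getLen(), status.getPath());
                }
            }
        }
    }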

“Hadoop meets all the tenets of Jim Gray’s Laws of Data
Engineering11, which have not changed in 15 years,” says Sanjay
Joshi, CTO, Life Sciences, EMC Isilon Storage Division. Those tenets
include: scientific computing is very data intensive, with no real
limits; the solution is a scale-out architecture with distributed data
access; and bring computation to the data, rather than data to the
computation.




10	 Hadoop on EMC Isilon Scale Out NAS: EMC White Paper, Part Number h10528
11	 From Jim Gray, “Scalable Computing”, presentation at Nortel: Microsoft Research, April 1999


“Isilon built the industry’s first scale-out storage architecture. Now
with its native and enterprise-ready HDFS protocol via OneFS and
Greenplum HD, EMC brings simplicity to Big Data in Science,”
says Joshi.

EMC Isilon OneFS combines the three layers of traditional storage
architectures—the file system, volume manager, and RAID—into
one unified software layer, creating a single intelligent distributed
file system that runs on one storage cluster. Important advantages of
OneFS for Hadoop are:

   •	 Scalable: Linear scaling with increasing capacity – from 18 TB
      to 16 PB in a single file system and a single global namespace.
      Scale out as needs grow, independent of the compute layer.
   •	 Predictable: Dynamic content balancing is performed as nodes
      are added or upgraded or as capacity changes. The process is
      simple and requires no added management time.
   •	 Available: OneFS protects your data from power loss, node
      or disk failures, loss of quorum and storage rebuilds by
      distributing data, metadata and parity across all nodes. It
      also eliminates the single point of failure of a Hadoop “name
      node”. OneFS is therefore “self-healing”.
   •	 Efficient: Compared to the average 50% efficiency of
      traditional RAID systems, OneFS provides over 80%
      efficiency, independent of CPU compute or cache. This
      efficiency is achieved by tiering the nodes into three types,
      as shown in the figure alongside, and by the pools within
      these node types. The efficiency extends to replication as
      well: the 3x copies that Hadoop normally requires are reduced
      to 1x storage, at more than 80% efficiency, via EMC Isilon’s
      HDFS protocol.
      [Figure: Storage tiers without fears, based on performance,
      reside in one global namespace, connected via a dedicated
      backend network.]
   •	 Enterprise-ready: Administration of the storage clusters is
      via an intuitive web-based UI. Connectivity to your process
      is through standard file protocols: CIFS, SMB, NFS, FTP/
      HTTP, iSCSI and HDFS. Standardized authentication and
      access control are available at scale: AD, LDAP and NIS.




CONCLUSION
What began as an internal project at Google in 2004 has now
matured into a scalable framework for two computing paradigms
that are particularly suited for the life sciences: parallelization and
distribution. Indeed, the post-processing streaming data patterns for
text strings, clustering and sorting – the core process patterns in the
life sciences – are ideal workflows for Hadoop.

Case in point: the Crossbow example cited earlier aligned Illumina
NGS reads for SNP calling over ‘35x’ coverage of the human genome in
under 3 hours using a 40-node Hadoop cluster, an order of magnitude
better than traditional HPC technology for parallel processes.

The EMC Isilon OneFS distributed file system handles the Hadoop
distributed file system, HDFS, just like any other shared file system,
and provides a shield against the single point of failure in Hadoop: the
name node. The Hybrid Cloud model (source data mirror) with
Hadoop as a Service (HaaS) is the current state of the art. For more
information visit EMC Isilon at http://www.emc.com/isilon.




  Summary of Hadoop Attributes:
  Overview
  •	Write Once Read Many times (WORM)
  •	Co-locates data with compute; uses a higher-level architecture with a Java API
  •	HDFS is a distributed file system that runs on large clusters
  Advantages
  •	Uses the MapReduce framework, a batch query processor that scales linearly
  •	EMC Isilon OneFS implements HDFS and eliminates the single point of failure, the “name node”
  •	Standard programming-language development: Java, Ruby, Python and C++ can create MapReduce jobs; FUSE and
    WebDAV interfaces provide architectural flexibility
  Challenges
  •	The HDFS block size is 128 MB (and can be increased), so large numbers of small files (<8 KB) reduce its
    performance: use Hadoop Archive (HAR)
  •	Data coherency and latency remain issues for large-scale implementations
  •	Not suited for low-latency, “in process” use cases like real-time, spectral or video analysis
  •	Data transfer from genome sequencing data sources to Hadoop clusters in the cloud remains an issue; the
    current business model is to mirror the data between source and cloud and then use a Hadoop as a Service
    model on the mirrored data



                                                                                 Hadoop’s Rise in Life Sciences | 7

Weitere ähnliche Inhalte

Was ist angesagt?

The Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific DiscoveryThe Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific DiscoveryIntel IT Center
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsYasin Memari
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformaticsc.titus.brown
 
Advanced Research Computing at York
Advanced Research Computing at YorkAdvanced Research Computing at York
Advanced Research Computing at YorkMing Li
 
Opportunities for HPC in pharma R&D - main deck
Opportunities for HPC in pharma R&D - main deckOpportunities for HPC in pharma R&D - main deck
Opportunities for HPC in pharma R&D - main deckPistoia Alliance
 
Next-generation sequencing: Data mangement
Next-generation sequencing: Data mangementNext-generation sequencing: Data mangement
Next-generation sequencing: Data mangementGuy Coates
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...Bonnie Hurwitz
 
Life sciences big data use cases
Life sciences big data use casesLife sciences big data use cases
Life sciences big data use casesGuy Coates
 
Machine Learning in Healthcare Diagnostics
Machine Learning in Healthcare DiagnosticsMachine Learning in Healthcare Diagnostics
Machine Learning in Healthcare DiagnosticsLarry Smarr
 
A study on cloud computing ppt n_24-12-2017
A study on cloud computing ppt n_24-12-2017A study on cloud computing ppt n_24-12-2017
A study on cloud computing ppt n_24-12-2017Manish K Patel
 
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio  Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFabricio  Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFlávio Codeço Coelho
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 
Future Architectures for genomics
Future Architectures for genomicsFuture Architectures for genomics
Future Architectures for genomicsGuy Coates
 
The Rise of Machine Intelligence
The Rise of Machine IntelligenceThe Rise of Machine Intelligence
The Rise of Machine IntelligenceLarry Smarr
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencingGuy Coates
 
2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4c.titus.brown
 

Was ist angesagt? (20)

2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
The Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific DiscoveryThe Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
The Gordon Data-intensive Supercomputer. Enabling Scientific Discovery
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data Genomics
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
Advanced Research Computing at York
Advanced Research Computing at YorkAdvanced Research Computing at York
Advanced Research Computing at York
 
FC Brochure & Insert
FC Brochure & InsertFC Brochure & Insert
FC Brochure & Insert
 
Opportunities for HPC in pharma R&D - main deck
Opportunities for HPC in pharma R&D - main deckOpportunities for HPC in pharma R&D - main deck
Opportunities for HPC in pharma R&D - main deck
 
Whither Small Data?
Whither Small Data?Whither Small Data?
Whither Small Data?
 
Next-generation sequencing: Data mangement
Next-generation sequencing: Data mangementNext-generation sequencing: Data mangement
Next-generation sequencing: Data mangement
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 
Life sciences big data use cases
Life sciences big data use casesLife sciences big data use cases
Life sciences big data use cases
 
Machine Learning in Healthcare Diagnostics
Machine Learning in Healthcare DiagnosticsMachine Learning in Healthcare Diagnostics
Machine Learning in Healthcare Diagnostics
 
A study on cloud computing ppt n_24-12-2017
A study on cloud computing ppt n_24-12-2017A study on cloud computing ppt n_24-12-2017
A study on cloud computing ppt n_24-12-2017
 
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio  Silva: Cloud Computing Technologies for Genomic Big Data AnalysisFabricio  Silva: Cloud Computing Technologies for Genomic Big Data Analysis
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data Analysis
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
Future Architectures for genomics
Future Architectures for genomicsFuture Architectures for genomics
Future Architectures for genomics
 
The Rise of Machine Intelligence
The Rise of Machine IntelligenceThe Rise of Machine Intelligence
The Rise of Machine Intelligence
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencing
 
2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4
 

Andere mochten auch

Risk Intelligence: Harnessing Risk, Exploiting Opportunity
Risk Intelligence: Harnessing Risk, Exploiting OpportunityRisk Intelligence: Harnessing Risk, Exploiting Opportunity
Risk Intelligence: Harnessing Risk, Exploiting OpportunityEMC
 
Cultural rev friday
Cultural rev fridayCultural rev friday
Cultural rev fridayTravis Klein
 
Tues exploration pre quiz + columbus
Tues exploration pre quiz + columbusTues exploration pre quiz + columbus
Tues exploration pre quiz + columbusTravis Klein
 
Full-time Prospectus 2012/13
Full-time Prospectus 2012/13Full-time Prospectus 2012/13
Full-time Prospectus 2012/13HelenTY
 
Wed quiz and communism
Wed quiz and communismWed quiz and communism
Wed quiz and communismTravis Klein
 
RSA MONTHLY FRAUD REPORT - September 2014
RSA MONTHLY FRAUD REPORT - September 2014RSA MONTHLY FRAUD REPORT - September 2014
RSA MONTHLY FRAUD REPORT - September 2014EMC
 
Day 1 FW Summer 2014
Day 1 FW Summer 2014Day 1 FW Summer 2014
Day 1 FW Summer 2014Travis Klein
 
Pivotal tc server_wp_migrating_jee_apps_042313
Pivotal tc server_wp_migrating_jee_apps_042313Pivotal tc server_wp_migrating_jee_apps_042313
Pivotal tc server_wp_migrating_jee_apps_042313EMC
 
Day 5 race & slavery
Day 5 race & slaveryDay 5 race & slavery
Day 5 race & slaveryTravis Klein
 
EMC Isilon: A Scalable Storage Platform for Big Data
EMC Isilon: A Scalable Storage Platform for Big DataEMC Isilon: A Scalable Storage Platform for Big Data
EMC Isilon: A Scalable Storage Platform for Big DataEMC
 
Swipp Plus Quick Start Guide
Swipp Plus Quick Start GuideSwipp Plus Quick Start Guide
Swipp Plus Quick Start GuideSwipp
 
OpenStack Swift Object Storage on EMC Isilon Scale-Out NAS
OpenStack Swift Object Storage on EMC Isilon Scale-Out NASOpenStack Swift Object Storage on EMC Isilon Scale-Out NAS
OpenStack Swift Object Storage on EMC Isilon Scale-Out NASEMC
 
The Evolution of IP Storage and Its Impact on the Network
The Evolution of IP Storage and Its Impact on the NetworkThe Evolution of IP Storage and Its Impact on the Network
The Evolution of IP Storage and Its Impact on the NetworkEMC
 
Food for dogs
Food for dogsFood for dogs
Food for dogsJohn1213
 
Wealth creation and academic health science networks emc aridhia and pivotal 0
Wealth creation and academic health science networks emc aridhia and pivotal  0Wealth creation and academic health science networks emc aridhia and pivotal  0
Wealth creation and academic health science networks emc aridhia and pivotal 0EMC
 

Andere mochten auch (20)

Risk Intelligence: Harnessing Risk, Exploiting Opportunity
Risk Intelligence: Harnessing Risk, Exploiting OpportunityRisk Intelligence: Harnessing Risk, Exploiting Opportunity
Risk Intelligence: Harnessing Risk, Exploiting Opportunity
 
Cultural rev friday
Cultural rev fridayCultural rev friday
Cultural rev friday
 
The ant
The antThe ant
The ant
 
Titanic
TitanicTitanic
Titanic
 
Tues exploration pre quiz + columbus
Tues exploration pre quiz + columbusTues exploration pre quiz + columbus
Tues exploration pre quiz + columbus
 
Full-time Prospectus 2012/13
Full-time Prospectus 2012/13Full-time Prospectus 2012/13
Full-time Prospectus 2012/13
 
Wed quiz and communism
Wed quiz and communismWed quiz and communism
Wed quiz and communism
 
RSA MONTHLY FRAUD REPORT - September 2014
RSA MONTHLY FRAUD REPORT - September 2014RSA MONTHLY FRAUD REPORT - September 2014
RSA MONTHLY FRAUD REPORT - September 2014
 
Day 1 FW Summer 2014
Day 1 FW Summer 2014Day 1 FW Summer 2014
Day 1 FW Summer 2014
 
Pivotal tc server_wp_migrating_jee_apps_042313
Pivotal tc server_wp_migrating_jee_apps_042313Pivotal tc server_wp_migrating_jee_apps_042313
Pivotal tc server_wp_migrating_jee_apps_042313
 
Day 5 race & slavery
Day 5 race & slaveryDay 5 race & slavery
Day 5 race & slavery
 
Process
ProcessProcess
Process
 
EMC Isilon: A Scalable Storage Platform for Big Data
EMC Isilon: A Scalable Storage Platform for Big DataEMC Isilon: A Scalable Storage Platform for Big Data
EMC Isilon: A Scalable Storage Platform for Big Data
 
Project info
Project infoProject info
Project info
 
Swipp Plus Quick Start Guide
Swipp Plus Quick Start GuideSwipp Plus Quick Start Guide
Swipp Plus Quick Start Guide
 
OpenStack Swift Object Storage on EMC Isilon Scale-Out NAS
OpenStack Swift Object Storage on EMC Isilon Scale-Out NASOpenStack Swift Object Storage on EMC Isilon Scale-Out NAS
OpenStack Swift Object Storage on EMC Isilon Scale-Out NAS
 
Presentation1
Presentation1Presentation1
Presentation1
 
The Evolution of IP Storage and Its Impact on the Network
The Evolution of IP Storage and Its Impact on the NetworkThe Evolution of IP Storage and Its Impact on the Network
The Evolution of IP Storage and Its Impact on the Network
 
Food for dogs
Food for dogsFood for dogs
Food for dogs
 
Wealth creation and academic health science networks emc aridhia and pivotal 0
Wealth creation and academic health science networks emc aridhia and pivotal  0Wealth creation and academic health science networks emc aridhia and pivotal  0
Wealth creation and academic health science networks emc aridhia and pivotal 0
 

Ähnlich wie Whitepaper : CHI: Hadoop's Rise in Life Sciences

Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterParalyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterIOSR Journals
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopChristopher Pezza
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methodspaperpublications3
 
An experimental evaluation of performance
An experimental evaluation of performanceAn experimental evaluation of performance
An experimental evaluation of performanceijcsa
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopGiovanna Roda
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Modeinventionjournals
 
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
62_Tazeen_Sayed_Hadoop_Ecosystem.pptxTazeenSayed3
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010BOSC 2010
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introductionsaisreealekhya
 
High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data Geoffrey Fox
 

Ähnlich wie Whitepaper : CHI: Hadoop's Rise in Life Sciences (20)

Hadoop.powerpoint.pptx
Hadoop.powerpoint.pptxHadoop.powerpoint.pptx
Hadoop.powerpoint.pptx
 
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop ClusterParalyzing Bioinformatics Applications Using Conducive Hadoop Cluster
Paralyzing Bioinformatics Applications Using Conducive Hadoop Cluster
 
Hadoop
HadoopHadoop
Hadoop
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
 
An experimental evaluation of performance
An experimental evaluation of performanceAn experimental evaluation of performance
An experimental evaluation of performance
 
Hadoop
HadoopHadoop
Hadoop
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
 
HDFS
HDFSHDFS
HDFS
 
Hadoop An Introduction
Hadoop An IntroductionHadoop An Introduction
Hadoop An Introduction
 
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
62_Tazeen_Sayed_Hadoop_Ecosystem.pptx
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
Hadoop info
Hadoop infoHadoop info
Hadoop info
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
 
High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data
 

Mehr von EMC

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDEMC
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote EMC
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOEMC
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremioEMC
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lakeEMC
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereEMC
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History EMC
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewEMC
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeEMC
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic EMC
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityEMC
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeEMC
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015EMC
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesEMC
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsEMC
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookEMC
 

Mehr von EMC (20)

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremio
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis Openstack
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lake
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop Elsewhere
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical Review
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or Foe
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for Security
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure Age
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education Services
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere Environments
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBook
 

Kürzlich hochgeladen

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Kürzlich hochgeladen (20)

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

Whitepaper : CHI: Hadoop's Rise in Life Sciences

  • 1. Exploring EMC Isilon scale-out storage solutions Hadoop’s Rise in Life Sciences By John Russell, Contributing Editor, Bio•IT World Produced by Cambridge Healthtech Media Group
  • 2. By now the ‘Big Data’ challenge is familiar to the entire life sciences community. Modern high-throughput experimental technologies generate The Hadoop Distributed File vast data sets that can only be tackled with high performance computing (HPC). Genomics, of course, is the leading example. At the end of 2011, System (HDFS) and compute global annual sequencing capacity was estimated at 13 quadrillion framework (MapReduce) bases and growing rapidly1. It’s worth noting a single base pair typically represents about 100 bytes of data (raw, analyzed, and interpreted). enable Hadoop to break extremely large data sets The need to manage and analyze these massive data sets, not just in life sciences but throughout all of science and industry, has spurred many new into chunks, to distribute/ approaches to HPC infrastructure and led to many important IT advances, store (Map) those chunks particularly in distributed computing. While there isn’t a single right answer, one approach – the Hadoop storage and compute framework – is to nodes in a cluster, and emerging as a compelling contender for use in life sciences to cope with the to gather (Reduce) results deluge of data. following computation. Created in 2004 by Doug Cutting (who famously named it after his son’s stuffed elephant) and elevated to a top-level Apache Foundation project in 2008, Hadoop is intended to run large-scale distributed data analysis on commodity clusters. Cutting was initially inspired by a paper2 from Google Labs describing Google’s BigTable infrastructure and MapReduce application layers. (For a detailed perspective see Ronald Taylor’s, An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics.3) Broadly, Hadoop uses a file system (Hadoop Distributed File System (HDFS) and framework software (MapReduce) to break extremely large data sets into chunks, to distribute/store (Map) those chunks to nodes in a cluster, and to gather (Reduce) results following computation. Hadoop’s distinguishing feature is it automatically stores the chunks of data on the same nodes on which they will be processed. This strategy of co-locating of data and processing power (proximity computing) significantly accelerates performance and in April 2008 a Hadoop program, running on 910-node cluster, broke a world record, sorting a terabyte of data in less than 3.5 minutes.4 1 DNA Sequencing Caught in Deluge of Data”, New York Times, Nov. 30, 2011, http://www.nytimes.com/2011/12/01/business/dna- sequencing-caught-in-deluge-of-data.html?_r=1&ref=science 2 OSDI’04: Sixth Symposium on Operating System Design and Implementation,
San Francisco, CA, December, 2004, http://research. google.com/archive/mapreduce.html 3 An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics, http://www.ncbi.nlm.nih.gov/ pmc/articles/PMC3040523/ 4 “Hadoop wins Terabyte sort benchmark”, Apr 2008, Apr. 2009, http://sortbenchmark.org/YahooHadoop.pdf last accessed Dec 2011 Hadoop’s Rise in Life Sciences | 2
  • 3. Part of the improved performance stems from MapReduce’s key:value programming model which speeds up and scales up parallelized It turns out that Hadoop – a “job” execution better than many alternatives such as the GridEngine architecture for High Performance Computing (HPC). (One of the earliest fault-tolerant, share-nothing use-cases of the Sun GridEngine5 HPC was the DNA sequence comparison architecture in which tasks BLAST search.) The MapReduce layer is a batch query processor with dynamic data schema and linear scaling for unstructured or semi- must have no dependence structured data. Its data is not “normalized” (decomposition of data on each other – is an into smaller structured relationships). Therefore higher level interpreted programming languages like Ruby and Python and a compiled language excellent choice for many like C++ provide easier access to MapReduce to represent the program as life sciences applications. MapReduce “jobs”. Standard Hadoop interfaces are available via Java, C, FUSE and WebDAV. The Hadoop R (statistical language) interface, RHIPE, is also popular in the life sciences community. It turns out that Hadoop – a fault-tolerant, share-nothing architecture in which tasks must have no dependence on each other – is an excellent choice for many life sciences applications. This is largely because so much of life sciences data is semi- or unstructured file- based data and ideally suited for ‘embarrassingly parallel’ computation. Moreover, the use of commodity hardware (e.g. Linux cluster) keeps cost down, and little or no hardware modification is required6. Not surprisingly life sciences organizations were among Hadoop’s earliest adopters. The first large-scale MapReduce project was initiated by the Broad Institute (in 2008) and resulted in the comprehensive Genome Analysis Tool Kit (GATK)7. The Hadoop “CrossBow” project from Johns Hopkins University came soon after8. 5 Altschul SF, et al, “Basic local alignment search tool”. J Mol Biol 215 (3): 403–410, October 1990. 6 An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics, http://www.ncbi.nlm.nih.gov/ pmc/articles/PMC3040523/ 7 McKenna A, et al, “The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data”, Genome Research, 20:1297–1303, July 2010. 8 http://bowtie-bio.sourceforge.net/crossbow/index.shtml Hadoop’s Rise in Life Sciences | 3
  • 4. Here are a few current Hadoop-based bioinformatics applications9: • Crossbow. Whole genome resequencing analysis; SNP genotyping from short reads.
 • Contrail. De novo assembly from short sequencing reads.
 • Myrna. Ultrafast short read alignment and differential gene expression from large RNA-seq data sets.
 • PeakRanger. Cloud-enabled peak caller for ChIP-seq data.
 • Quake. Quality-aware detection and sequencing error correction tool.
 • BlastReduce. High-performance short read mapping.
 • CloudBLAST. Hadoop implementation of NCBI’s Blast.
 • MrsRF. Algorithm for analyzing large evolutionary trees.
(For a more detailed example of Hadoop in operation, see the sidebar, Genomics Example: Calling SNPs with Crossbow; an illustrative code skeleton follows it.)

Genomics Example: Calling SNPs with Crossbow

Next Generation Sequencers (NGS) like the Illumina HiSeq can produce on the order of 200 billion base pairs (200 Gbp) in a single one-week run for 60x human genome coverage, which means that each base is covered by an average of 60 reads. The greater the coverage, the more statistically significant the result. Sequence reads are much shorter than traditional "Sanger" sequencing reads, so this data requires specialized software algorithms called "short read aligners".

Crossbow is a combination of several algorithms that provide SNP calling and short read alignment, which are common tasks in NGS. Figure 1 outlines the steps necessary to process genome data to look for SNPs. The Map-Sort-Reduce process is ideally suited to a Hadoop framework. The cluster shown is a traditional N-node Hadoop cluster, and all of the Hadoop features like HDFS, program management and fault tolerance are available.

The Map step is the short read alignment algorithm, called Bowtie (named after the Burrows Wheeler Transform, BWT). Multiple instances of Bowtie run in parallel in Hadoop. The input tuples (ordered lists of elements) are the sequence reads and the output tuples are the alignments of the short reads.

The Sort step apportions the alignments according to a primary key (the genome partition) and sorts them on a secondary key (the offset within that partition). The data here are the sorted alignments.

The Reduce step calls SNPs for each reference genome partition. Many parallel instances of the algorithm SOAPsnp (Short Oligonucleotide Analysis Package for SNP) run in the cluster. Input tuples are the sorted alignments for a partition and the output tuples are SNP calls. Results are stored via HDFS, and then archived in SOAPsnp format.

9 Got Hadoop?, Sept. 2011, Genome Technology, http://www.genomeweb.com/informatics/got-hadoop
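The sidebar's Map-Sort-Reduce flow can be pictured in code roughly as follows. This is only an illustrative skeleton of how alignment tuples might be keyed by genome partition so that each Reduce task sees one partition's alignments ordered by offset; it is not Crossbow's actual source, and the `align` and `callSnps` methods are hypothetical placeholders for the roles played by Bowtie and SOAPsnp. (In the full design described in the sidebar, the sort happens during Hadoop's shuffle via the primary/secondary key split; here the reducer sorts in memory for brevity.)

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative skeleton of the Map-Sort-Reduce pattern from the sidebar:
//   Map    - align each short read, emit (genome partition, alignment record)
//   Sort   - Hadoop groups the records by partition key during the shuffle
//   Reduce - order one partition's alignments by offset and call SNPs over them
public class PartitionedSnpPipeline {

  public static class AlignMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text read, Context context)
        throws IOException, InterruptedException {
      // Placeholder for the Bowtie step: returns "chrN:offset\t<details>" or null.
      String alignment = align(read.toString());
      if (alignment != null) {
        String partition = alignment.split(":", 2)[0];     // e.g. "chr1"
        context.write(new Text(partition), new Text(alignment));
      }
    }
  }

  public static class SnpReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text partition, Iterable<Text> alignments, Context context)
        throws IOException, InterruptedException {
      List<String> records = new ArrayList<>();
      for (Text a : alignments) records.add(a.toString());
      records.sort(Comparator.comparingLong(PartitionedSnpPipeline::offsetOf));
      // Placeholder for the SOAPsnp step over this partition's sorted alignments.
      for (String snp : callSnps(partition.toString(), records)) {
        context.write(partition, new Text(snp));
      }
    }
  }

  // Extracts the numeric offset from "chrN:offset\t<details>".
  static long offsetOf(String alignment) {
    String locus = alignment.split("\t", 2)[0];
    return Long.parseLong(locus.split(":", 2)[1]);
  }

  // Hypothetical stand-ins for the external tools; a real pipeline would invoke
  // Bowtie and SOAPsnp here rather than return empty results.
  static String align(String read) { return null; }
  static List<String> callSnps(String partition, List<String> sorted) {
    return new ArrayList<>();
  }
}
```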
After several years of steady development in academic environments, Hadoop is now poised for rapid commercialization and broader uptake in biopharma and healthcare. Early adoption has been strongest among next generation sequencing (NGS) centers, where NGS workflows can generate 2 TeraBytes (TB) of data per run per week per sequencer – and that does not include the raw images. For these organizations, the need for scale-out storage that integrates with HPC is a line-item requirement.

EMC® Isilon®, long a leader in scale-out NAS storage solutions, understands these challenges and has provided the scale-out storage for nearly all the workflows of the DNA sequencer instrument manufacturers in the market today, at more than 150 customers. Since 2008, the EMC Isilon OneFS® storage platform has grown to an overall installed base of more than 65 PetaBytes (PB).

Recently, EMC introduced the industry's first scale-out NAS system with native Hadoop support (via HDFS). The EMC Isilon OneFS file system now provides connectivity to the Hadoop Distributed File System (HDFS) just like any other shared file system protocol: NFS, CIFS or SMB10. This allows the storage to be co-located with its compute nodes, using the standard higher-level Java application programming interface (API) to build MapReduce "jobs" (a brief client-side sketch follows this section's footnotes). EMC has gone one step further by combining its OneFS-based NAS solution with EMC Greenplum® HD, a powerful analytics platform, to create a Hadoop appliance. Together, the two offerings relieve users of the burden of cobbling together various open source Hadoop components, which sometimes proves problematic.

"Hadoop meets all the tenets of Jim Gray's Laws of Data Engineering11, which have not changed in 15 years," says Sanjay Joshi, CTO, Life Sciences, EMC Isilon Storage Division. Those tenets include: scientific computing is very data intensive, with no real limits; the solution is a scale-out architecture with distributed data access; and bring computation to the data, rather than data to the computation.

10 Hadoop on EMC Isilon Scale Out NAS: EMC White Paper, Part Number h10528
11 From Jim Gray, "Scalable Computing", presentation at Nortel: Microsoft Research, April 1999
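For a sense of what that connectivity looks like from the compute side, the sketch below shows a Hadoop client treating a remote HDFS endpoint as its default file system and listing a directory. The host name and path are hypothetical placeholders; in practice `fs.defaultFS` is set once in `core-site.xml` to whatever name the storage cluster exposes for HDFS (see the EMC white paper cited in footnote 10 for the Isilon-specific details).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: point a Hadoop client at an HDFS endpoint and list a directory.
// The host name and path below are placeholders, not values from this paper.
public class ListSequencingRuns {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Normally configured once in core-site.xml rather than in code.
    conf.set("fs.defaultFS", "hdfs://storage-cluster.example.com:8020");

    try (FileSystem fs = FileSystem.get(conf)) {
      for (FileStatus f : fs.listStatus(new Path("/sequencing/runs"))) {
        System.out.println(f.getPath() + "\t" + f.getLen() + " bytes");
      }
    }
  }
}
```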
"Isilon built the industry's first Scale Out storage architecture. Now with its native and enterprise-ready HDFS protocol via OneFS and Greenplum HD, EMC brings simplicity to Big Data in Science," says Joshi.

EMC Isilon OneFS combines the three layers of traditional storage architectures (file system, volume manager and RAID) into one unified software layer, creating a single intelligent distributed file system that runs on one storage cluster. Important advantages of OneFS for Hadoop are:

• Scalable: Linear scaling with increasing capacity – from 18 TB to 16 PB in a single file system and a single global namespace. Scale out as needs grow, independent of the compute layer.

• Predictable: Dynamic content balancing is performed as nodes are added, upgraded, or capacity changes. This process is simple and requires no added management time.

• Available: OneFS protects your data from power loss, node or disk failures, loss of quorum and storage rebuild by distributing data, metadata and parity across all nodes. It also eliminates the single point of failure of the Hadoop "NameNode". OneFS is therefore "self healing".

• Efficient: Compared to the average 50% efficiency of traditional RAID systems, OneFS provides over 80% efficiency, independent of CPU compute or cache. This efficiency is achieved by tiering the nodes into three types and by the pools within these node types (the accompanying figure, captioned "Storage tiers without fears", shows tiers based on performance residing in one global namespace, connected via a dedicated backend network). The efficiency extends to the reduction from the 3x copies that Hadoop normally requires to a single >80%-efficient copy stored via EMC Isilon's HDFS protocol. For example, storing 100 TB of data at Hadoop's default 3x replication consumes roughly 300 TB of raw capacity, whereas one protected copy at >80% efficiency needs less than 125 TB.

• Enterprise-ready: Administration of the storage clusters is via an intuitive Web-based UI. Connectivity to your process is through standard file protocols: CIFS, SMB, NFS, FTP/HTTP, iSCSI and HDFS. Standardized authentication and access control is available at scale: AD, LDAP and NIS.
CONCLUSION

What began as an internal project at Google in 2004 has now matured into a scalable framework for two computing paradigms that are particularly suited to the life sciences: parallelization and distribution. Indeed, the post-processing streaming data patterns for text strings, clustering and sorting – the core process patterns in the life sciences – are ideal workflows for Hadoop. Case in point: the Crossbow example cited earlier aligned Illumina NGS reads for SNP calling over a 35x coverage of the human genome in under 3 hours on a 40-node Hadoop cluster – an order of magnitude better than traditional HPC technology for parallel processes.

The EMC Isilon OneFS distributed file system handles the Hadoop distributed file system, HDFS, just like any other shared file system, and provides a shield for the single point of failure in Hadoop: the NameNode. The Hybrid Cloud model (source data mirror) with Hadoop as a Service (HaaS) is the current state of the art.

For more information visit EMC Isilon at http://www.emc.com/isilon.

Summary of Hadoop Attributes

Overview
• Write Once Read Many times (WORM)
• Co-locates data with compute; uses a higher-level architecture with a Java API
• HDFS is a distributed file system that runs on large clusters

Advantages
• Uses the MapReduce framework – a batch query processor that scales linearly
• EMC Isilon OneFS implements HDFS and eliminates the single point of failure, the "NameNode"
• Standard programming languages – Java, Ruby, Python and C++ – can create MapReduce jobs; FUSE and WebDAV interfaces provide architectural flexibility

Challenges
• The HDFS block size is 128 MB (it can be increased), so large numbers of small files (<8 KB) reduce performance: use Hadoop Archive (HAR)
• Data coherency and latency remain issues for large-scale implementations
• Not suited to low-latency, "in process" use cases like real-time, spectral or video analysis
• Data transfer from genome sequencing data sources to Hadoop clusters in the Cloud remains an issue; the current business model is to mirror the data between source and Cloud and then use a Hadoop-as-a-Service model on the mirrored data