Ntino Cloud BioLinux Barcelona Spain 2012
1. Cloud BioLinux: Pre-configured Bioinformatics
Computing for the Genomics Community
Ntino Krampis
Asst. Professor - Informatics
J. Craig Venter Institute
kkrampis@jcvi.org
http://www.jcvi.org/cms/about/bios/kkrampis/
Tuesday, November 6, 12
2. J. Craig Venter Institute ( JCVI )
• Human Microbiome
Project (Nelson et al. Science
2010; 328: 994–99)
• NIH funded, launched in
2008, $115 million
• metagenomic sequencing
of microbial genomes
from the human body
• sequence everything in
sample, use informatics to
separate genomes
3. J. Craig Venter Institute
• Global Ocean Survey
(first publication, Venter et al.
Science 2004; 304: 66-74)
• metagenomic sequencing
of microbes from oceans
around the world
• Darwin’s route ?
• Numbers: HMP > 2 mil.
new proteins, GOS > 1.2
4. Big Data and sequencing
• JCVI sequencing facility:
454, Solexa, HiSeq, and
IonTorrent on the way
• Processed data: size
information content
• But... look at SOLiD 3
Source: http://www.politigenomics.com/next-generation-sequencing-informatics
5. JCVI: sequencing and computing
infrastructure
• “big” sequencing needs
large-scale informatics
• ~1000 node Grid Engine
cluster
• research with Hadoop /
MapReduce, and a small
private cloud
• 50+ bioinformaticians and
software developers
6. A new paradigm:
Low-cost, bench-top sequencers
• GS Junior (454), MiSeq (Illumina)
• complete sequencing of
bacterial, viral, fungal genomes
• RNA-seq (gene expression),
ChIP-seq (protein-DNA interactions),
gene variant discovery
• sequencing as a standard
technique in basic genetics
research - like PCR ?
7. Will smaller academic labs become the
long tail of sequencing ?
[Figure: long-tail plot. Y-axis: amount of sequencing. X-axis: number of labs.
Head of the curve: “sequencing factories” (JCVI, Broad Inst., Washington Univ., Inst. of Genome Sciences).
Tail of the curve: small academic labs with bench-top sequencers.]
8. Sequencers shipped without clusters
• Problem A : sequence
analysis requires
computational capacity
• genome assembly, BLAST,
gene finders - annotation
• Problem B: bioinformatics
tools need software
engineering expertise
• unix/linux operating
systems, maintaining
software libraries,
compiling source code
9. Each lab builds a cluster ?
• need additional funds to
buy the hardware
• funds for personnel to
maintain the cluster and
software
• duplication of effort
across labs
• sub-optimal utilization of
the hardware
10. Centralized bioinformatics services
• Bioinformatic Resource
Centers, e.g. GSCID
• bioinformatic services
usually coupled with
sequencing of a genome
• provide mostly data access
to external PIs
• cannot support every
lab with a sequencer
11. Problem A : sequence analysis requires
computational capacity
• Amazon Elastic Compute
Cloud (EC2), pay-by-the-
hour computing
• cloud servers cost
$0.085 - $2 per hour
• max capacity 64GB RAM /
8 CPU (can boot
hundreds of servers)
[Figure: world-wide data centers]
• 750 hours free for new users: aws.amazon.com/free/
• free compute for teaching: aws.amazon.com/grants/
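The pay-by-the-hour pricing above can be turned into a quick budget estimate. The following is a minimal sketch, not an official pricing calculator: the two rates are the ones quoted on the slide, the workload (10 servers for 23.5 hours) is hypothetical, and rounding partial hours up to a full billed hour is an assumption about how hourly billing is applied.

```python
import math

# Hourly price range quoted on the slide (USD per server-hour).
LOW_RATE = 0.085
HIGH_RATE = 2.00


def rental_cost(servers: int, hours: float, rate_per_hour: float) -> float:
    """Estimate the cost of renting `servers` machines for `hours` each.

    Assumption: a partial hour is billed as a full hour, so the
    duration is rounded up before multiplying.
    """
    billed_hours = math.ceil(hours)
    return servers * billed_hours * rate_per_hour


# Hypothetical workload: 10 servers running for 23.5 hours each.
low = rental_cost(10, 23.5, LOW_RATE)
high = rental_cost(10, 23.5, HIGH_RATE)
print(f"Estimated cost: ${low:.2f} to ${high:.2f}")
```

Because the servers are only paid for while they run, the estimate scales with the length of the analysis rather than with the size of a permanently installed cluster.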
12. Cloud Computing and Virtualization
• OS, software and data,
pre-installed in Virtual
Machine (VM)
• cloud provider: hardware
and virtualization layer
• VM is a full-featured
server in a single file
• VM transfer on private
cloud
Credit: VMware Inc.
13. Problem B: bioinformatics tools need
software engineering expertise
• VM with pre-installed software
on the cloud
• avoid compiling source code, or
other software dependencies
• rent computational capacity, on
a pay as you go basis
• run the VM on the closest
Amazon data center
14. Solving Problems A & B :
Cloud BioLinux
• Cloud BioLinux: publicly
accessible VM on EC2
• 100+ pre-installed
bioinformatics tools
• remote desktop for non-
command line experts
• you can create a cluster with
Cloud BioLinux - CloudMan
Krampis K, Booth T, Chapman B, Tiwari B, Bicak M, Field D, Nelson K
Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community.
BMC Bioinformatics. 2012 Mar 19; 13: 42.
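Starting a publicly accessible VM on EC2 can also be scripted. The sketch below is one possible way to do it, assuming the third-party boto3 library and configured AWS credentials; it is not taken from the talk. The image ID `ami-PLACEHOLDER`, the instance type and the region are hypothetical values that must be replaced with real ones (the actual Cloud BioLinux image ID is not given on the slides).

```python
def launch_request(image_id: str, instance_type: str = "m1.large",
                   count: int = 1) -> dict:
    """Build the keyword arguments for an EC2 launch request.

    All default values are illustrative assumptions, not
    recommendations from the Cloud BioLinux project.
    """
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
    }


def launch(params: dict, region: str = "us-east-1"):
    """Send the request to EC2 (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the sketch loads without it

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.run_instances(**params)


# Placeholder image ID: look up the real one before calling launch().
params = launch_request("ami-PLACEHOLDER")
print(params)
```

Separating the request from the call that sends it makes it possible to inspect the parameters before any billed server is started.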
23. Cloud computing research at JCVI
• open-source cloud
platforms, fully compatible
with Amazon EC2
• active funding, NIAID viral
genomics pipeline on cloud
• end-to-end, sequence to
assembly, annotation,
visualization via Galaxy
• run on Amazon, private
cloud, or desktop
24. Scriptable Cloud Infrastructures
[Figure: Fabric framework]
• Cloud BioLinux VM
configuration in plain text
• high-level configuration,
software groups
• each group lists individual
bioinformatics tools
25. Scriptable Cloud Infrastructures
• Python Fabric leverages
Linux packages (APTitude
repositories)
• mix and match software
from repositories
• share VM configuration as
source code
• clone across clouds
Krampis K, Booth T, Chapman B, Tiwari B, Bicak M, Field D, Nelson K
Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community.
BMC Bioinformatics. 2012 Mar 19; 13: 42.
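The idea of a plain-text configuration made of software groups, each resolved to Linux packages, can be sketched in a few lines of Python. This is an illustration only and does not reproduce the actual Cloud BioLinux configuration files or Fabric scripts: the group names, the package lists and both helper functions are hypothetical.

```python
import shlex

# Hypothetical software groups; the real Cloud BioLinux
# configuration defines its own groups and package names.
SOFTWARE_GROUPS = {
    "alignment": ["bwa", "bowtie"],
    "assembly": ["velvet"],
    "utilities": ["samtools"],
}


def expand_groups(selected, groups=SOFTWARE_GROUPS):
    """Resolve selected group names to a de-duplicated package list."""
    packages = []
    for name in selected:
        for pkg in groups[name]:
            if pkg not in packages:
                packages.append(pkg)
    return packages


def apt_install_command(packages):
    """Build the shell command that installs the packages via APT."""
    return "apt-get install -y " + " ".join(shlex.quote(p) for p in packages)


# Mix and match: pick two of the three groups.
cmd = apt_install_command(expand_groups(["alignment", "utilities"]))
print(cmd)
```

A deployment tool such as Fabric could then run the resulting command on the target VM, which is what makes the configuration shareable as source code and repeatable across clouds.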
26. Scalable Data Analysis
• Cloud BioLinux + CloudMan
• dual role : Master / Worker
• Cloud BioLinux VM has
CloudMan scripts that start
more copies of itself
• Grid Engine (SGE) cluster
• http://usecloudman.org/
Afgan, E., Chapman, B. et al. (2012). Using Cloud
Computing Infrastructure with CloudBioLinux, CloudMan,
and Galaxy. Current Protocols in Bioinformatics, 11-9.
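Once the master and workers form a Grid Engine cluster, work is usually spread across the workers as an array job. The sketch below generates such a job script; it is an illustration and not part of CloudMan. The job name, the chunk file names and the database name `mydb` are hypothetical, and the input is assumed to be already split into numbered chunks.

```python
def sge_array_script(job_name: str, n_tasks: int, command_template: str) -> str:
    """Return the text of a Grid Engine array-job script.

    Every occurrence of "{task}" in `command_template` is replaced by
    the shell variable that Grid Engine sets to the task index.
    """
    lines = [
        "#!/bin/bash",
        f"#$ -N {job_name}",      # job name
        "#$ -cwd",                # run in the submission directory
        f"#$ -t 1-{n_tasks}",     # array job: one task per chunk
        command_template.replace("{task}", "${SGE_TASK_ID}"),
    ]
    return "\n".join(lines) + "\n"


# Hypothetical workload: BLAST four query chunks against a local database.
script = sge_array_script(
    "blast_chunks",
    4,
    "blastn -query chunk_{task}.fa -db mydb -out chunk_{task}.out",
)
print(script)
```

Saving the text to a file and submitting it with `qsub` from the master node would let the scheduler hand each task to whichever worker is free.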
29. From sequencer to the cloud
credit:
basespace.illumina.com
30. Acknowledgments
• Cloud BioLinux community:
Brad Chapman, Enis Afgan, Tim
Booth, Mesude Bicak, Dawn Field
• JCVI collaborators: Alex Richter,
Ravi Sanka, Andrey Tovichgrechko,
Johannes Goll, Karen Nelson, Bill
Nierman, JCVI IT support.
• NIAID and for funding:
Maria Giovani, Punam Mathur
cloudbiolinux.org
groups.google.com/group/cloudbiolinux
tinyurl.com/cloudboot1
tinyurl.com/cloudboot2
kkrampis@jcvi.org
slideshare.com/agbiotec
Thank you !