Improvements in experimental technologies force biologists to face a deluge of data that requires relevant tools and sufficient resources to be analyzed. The cloud helps bioinformatics experts define virtual appliances with pre-installed tools and workflows, and helps scientists deploy them, on demand, on national research infrastructures.
Presented by Christophe Blanchet and Clément Gauthey at the EGI Community Forum in Manchester, UK in April 2013.
1. Providing Bioinformatics Services on Cloud
Christophe Blanchet, Clément Gauthey
Infrastructure Distributed for Biology - IDB
IDB-IBCP CNRS FR3302 - LYON - FRANCE
http://idee-b.ibcp.fr
EGI CF13, Manchester, 9 April 2013
IDB acknowledges co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552) and the French National Research Agency's Arpege Programme (ANR-10-SEGI-001)
2. Bioinformatics Today
• Biological data are big data
• 1512 online databases (NAR Database Issue 2013)
• Sanger Institute, UK, 5 PB
• Beijing Genomics Institute, China, 4 sites, 10 PB
➡ Big data in many places
• Analysing such data has become difficult
• Scale-up of the analyses: from gene/protein to complete genome/proteome, ...
• Many different tools in daily use
• That need to be combined in workflows
• Usual interfaces: portals, Web services, federation, ...
➡ Datacenters with ease of access/use
• Distributed resources
• Experimental platforms: NGS, imaging, ...
• Bioinformatics platforms
➡ Federation of datacenters
3. Sequencing Genomes
source: www.politigenomics.com/next-generation-sequencing-informatics
Complete genome sequencing has become a lab commodity with NGS (cheap and efficient)
source: www.genomesonline.org
4. Infrastructures in Biology
Many tools and web services to process and visualize large amounts of data
5. The scene
• Bioinformatics services providers
• Is it easy to deploy many (incompatible) tools?
• To connect them to public databases?
• To limit the transfer of huge data?
• To provide users with their own computing resources?
• With their own isolated storage?
• Scientists
• Is it easy to access/use these tools?
• To adapt them to your usage?
• To get your own or other tools deployed on a datacenter?
• To combine them?
• To get your own computing/storage resources?
6. IDB’s Cloud
• Cloud workbench for Biology
• 13 turnkey bioinformatics appliances (as of Apr. 2013)
• Running since Sept. 2011, open to the biology community
• Lyon, FRANCE
• Powered by
• StratusLab
• Compute nodes, block storage
• 900+ cores, 4+ TB RAM, 36 TB vdisks
• Mainly Intel Sandy Bridge servers with 32 cores and 128 GB RAM
• Bigmem servers with 64 cores and 768 GB RAM
• VMs from 1 to 64 cores, 512 MB to 760 GB RAM
• + OpenStack
• Object storage (Swift)
• 200+ TB redundant & scalable storage
8. Integrate Bioinformatics Tools in Cloud
Diagram: bioinformatics tools (BLAST, GOR4, FastA, SSearch, Abyss, ClustalW, Ray, BWA, PhyML) + Linux virtual machines (RedHat, CentOS, Debian, Ubuntu, Suse)
➡ Create new appliance
➡ Bioinformatics Marketplace: Sequence, Structure, NGS, Galaxy, ARIA, (…)
• Appliances are virtual machines
• Small: a few GB, easy to convert to most virtualization formats
• Installed and pre-configured with common bioinformatics tools
• e.g. BLAST, ClustalW, ARIA, MEME, HMMER, TopHat, BWA, SAMtools, etc.
13. Biological Data in Cloud
Diagram: architecture of the bioinformatics cloud
• Public data sources: UNIPROT, PDB, EMBL, PROSITE, genomes
• PaaS: bioinformatics tools (BLAST, Clustal, etc.) and portal
• IaaS: Master & Storage (VM ARIA) launches jobs over ssh on Workers (VM CNS)
• Shared file system (NFS) between the VMs
• Persistent user data on pdisk (iSCSI)
• User: upload your data and get your results with scp or http/S3
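The upload / run / download cycle named on this slide can be sketched as follows. This is only an illustration built on the standard `scp` and `ssh` clients: the host name, user name and remote directory are hypothetical placeholders (the slides do not give them), and the commands are only assembled here, not executed.

```python
from __future__ import annotations

import shlex
from pathlib import PurePosixPath

# Hypothetical values: the slides give no host name, user name or path.
MASTER = "user@master.example.org"
REMOTE_DIR = PurePosixPath("/data/user")  # assumed mount point of the persistent disk


def upload_cmd(local_file: str) -> list[str]:
    """Build the scp command that copies a local file to the master VM."""
    return ["scp", local_file, f"{MASTER}:{REMOTE_DIR}/"]


def run_cmd(tool_cmd: list[str]) -> list[str]:
    """Build the ssh command that runs a tool on the master VM, inside REMOTE_DIR."""
    remote = f"cd {shlex.quote(str(REMOTE_DIR))} && {shlex.join(tool_cmd)}"
    return ["ssh", MASTER, remote]


def download_cmd(remote_name: str, local_dir: str = ".") -> list[str]:
    """Build the scp command that fetches a result file from the master VM."""
    return ["scp", f"{MASTER}:{REMOTE_DIR / remote_name}", local_dir]


if __name__ == "__main__":
    steps = (
        upload_cmd("reads.fastq"),
        run_cmd(["wc", "-l", "reads.fastq"]),
        download_cmd("results.txt"),
    )
    for cmd in steps:
        # Pass each list to subprocess.run(cmd, check=True) to really execute it.
        print(shlex.join(cmd))
```

Keeping each command as a list of arguments avoids shell quoting problems on the local side; only the remote part of the `ssh` call is a single shell string.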
14. Example: ‘biocompute’ Appliance
• Use your own instance(s)
• With pre-installed standard bioinformatics tools
• BLAST, FastA, SSearch, HMM, ...
• ClustalW2, Clustal-Omega, Muscle, ...
• Bowtie(2), BWA, samtools, ...
• MEME, R, etc.
• Connected to public reference data
• Uniprot, EMBL, genomes, PDB, etc.
• Automatically shared with the VMs
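As an illustration of using a pre-installed tool against the shared reference data, here is a minimal sketch that assembles a protein BLAST search. It assumes the appliance ships the BLAST+ suite (the `blastp` program); the database path is a hypothetical placeholder, since the slides do not say where the reference data are mounted.

```python
from __future__ import annotations

import shutil
import subprocess

# Assumed location of the shared reference data; the real path is not given in the slides.
SHARED_DB = "/shared/db/uniprot/uniprot_sprot"


def blastp_cmd(query: str, out: str, db: str = SHARED_DB, threads: int = 4) -> list[str]:
    """Build a BLAST+ blastp command that writes tabular output (-outfmt 6)."""
    return [
        "blastp",
        "-query", query,
        "-db", db,
        "-out", out,
        "-outfmt", "6",
        "-evalue", "1e-5",
        "-num_threads", str(threads),
    ]


def run_if_available(cmd: list[str]) -> bool:
    """Run the command only when its program is on PATH; return True if it was run."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True
```

On an instance, `run_if_available(blastp_cmd("query.fasta", "hits.tsv"))` would launch the search; on a machine without BLAST+ it simply returns `False`.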
15. Example: Galaxy portal for NGS analyses
• Analyse NGS data
• The Galaxy portal is widely used in the community
• Connected to large public data: sequences and indexes
• Large user data (GBs)
• Preserve workflows and results (persistent storage)
16. Example: Proteomics
• Motivation
• Collaboration with a mass spectrometry platform
• Running out of space on their local resources
• Protein identification
• Experimental mass spectrometry data
• Reference databases: nr, Swiss-Prot
• Reference screening tools: OMSSA, X!Tandem
• User interface
• Remote display
• NX
• Reference GUIs
• SearchGUI
• PeptideShaker
source: PeptideShaker site
17. Conclusion
• Provide turnkey bioinformatics appliances
• Standard tools and pipelines
• Interoperability: ready to run on cloud
• Easier to transfer appliances than data (GB vs TB)
• Provide a cloud infrastructure tightly connected to existing bioinformatics infrastructure
• IDB’s public bioinformatics cloud
• Linked to public biological databases
• In collaboration with the French Bioinformatics Institute
• Make usage easier for scientists
• Usual bioinformatics gateways
• Persistent and large ubiquitous storage
• Web interface for cloud management
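The "GB vs TB" point in the conclusion can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a sustained 1 Gbit/s link, a 5 GB appliance (one reading of "a few GB") and a 1 TB dataset in decimal units; these figures are illustrative assumptions, not measurements from the talk, and protocol overhead is ignored.

```python
def transfer_seconds(size_gb: float, bandwidth_gbit_s: float) -> float:
    """Seconds needed to move size_gb gigabytes over a bandwidth_gbit_s Gbit/s link."""
    return size_gb * 8 / bandwidth_gbit_s  # 8 bits per byte


LINK_GBIT_S = 1.0    # assumed sustained throughput
APPLIANCE_GB = 5     # assumed size of a "few GB" appliance
DATASET_GB = 1000    # 1 TB of data, decimal units

appliance_s = transfer_seconds(APPLIANCE_GB, LINK_GBIT_S)  # 40.0 s
dataset_s = transfer_seconds(DATASET_GB, LINK_GBIT_S)      # 8000.0 s, about 2.2 hours

if __name__ == "__main__":
    print(f"appliance: {appliance_s:.0f} s")
    print(f"dataset:   {dataset_s:.0f} s ({dataset_s / 3600:.1f} h)")
    print(f"ratio:     {dataset_s / appliance_s:.0f}x")
```

Under these assumptions, moving the appliance to the data is about 200 times faster than moving the data to the tools, and the gap grows linearly with dataset size.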
18. Perspectives
• Define good practices to provide the academic community and industry with bioinformatics services!
• French Bioinformatics Institute - IFB
• Goals are to provide core bioinformatics resources to the national and international life science research community in key fields such as genomics, proteomics, systems biology, etc.
• Aims at building a national academic cloud devoted to bioinformatics, inspired by the model evaluated through IDB’s cloud.
• European ELIXIR infrastructure
• To build a sustainable European infrastructure for biological information, supporting life science research and its translation
• IFB will be the French representative in ELIXIR.
19. Acknowledgment
• StratusLab members
• Co-funding by the European Community's Seventh Framework Programme (INFSO-RI-261552) and by the French National Research Agency's Arpege Programme (ANR-10-SEGI-001).
Questions?
http://idee-b.ibcp.fr