William Spooner (Eagle) and Carl Chesal (Cycle) introduce the proof of concept provided by this consortium for Phase 2 of the Pistoia Alliance Sequence Services project. The presentation was delivered at the Pistoia Alliance Conference in Boston, MA, on April 24, 2012.
2.
[Diagram: "Open Innovation" at the centre, surrounded by four activities]
• Collaborate: Enterprise, Academia, Government, Foundations
• Explore: Work together to find a common purpose
• Nurture: Build trust, shared language
• Exploit: Turn ideas into tangible benefits
2/ ElasticAP, Pistoia Alliance Conference, Boston MA, 24th April 2012
3. The Requirements
FUNCTIONAL
• Login and workspace
• Manage users
• Manage data
– Upload private data
– Access public data
– Export
– Delete/archive
– Share
• Manage applications
– Upload scripts/pipelines
– Analyse data
– Monitor use/performance
NON-FUNCTIONAL
• Charging Model
• Service Support
• Operational Requirements
• Security Requirements
4. The Partnership
Cycle
Established: 2005
Domain: High performance computing
Employees: 18, 16 engineers
Location: Across USA/Canada
Sectors: Pharmaceutical, biotechnology, financial, computer gaming, engineering, academia.
Customers: North America, Europe
Partnerships: Schrodinger, VMWare, Canonical
Eagle
Established: 2008
Domain: Operational bioinformatics
Employees: 12, 9 engineers, pool of external consultants
Location: Cambridge, UK
Sectors: Pharmaceutical, biotechnology, agri-biotechnology, consumer goods, food, other life sciences.
Customers: North America, Europe, Asia
Partnerships: Amazon Web Services, Cognizant, European Bioinformatics Institute, …
5. The Platform
The platform for storage,
analysis and sharing of life
sciences data in the cloud
6. The Proposal
[Workflow diagram. Roles: CIO, Depositor, Bioinformatician, Biologist, Collaborator. Steps: Start, Upload, Stored data, ANALYSES (Manual process or Pipeline process), Stored data, Share, Stop.]
7. The Architecture
[Architecture diagram.]
• Customer: Depositor, Bioinformatician and Collaborator authenticate via Single Sign On (IdP, OpenAM, Shibboleth), with SAML token exchange
• Amazon EC2 Cloud: Web Gateway, reached over HTTPS
• Behind the gateway: SEEK web server (assets, MySQL DB) and CycleCloud web server (Condor), connected over HTTPS
• S3 Storage: data files, with Encrypt/Decrypt
• Customer Sandboxes on EC2/AMIs: Ensembl, BioLinux
8. The Present
[Figure labels: Collaborator, Depositor, Bioinformatician]
9. The 1000 Genomes
• A Deep Catalogue of Human Variation
– Freely available on AWS
– 1,700 Individuals
– 200 TB data
– 10,000s data files
– Almost no metadata!
• ElasticAP evaluating 1000 Genomes Project Pilot 2
– 20X resequencing
– 2 trios (6 individuals)
10. TRUP: Tumor RNA-seq Unified Pipeline
• Collaboration between
– Max Planck Institute for Molecular Genetics
– Bayer Pharma AG
• Identifies gene fusion events in tumor samples
• Involves both alignment and de novo assembly steps
• Pipeline is being implemented on ElasticAP
– Using public GEO datasets for
11. The PoC
FUNCTIONAL
• Login and workspace
• Load data
• Manage public data
• Load scripts and pipelines
• Analyse data
• Export data
• Archive data
• Manage applications
• Manage users
• Monitor use/performance
NON-FUNCTIONAL
• Charging Model
• Service Support
• Operational Requirements
• Security Requirements
KEY
• Fully implemented
• Partially implemented
• To-do list
12. The Prior Art
• Eagle have been building analysis pipelines and
hosting secure cloud apps for years.
• Cycle have been developing HPC solutions and
deploying them on the cloud for years
• We built this as a platform we could use
ourselves in order to carry on delivering what we
already do.
• But now the results are interactive, and everyone
can share and participate.
• The most common tasks won’t need to
involve us at all.
13. The Price
• AWS-style pay-as-you-go business model
– Free sign-up and account creation
– Tiered applications by the hour.
– Discounts for up-front reservation fee.
– Offline data import/export also available.
– Flat-rate data by the gigabyte-month.
– Backup data by the gigabyte-month.
– Monthly billing.
– Support contracts available.
• Customisation and new pipelines at Eagle/Cycle
standard consulting rates.
14. The Plan
• Early access to preferred partner customers in
July
– talk to us now if you’d like to be part of that.
• Full production in September with all partial/todo
items implemented.
• Increased number of public datasets.
• Increased range of applications and pipelines.
• User interface improvements based on feedback
from early access period.
15. The Potential
• Available as customisation projects:
– Conversions to other clouds.
– Conversions to run on in-house infrastructure.
• Truly secure and scalable R&D collaboration
environment.
– Applicable to all sciences, not just genomics.
Change the way
you do science
Hello and welcome. I’m Will Spooner, CTO at Eagle Genomics. I’m proud to be here today to present the Eagle/Cycle partnership’s proof-of-concept solution to the Pistoia Sequence Services Phase 2 RFP. If you want to see the system in action and see how it looks, come to our demo later.
Firstly, I would like to thank the Pistoia Alliance, not just for backing our project, but for being a driving force behind open innovation in the life sciences. I firmly believe that Pistoia’s approach to open innovation is, itself, truly innovative. I suggest that it is the focus on exploitation [translating ideas into deliverables] that really sets Pistoia apart. Initiatives like Sequence Services are a great example of this. We have been conceptually designing a platform for bioinformatics analysis for the past 4 years, but it would probably still be on the drawing board without the support and backing of the Pistoia Alliance.
So, what is Sequence Services? “It is a Vision for a platform to users; where they can solve scientific problems based on DNA/RNA sequence information tailored to the needs of the pharmaceutical industry.” The exact requirements for Phase 2 were set out in a 27-page “Request For Proposals” document. This is a fantastic synthesis of the various priorities of several different pharma companies, focused not just on managing NGS data and applications, but also on collaborating with research partners in a secure environment.
Eagle participated in Pistoia Sequence Services Phase 1, partnered with Cognizant. We built a cloud architecture to host Ensembl and other sequence analysis applications from scratch. However, Phase 2 was sufficiently broader in scope that we needed to build from an existing cloud architecture platform. Eagle had seen Cycle present at a conference and were impressed by their offering, which was perfect for this particular use case. The partnership between Eagle and Cycle was born in a meeting between Richard Holland (Eagle CBO) and Jason Stowe (Cycle CEO) in a Starbucks outside New York Penn Station. Humble beginnings often lead to great things! It brings together two companies with shared values:
• Young, dynamic, ambitious
• Dedicated to providing exceptional services
• Acknowledged domain experts
• Virtualisation/cloud evangelists
• Implementers of progressive business models, covering both:
– Software as a service
– Professional open-source
• Supporters of and contributors to open-source
• With highly satisfied international, multi-sector client bases
So, over the past 3 months, with backing from the Pistoia Alliance, we have built a platform capable of delivering pretty much all of the requirements set out in the RFP. I’m pleased to be able to introduce the Elastic Analysis Platform [ElasticAP]: the software as a service for storage, analysis and, importantly, sharing of life sciences data and applications on the cloud. The logo represents a ball of rubber bands. Why? Many parts make up the whole, but the whole is more than the sum of its parts. It is flexible and all parts are interchangeable. The whole thing is robust, secure and bounces back if you drop it.
From a bird’s-eye view, what we provide is pretty simple. The platform enables users to upload, store, analyse, and share not just data, but applications as well. For example, a sequencing vendor that the biologist has commissioned uploads data to the platform. This data is shared with the customer, who grants access to the bioinformatician. The bioinformatician analyses the data using either manual or automated workflows. The results of these analyses are shared with the biologist and subsequently, if appropriate, with collaborators, such as academic partners the biologist is working with.
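The sharing flow in this example can be sketched as a small access-grant model. This is an illustration only, not ElasticAP code: the names `Dataset`, `deposit`, `analyse`, `grant` and `can_read` are hypothetical, and the rules that the commissioning customer owns deposited data and that results inherit that owner are assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A stored data set with one owner and an explicit set of readers."""
    name: str
    owner: str
    readers: set = field(default_factory=set)

    def can_read(self, user: str) -> bool:
        # The owner can always read; everyone else needs an explicit grant.
        return user == self.owner or user in self.readers

    def grant(self, granted_by: str, user: str) -> None:
        # Assumption: only the owner may share a data set.
        if granted_by != self.owner:
            raise PermissionError(f"{granted_by} does not own {self.name}")
        self.readers.add(user)


def deposit(name: str, customer: str) -> Dataset:
    """A depositor (e.g. a sequencing vendor) uploads data for a customer.

    Assumption: the customer who commissioned the data becomes its owner.
    """
    return Dataset(name=name, owner=customer)


def analyse(source: Dataset, analyst: str) -> Dataset:
    """Produce a results data set; the analyst must be able to read the input.

    Assumption: results keep the owner of the input data, and the analyst
    who produced them can read them.
    """
    if not source.can_read(analyst):
        raise PermissionError(f"{analyst} cannot read {source.name}")
    return Dataset(name=f"{source.name}-results", owner=source.owner,
                   readers={analyst})


# Walk through the flow with made-up user names.
raw = deposit("run-001", customer="biologist")
raw.grant("biologist", "bioinformatician")
results = analyse(raw, "bioinformatician")
results.grant("biologist", "collaborator")
```

In this sketch the collaborator ends up able to read the results but not the raw data, which is one plausible reading of "if appropriate" in the flow above.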
Our approach to the implementation has been to adopt a loosely coupled architecture: select best-of-breed components, many of which are open source, and connect them using industry standards, protocols, and web services. Users interact with ElasticAP through a single point of entry embedded within a security gateway. The gateway supports customer-integrated single sign-on. Behind the gateway there is the Workspace for management of data and applications, from where data analysis tools can be launched and monitored. Other than the Workspace, the system is single tenancy, meaning each customer’s data and applications are contained in isolated sandboxes, increasing security.
We have spent a great deal of time and effort cementing the core infrastructure of ElasticAP, without which none of the rest of the platform can function correctly. Now that this work is complete, we are racing ahead with populating the system with data and applications. For public data sets, we have populated the system with sample Ensembl and UCSC reference genomes. For bioinformatics applications, we are packaging those developed for Sequence Services Phase 1, and have the Ensembl Genome Browser ready. For pipelines, we have BioLinux + Galaxy, which we expect to have available very shortly. Please come to the demo in this room after the session, where I will show some of the functionality.
On 29th March, Amazon announced the availability of the 1000 Genomes public dataset, the largest catalog of human genetic variation, on the cloud. This is one of the datasets targeted for inclusion by P-SS2. This is fortunate timing for ElasticAP. Rather than spending time importing all of the data into the system, we can focus on adding value by adding metadata to the AWS data files. This is facilitated by existing mappings from the ENA metadata to the industry-standard ‘Investigation Study Assay’ format supported by our underlying software. We are starting with a single 1000 Genomes study, 1000 Genomes Project Pilot 2: 20X whole-genome sequence coverage of 6 individuals (comprising 2 trios: father, mother, offspring) from the CEU and YRI populations. Trios are interesting: in addition to variant detection, they allow investigation of haplotype phasing and de novo mutation events. I will show you some of these data at the demo.
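To illustrate why trios help with de novo mutation events, here is a minimal sketch of a Mendelian consistency check at a single site. It assumes diploid genotypes given as pairs of allele strings; the function names are hypothetical and nothing here is taken from the 1000 Genomes or ElasticAP pipelines. A real analysis would also have to account for sequencing error and coverage, so an inconsistent site is only a candidate.

```python
from itertools import product


def mendelian_consistent(child, father, mother):
    """Return True if the child's genotype can be formed by taking
    one allele from the father and one allele from the mother."""
    return any(
        sorted((f, m)) == sorted(child)
        for f, m in product(father, mother)
    )


def candidate_de_novo_alleles(child, father, mother):
    """Alleles present in the child but in neither parent."""
    return set(child) - set(father) - set(mother)


# Made-up genotypes at one site, for illustration only.
father = ("A", "A")
mother = ("A", "G")

inherited_child = ("A", "G")   # explainable by inheritance
unexpected_child = ("A", "T")  # "T" is seen in neither parent

inherited_ok = mendelian_consistent(inherited_child, father, mother)
unexpected_ok = mendelian_consistent(unexpected_child, father, mother)
novel = candidate_de_novo_alleles(unexpected_child, father, mother)
```

Here `inherited_ok` is true, `unexpected_ok` is false, and `novel` contains only "T", flagging that site for closer inspection.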
This is a good real-world example: a current collaboration between a pharma and a university. It is a complex pipeline that involves both alignment and de novo assembly steps. We are using this as a case study for ElasticAP, using a similar, public dataset from GEO (another of the SS2-requested datasets) as a proxy for the real in-house data set.
The Pistoia RFP was huge; it was not possible to implement every single part of it from the word go. Security is the first and most important part, and we’ve definitely finished and tested that part rigorously. Data is safe. Work is underway to implement the partial and to-do items above.
It’s worth noting that we’re improving on existing technology here, not building something brand new. We’re taking components we’ve used before and combining them with cutting-edge third-party open-source components. We add to this Cycle’s unique insight into cloud scalability and security. Cycle made a splash recently with their 50,000 node cluster; see yesterday’s NYT article. Now we can use the platform internally to develop customer projects, freeing us from building new infrastructural support for each new project and increasing our efficiency. The same goes for end-users: they no longer have to build their own infrastructure or wait for us to build ours for them; they can just log in and use it. See our brochures or websites for details of the kinds of things we’ve been involved in and will use this platform for in future.
Pricing follows an AWS-style model. Sign-up is free. Data import is free. And you only pay for the storage/compute that you use. We support either up-front payment or monthly billing. SLA-backed support contracts are available. And customisations are, of course, available at standard consulting rates.
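As a rough illustration of how a pay-as-you-go bill of this shape adds up, the sketch below sums hourly compute with per-gigabyte-month storage and backup. The rates are invented placeholders, not ElasticAP prices, and the names `Rates` and `monthly_bill` are hypothetical; tiers, reservation discounts and support contracts are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Unit prices; all values used below are made-up placeholders."""
    compute_per_hour: float
    storage_per_gb_month: float
    backup_per_gb_month: float


def monthly_bill(rates, compute_hours, stored_gb, backed_up_gb=0.0):
    """Usage-based bill for one month: pay only for what was used."""
    compute = compute_hours * rates.compute_per_hour
    storage = stored_gb * rates.storage_per_gb_month
    backup = backed_up_gb * rates.backup_per_gb_month
    return round(compute + storage + backup, 2)


# Placeholder rates for illustration only.
EXAMPLE_RATES = Rates(
    compute_per_hour=0.50,
    storage_per_gb_month=0.10,
    backup_per_gb_month=0.05,
)

# 40 compute hours, 500 GB stored and backed up: 20 + 50 + 25.
bill = monthly_bill(EXAMPLE_RATES, compute_hours=40, stored_gb=500,
                    backed_up_gb=500)
idle_bill = monthly_bill(EXAMPLE_RATES, compute_hours=0, stored_gb=0)
```

With these placeholder rates a month with no usage costs nothing, which matches the free sign-up described above.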
Early access for preferred partners. Early access to test and try out features. Stress-test the system and give feedback. In the longer term we can expand outside sequence data into other R&D fields. Security will always be at the heart of it.
Whether you subscribe to ElasticAP or look elsewhere, cloud SaaS is the future for data analysis, especially where collaboration is concerned. We are not intrinsically tied to any flavor of cloud, or even tied to the cloud at all. We strongly believe we have the nucleus of a truly secure and scalable R&D collaboration environment that can be extended beyond genomics to general R&D applications.
If you have any other questions, look out for me or David Flanders [TBC] from Eagle, or Carl Chesal from Cycle. I will also be attending Bio-IT World this week if you’d like to catch up with me there. In case you don’t get the chance to talk to me, we’ve got product brochures in the demo room for you to take away. Thanks for listening, see you at the demo.