The case for cloud computing in Life Sciences
1. The case for cloud computing in the life sciences
Ola Spjuth <ola.spjuth@farmbio.uu.se>
Department of Pharmaceutical Biosciences
and Science for Life Laboratory
Uppsala University
2. About me
• Ola Spjuth, Docent
• Associate Professor at Uppsala University
– Data-intensive and translational bioinformatics (http://pharmb.io)
• Head of Bioinformatics Compute and Storage facility at SciLifeLab
– Responsible for managing resources
– Strategic e-infra planning and procurement for SciLifeLab
• Deputy Director at SNIC-UPPMAX HPC center
• Guest Researcher at Karolinska Institutet
– e-Science for Cancer Prevention and Control (eCPC), flagship
project at SeRC
5. Today: We have access to high-throughput technologies to study biological phenomena
6. Science for Life Laboratory
An internationally leading center that develops and applies large-scale technologies for molecular biosciences with a focus on health and environment.
Became a national platform in 2013
Stockholm node
Uppsala node
7. 2017: Human whole genome sequenced in 3 days for ~$1100
Massively parallel sequencing…
…requires supercomputers for analysis and storage
2017: Illumina HiSeq X systems: 15K whole human genomes per year
2016: NGI data velocity 950 Mbp/hour = 16 Mbp/s
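A quick unit check on the data-velocity figure above, as a minimal sketch. The two numbers agree only if the first rate is read per minute rather than per hour; which unit was intended cannot be settled from the slide text, so both readings are computed. The function and variable names are illustrative.

```python
def mbp_per_second(rate_mbp, seconds_per_unit):
    """Convert a rate in Mbp per time unit to Mbp per second."""
    return rate_mbp / seconds_per_unit

RATE_MBP = 950  # figure quoted on the slide

# Reading 1: the rate is per hour, as the slide text states.
per_hour_reading = mbp_per_second(RATE_MBP, 3600)
# Reading 2 (assumption): the rate is per minute.
per_minute_reading = mbp_per_second(RATE_MBP, 60)

print(f"950 Mbp/hour   = {per_hour_reading:.2f} Mbp/s")
print(f"950 Mbp/minute = {per_minute_reading:.1f} Mbp/s")
```

Only the per-minute reading rounds to the 16 Mbp/s quoted on the slide.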
10. Some statistics
Storage usage
Projects at SNIC-UPPMAX
Data-intensive bioinformatics
Other disciplines
Support tickets
11. New challenges: Data management and analysis
• Storage
• Analysis methods, pipelines
• Scaling
• Automation
• Data integration, security
• Predictions
• …
12. Why cloud in the life sciences?
• Access to resources
– Flexible configurations
– On-demand
– Cost-efficient?
• Collaborate at an international level
– Publish/federate data
– E.g. large sequencing initiatives, “move compute to the data”
• New types of analysis environments
– Hadoop/Spark/Flink etc.
– Microservices, Docker, Kubernetes, Mesos
13. Challenges with cloud
• Tradition: Strong HPC tradition in academia
– Existing resources funded by the Research Council, with personnel at six centers in Sweden (SNIC)
• Economy: Cost model is new
– Difficult to assess the costs
• Legal: Working with sensitive data
• Educational: New technology for many
14. Needs in bioinformatics
• Primarily resources with a lot of RAM and storage (high I/O)
• Preferably a transparent system; users don’t want to deal with e-infrastructure at all
• How to work with storage (tiered?)
15. My research focus
• e-infrastructure development
– Automation, Big Data
• e-Science methods development
– Prediction models, machine learning
• Applied e-Science research
– Drug discovery and individualized diagnostics
16. Selected research in my group
• Privacy preservation
• Workflows
• Big Data frameworks
• Data management and predictive modeling
• Data federation
• Compute federation
20. PhenoMeNal approach and stack
• Enable users to deploy their own virtual infrastructure on an IaaS provider
• Containerize tools, orchestrate with workflow systems on top of Kubernetes
Tools shown: KubeNow, Cloudflare, kubeadm, Terraform, kubectl, Packer
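To make the idea of running containerized tools on Kubernetes concrete, here is a minimal sketch of how one tool could be described as a Kubernetes Job. It only builds the manifest and prints it as JSON; the job name, image name, and command are hypothetical placeholders, and nothing here is taken from the PhenoMeNal or KubeNow code.

```python
import json

def make_job_manifest(name, image, command):
    """Build a Kubernetes batch/v1 Job manifest for a single container."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,  # retry a failed pod at most twice
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                }
            },
        },
    }

# All values below are hypothetical placeholders.
manifest = make_job_manifest(
    name="example-tool",
    image="example.org/tools/example-tool:1.0",
    command=["example-tool", "--input", "/data/in", "--output", "/data/out"],
)
print(json.dumps(manifest, indent=2))
```

Saved to a file, such a manifest could be submitted with `kubectl apply -f job.json`. A workflow system on top of Kubernetes would typically generate and submit one such object per workflow step.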
21. Hierarchical Analysis of Temporal and Spatial Image Data
Carolina Wählby
PI, PhD, Professor in Quantitative Microscopy
Andreas Hellander
Co-PI, Associate Professor
Ola Spjuth
Co-PI, Associate Professor
www.cb.uu.se/~carolina/HATSID.html
22. Presenting at Spark Summit 2017:
“EasyMapReduce: Leverage the Power of Spark and Docker to Scale Scientific Tools in MapReduce Fashion”
https://spark-summit.org/east-2017/events/easymapreduce-leverage-the-power-of-spark-and-docker-to-scale-scientific-tools-in-mapreduce-fashion/
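The MapReduce pattern named in the talk title can be sketched in plain Python. This sketch does not use Spark, Docker, or the EasyMapReduce API; the `gc_count` function is a hypothetical stand-in for a containerized scientific tool applied to each data partition, and the input data is invented for illustration.

```python
from functools import reduce

def gc_count(sequence):
    """Map step: count G and C bases in one partition (stand-in for a tool)."""
    return sum(1 for base in sequence if base in "GC")

def add(a, b):
    """Reduce step: combine two partial results."""
    return a + b

# Toy input split into partitions, invented for illustration.
partitions = ["ACGT", "GGCC", "ATAT"]

# Each partition is processed independently of the others.
partial_counts = list(map(gc_count, partitions))
# The partial results are then merged into a single value.
total = reduce(add, partial_counts, 0)
print(partial_counts, total)
```

Because the map step touches each partition independently, a framework can run it in parallel across machines; only the reduce step needs the partial results brought together.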
23. Our most recent scientific publication
http://jcheminf.springeropen.com/articles/10.1186/s13321-017-0204-4
24. European Open Science Cloud (EOSC)
• The vast majority of all data in the world (in fact up to 90%) has been
generated in the last two years.
• Scientific data is in dire need of openness, better handling, careful management, machine actionability and sheer re-use.
• European Open Science Cloud: A vision of a future infrastructure to
support Open Research Data and Open Science in Europe
– It should enable trusted access to services, systems and the re-use
of shared scientific data across disciplinary, social and geographical
borders
– Research data should be findable, accessible, interoperable and re-usable (FAIR)
– It should provide the means to analyze datasets of huge sizes
http://ec.europa.eu/research/openscience/index.cfm?pg=open-science-cloud
Editor's notes
Strategic funding to enable:
Infrastructure for high-throughput analysis
Multi-disciplinary research environment
Competence in technology and analysis methodology
Access to computers (many if you need)
Access to storage (a lot if you need)
Pre-installed software and reference genomes
Free
How can we improve efficiency on shared HPC for data-intensive bioinformatics?
Can Cloud Computing and Big Data frameworks aid data-intensive research?
How useful are Scientific Workflows in data-intensive research?
Can predictive modeling aid data acquisition, storage and analysis?
How can we continuously improve predictive models as data changes?