Some thoughts on (1) why genomics bioinformaticians need hardware that differs from what traditional HPC providers offer, and (2) why it's challenging to get it.
With input from @bmpvieira, @yeban, @gawbul.
Video: https://www.youtube.com/watch?v=mmMQw2gIozI
2. Example Genomics Tasks
Task | Repetitiveness | “Disk” Input/Output | Memory | Duration per task
--- | --- | --- | --- | ---
Build 10,000 trees | 10,000x | low | low | short
Trim FASTQ files | 40-400x | high | low | short
One de novo genome assembly | 1 | high | high | long
Many de novo genome assemblies | 20-1000x | high | high | long
Determine which of 10 new tools that promise X can actually do X (once): “genome hacking” | 1 | depends | depends | depends
3. Traditional High Performance Computing (HPC)
• Physics? Astronomy? Maths? Chemistry?
• Traditional HPC infrastructures are great at small tasks:
Task | Repetitiveness | “Disk” Input/Output | Memory | Duration per task
--- | --- | --- | --- | ---
Build 10,000 trees | 10,000x | low | low | short
• And/or have mechanisms/tools that transform their challenges into many small tasks.
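As a rough illustration of turning one big job into many small tasks, the sketch below splits 10,000 independent tree-building jobs into fixed-size batches, one batch per scheduler task. The function name `make_batches` and the batch size of 100 are assumptions for illustration; they are not from the talk.

```python
def make_batches(items, batch_size):
    """Split a list of work items into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical workload: 10,000 independent tree-building jobs, identified by index.
tree_jobs = list(range(10_000))
batches = make_batches(tree_jobs, batch_size=100)
print(len(batches), "batches of up to", len(batches[0]), "jobs each")
```

Each batch could then be submitted as one queue job. This only works when the jobs are independent of each other, which is what makes such workloads a good fit for traditional HPC.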
4. “We have 9999 cores!” - central IT admin
but those cores are inadequate for many genomics tasks
5. Big Ass Servers
• e.g.: 1.5 TB RAM; 48 cores - SSH into it and do whatever you want.
Task | Repetitiveness | “Disk” Input/Output | Memory | Duration per task
--- | --- | --- | --- | ---
Build 10,000 trees | 10,000x | low | low | short
Trim FASTQ files | 40-400x | high | low | short
One de novo genome assembly | 1 | high | high | long
Many de novo genome assemblies | 20-1000x | high | high | long
Determine which of 10 new tools that promise X can actually do X | 1 | depends | depends | depends
Jeremy Leipzig
6. Additional challenges for biologists
• Datasets continue growing fast!
• Generally:
  • We lack computational training.
  • Bioinformatics tools suck (badly written, badly tested, hard to install).
9. So what do we need?
• access to machines of all shapes and sizes
• big and small machines
• direct access via ssh (for hacking & doing things few times)
• indirect access via queue (for doing things many times)
• fast I/O - cheap archival.
• single login; all files “feel” like they’re in one place
• easily changeable software & OS versions
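The split between direct access (things done a few times) and queue access (things done many times) can be sketched as a simple rule. The `Task` class, the `choose_access` function, and the cut-off of 20 repeats are all assumptions made for illustration; the repeat counts are taken from the task table earlier in the talk.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    repeats: int  # how many times the task will be run


def choose_access(task, queue_threshold=20):
    """Return "queue" for tasks run many times, "ssh" for tasks run a few times.

    queue_threshold is an assumed cut-off for illustration only.
    """
    return "queue" if task.repeats >= queue_threshold else "ssh"


tasks = [
    Task("Build 10,000 trees", 10_000),
    Task("Trim FASTQ files", 40),
    Task("One de novo genome assembly", 1),
    Task("Genome hacking", 1),
]
for task in tasks:
    print(f"{task.name}: {choose_access(task)}")
```

In practice the choice also depends on memory, I/O, and duration, so a single threshold is only a starting point.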
11. Easily changeable OS & software versions
https://www.docker.io
>docker-switch bio-linux7
# do stuff
>docker-switch pacbio-assembly-vm
# do other stuff
>docker-switch antlab-ubuntu
# do more stuff
FAKE (docker-switch is a wished-for command; it does not exist in Docker)
@bmpvieira
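Although `docker-switch` is fake, something in the same spirit can be approximated with the real `docker run` command. The sketch below only builds the command line (it does not execute it), using the existing `--rm`, `-it`, `-v`, and `-w` options. The helper name `docker_shell_command` and the `/work` mount point are assumptions, as is the availability of images with the names shown on the slide and of `bash` inside them.

```python
import os
import shlex


def docker_shell_command(image, workdir=None):
    """Build a `docker run` command that opens an interactive shell in `image`.

    The working directory is mounted at /work inside the container (an assumed
    mount point), so files stay visible when switching environments.
    """
    workdir = os.path.abspath(workdir or os.getcwd())
    return [
        "docker", "run",
        "--rm",                    # remove the container on exit
        "-it",                     # interactive terminal
        "-v", f"{workdir}:/work",  # share the host directory
        "-w", "/work",             # start in the shared directory
        image,
        "bash",                    # assumes bash is installed in the image
    ]


# Image names are taken from the slide; whether such images exist is an assumption.
for image in ["bio-linux7", "pacbio-assembly-vm", "antlab-ubuntu"]:
    print(shlex.join(docker_shell_command(image, workdir="/tmp")))
```

Passing the resulting list to `subprocess.run` would start the container, provided Docker is installed and the image can be found.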
15. What if Apple/Google made an idiot-proof cloud computing system for genomics?
• Always on - single place to connect to:
ssh mylab.awskiller.co.uk
• Dropbox-like shared directories & file checksumming.
• Easily switchable OS version / “VM”.
• Automagically & transparently migrates:
  • from small to huge machines (and back) as CPU and RAM demands change.
  • from one physical site (huge dataset) to another
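The file checksumming wished for above can be sketched with the standard library. This is a minimal example, assuming SHA-256 as the checksum (the talk does not name an algorithm); the function names `file_sha256` and `same_content` are made up for illustration.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large sequencing files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Return True if both files have identical SHA-256 digests."""
    return file_sha256(path_a) == file_sha256(path_b)
```

Comparing digests before and after a transfer between sites is one way to confirm that a large dataset arrived intact.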
16. Summary
• Broad range of needs:
  • some similar to traditional HPC.
  • some very different!
• Users are naive.
• Tools are experimental.
• Datasets are experimental.
• IT people have difficulty understanding this.
  • Do not trust them when they say things will just work!
• A lot of potential to make things not suck.
17. Evolutionary Genetics group & Queen Mary U London
Bruno Vieira - @bmpvieira
Steve Moss - @gawbul
Anurag Priyam - @yeban
Richard Christie & ITS Research Support team @ Queen Mary U London
Ioannis Xenarios & Vital-IT team @ Swiss Institute of Bioinformatics
http://yannick.poulet.org
y.wurm@qmul.ac.uk