Why computing for genomics research sucks.
y.wurm@qmul.ac.uk
BaltiBio 2014-05-27
Example Genomics Tasks
| Task | Repetitiveness | “Disk” Input/Output | Memory | Duration per task |
|---|---|---|---|---|
| Build 10,000 trees | 10,000x | low | low | short |
| Trim FASTQ files | 40-400x | high | low | short |
| One de novo genome assembly | 1 | high | high | long |
| Many de novo genome assemblies | 20-1000x | high | high | long |
| Determine which of 10 new tools that promise X can actually do X (once). “genome hacking” | 1 | depends | depends | depends |
Traditional High Performance Computing (HPC)
• Physics? Astronomy? Maths? Chemistry?
• Traditional HPC infrastructures are great at small tasks:
| Task | Repetitiveness | “Disk” Input/Output | Memory | Duration per task |
|---|---|---|---|---|
| Build 10,000 trees | 10,000x | low | low | short |
• And/or these fields have mechanisms/tools that transform their challenges into many small tasks.
“We have 9999 cores!” - central IT admin
but they are inadequate
Big Ass Servers
• e.g.: 1.5 TB RAM; 48 cores - SSH into it and do whatever you want.
| Task | Repetitiveness | “Disk” Input/Output | Memory | Duration per task |
|---|---|---|---|---|
| Build 10,000 trees | 10,000x | low | low | short |
| Trim FASTQ files | 40-400x | high | low | short |
| One de novo genome assembly | 1 | high | high | long |
| Many de novo genome assemblies | 20-1000x | high | high | long |
| Determine which of 10 new tools that promise X can actually do X | 1 | depends | depends | depends |
Jeremy Leipzig
Additional challenges for biologists
• Datasets continue growing fast!
• Generally:
  • We lack computational training.
  • Bioinformatics tools suck (badly written, badly tested, hard to install).
So what do we need?
• access to machines of all shapes and sizes
• big and small machines
• direct access via ssh (for hacking & doing things few times)
• indirect access via queue (for doing things many times)
• fast I/O - cheap archival.
• single login: all files “feel” like they’re in one place
Swiss Institute of Bioinformatics: Vital-IT
So what do we need?
• access to machines of all shapes and sizes
• big and small machines
• direct access via ssh (for hacking & doing things few times)
• indirect access via queue (for doing things many times)
• fast I/O - cheap archival.
• single login; all files “feel” like they’re in one place
• easily changeable software & OS versions
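The split between direct and queued access in the list above can be sketched in a few lines of Python. This is only an illustration: the `Task` class, the function names and the threshold of 10 runs are assumptions made for the example, not something taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    repetitions: int  # how many times the task is run
    memory: str       # "low" or "high", as in the task table


def suggest_access(task: Task, many: int = 10) -> str:
    """Queue for things done many times, ssh for things done a few times.

    The cutoff `many` is an arbitrary assumption for this sketch.
    """
    return "queue" if task.repetitions >= many else "ssh"


def suggest_machine(task: Task) -> str:
    """High-memory tasks go to a big machine, the rest to small ones."""
    return "big" if task.memory == "high" else "small"


tasks = [
    Task("Build 10,000 trees", 10_000, "low"),
    Task("Trim FASTQ files", 40, "low"),
    Task("One de novo genome assembly", 1, "high"),
]

for t in tasks:
    print(f"{t.name}: {suggest_access(t)} on a {suggest_machine(t)} machine")
```

A real setup would also have to weigh disk I/O and run time; the point is only that one site needs to offer both routes.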
Easily changeable OS & software versions
https://www.docker.io
FAKE (docker-switch is a mock-up, not a real command):
> docker-switch bio-linux7
# do stuff
> docker-switch pacbio-assembly-vm
# do other stuff
> docker-switch antlab-ubuntu
# do more stuff
@bmpvieira
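Something close to the mock-up can be approximated with the real `docker run` command. The Python sketch below only builds the command line; the function name `docker_switch_command` is hypothetical, and `bio-linux7` is the image name from the slide, assumed here to exist locally and to ship `bash`.

```python
import shlex


def docker_switch_command(image: str, workdir: str) -> list[str]:
    """Build a `docker run` command that opens a shell in `image`
    with `workdir` mounted at the same path inside the container."""
    return [
        "docker", "run",
        "--rm",                        # remove the container on exit
        "-it",                         # interactive terminal
        "-v", f"{workdir}:{workdir}",  # share the working directory
        "-w", workdir,                 # start in that directory
        image,
        "bash",                        # assumes the image ships bash
    ]


# Hypothetical image name and path, for illustration only.
cmd = docker_switch_command("bio-linux7", "/data/project")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` would start the container. Mounting the working directory at the same path is one way to make files "feel" like they stay in one place while the OS changes.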
2014 05-27 - Opinion: Computing for genomics sucks.
What if Apple/Google made an idiot-proof cloud computing system for genomics?
• Always on - single place to connect to: ssh mylab.awskiller.co.uk
• Dropbox-like shared directories & file checksumming.
• Easily switchable OS version / “VM”.
• Automagically & transparently migrates:
  • from small to huge machines (and back) as CPU and RAM demands change.
  • from one physical site (huge dataset) to another
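The file checksumming in the wish list above could look roughly like the following Python sketch. It assumes SHA-256 as the checksum and reads files in chunks so that large sequencing files need not fit in memory; the function names are made up for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(a: Path, b: Path) -> bool:
    """True if both files have identical content (by checksum)."""
    return sha256_of(a) == sha256_of(b)
```

Comparing the digest computed before a transfer with the one computed after it is a cheap way to detect a truncated or corrupted copy when data moves between machines or sites.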
Summary
• Broad range of needs:
  • some similar to traditional HPC.
  • some very different!
• Users are naive.
• Tools are experimental.
• Datasets are experimental.
• IT people have difficulty understanding this.
• Do not trust them when they say things will just work!
• A lot of potential to make things not suck.
Evolutionary Genetics group & Queen Mary U London
Bruno Vieira - @bmpvieira
Steve Moss - @gawbul
Anurag Priyam - @yeban
Richard Christie & ITS Research Support team @ Queen Mary U London
Ioannis Xenarios & Vital-IT team @ Swiss Institute of Bioinformatics
http://yannick.poulet.org
y.wurm@qmul.ac.uk
