Better science through superior software


Presentation given to the BEACON 2013 Congress during the "Collaborating with Industry" sandbox. Original with slide notes at: https://docs.google.com/presentation/d/1mmvD0R3fLIl11TmFHij5fGcMDb9qJxy_nwENO2Rt-YI/edit?usp=sharing


Better science through
superior software
Michael R. Crusoe
Software Engineer & Bioinformatician
The GED Lab @ Michigan State
mcrusoe@msu.edu @biocrusoe

Open, online science
Much of the software and approaches talked
about today are available:
khmer software:
http://github.com/ged-lab/khmer/
Titus’s blog: http://ivory.idyll.org/blog/
Titus’s twitter: @ctitusbrown

Overview
● Next-gen sequencing data deluge
● ♫How do you solve a problem like big data?♫
● Impact of khmer software
● Future work
● Being a good F/OSS community member and
leading by example
● Acknowledgements

Problem
“The power of next-gen. sequencing: get 180x
coverage... and then watch your assemblies
never finish” - Erich Schwarz

“Three types of data scientists.”
(Bob Grossman, U. Chicago, at XLDB 2012)
1. Your data gathering rate is slower than
Moore’s Law.
2. Your data gathering rate matches Moore’s
Law.
3. Your data gathering rate exceeds Moore’s
Law.

“Three types of data scientists.”
1. Your data gathering rate is slower than Moore’s
Law.
=> Be lazy, all will work out.
2. Your data gathering rate matches Moore’s
Law.
=> You need to write good software, but all will
work out.
3. Your data gathering rate exceeds Moore’s Law.
=> You need serious help.

A software & algorithms approach: can we
develop lossy compression approaches that
1. Reduce data size & remove errors => efficient
processing?
2. Retain all “information”? (think JPEG)
If so, then we can store only the compressed
data for later reanalysis. Short answer is: yes,
we can.

Digital normalization approach
A digital analog to cDNA library normalization,
diginorm:
● Is reference-free;
● Is single-pass: looks at each read only once;
● Does not “collect” the majority of errors;
● Keeps all low-coverage reads & retains all
information.
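
The slide lists diginorm's properties but not its mechanics. The following is a minimal sketch of one way to get those properties, assuming a read's coverage is estimated as the median count of its k-mers among the reads kept so far, and that a read is discarded once that estimate reaches a cutoff. The names `kmers` and `normalize`, the values of `k` and `cutoff`, and the exact `Counter` are illustrative assumptions, not khmer's API.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer of a read."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def normalize(reads, k=4, cutoff=2):
    """Single-pass, reference-free filter over an iterable of reads.

    A read is kept only while its estimated coverage (the median count of
    its k-mers among the reads kept so far) is below `cutoff`.
    """
    counts = Counter()  # exact counts (assumption); a real tool would need something more compact
    kept = []
    for read in reads:  # each read is examined exactly once
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate possible
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads contribute to the counts
    return kept


# Toy input: one sequence seen five times, plus one low-coverage read.
reads = ["ACGTACGGTA"] * 5 + ["TTGACCATGC"]
kept = normalize(reads, k=4, cutoff=2)
print(len(kept), "of", len(reads), "reads kept")  # prints "3 of 6 reads kept"
```

In this toy run the redundant copies are dropped once the cutoff is reached, while the single low-coverage read is always retained. Because discarded reads never update the counts, whatever errors they carry are not accumulated either.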

GED Lab’s approach: khmer
diginorm: ejects most data while retaining the
information content.
partitioning: split transcriptomic and
meta{transcript,gen}omic datasets
fast k-mer counting: for better preprocessing,
repeat detection, and sequencing coverage
estimates
Reference-free variant calling
- More to come -
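
The slide does not say how the fast k-mer counting is done. As an illustration only, one common fixed-memory approach is a Count-Min Sketch, which may overcount but never undercounts. The class name, table sizes, and SHA-256-based hashing below are assumptions chosen for a short example and are not taken from khmer.

```python
import hashlib


class CountMinSketch:
    """Approximate counter with fixed memory; estimates are never too low."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One hash per row, derived by prefixing the item with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def count(self, item):
        # The minimum across rows is the estimate least inflated by collisions.
        return min(self.tables[row][col] for row, col in self._slots(item))


def count_kmers(reads, k, sketch):
    """Add every overlapping k-mer of every read to the sketch."""
    for read in reads:
        for i in range(len(read) - k + 1):
            sketch.add(read[i:i + k])


sketch = CountMinSketch()
count_kmers(["ACGTACGT", "ACGTTTTT"], k=4, sketch=sketch)
print(sketch.count("ACGT"))  # at least 3, the true number of occurrences
```

High estimates for a k-mer point to repeats or deep coverage, and the estimates along a read can feed a coverage estimate such as the one diginorm needs.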

The GED Lab at MSU:
Theoretical => applied solutions.

Impact
● Any biologist can use our tools on a rented
cloud computer, cheaply.
● Overcome complexity: Erich Schwarz
assembled H. contortus when it was
previously not possible.
● Overcome data excess: 5.1 billion reads from
50 different sea lamprey tissues -> diginorm
technique removed 98.7% of them as
redundant.
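
To put the sea lamprey figures in absolute terms, here is the arithmetic worked through. Both inputs come from the slide; since the 98.7% is presumably rounded, the result is approximate.

```python
total_reads = 5_100_000_000  # 5.1 billion reads (from the slide)
removed_per_mille = 987      # 98.7% removed as redundant (from the slide)

# Integer arithmetic avoids floating-point rounding on the retained count.
retained_reads = total_reads * (1000 - removed_per_mille) // 1000
reduction_factor = total_reads / retained_reads

print(f"{retained_reads:,} reads retained")            # prints "66,300,000 reads retained"
print(f"about {reduction_factor:.0f}x less data")      # prints "about 77x less data"
```

So roughly 66 million reads remain, about a 77-fold reduction in what downstream assembly has to process.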

Future work
● targeted-gene assembly from short reads
(Fish et al., Ribosomal Database Project)
● rRNA search in shotgun data
● error-correction for mRNAseq &
metagenomic data
● strain variation collapse, assembly, and
recovery
● Goal: make most assembly easy and all
evaluation easy.

Interactions
khmer both builds upon existing Free and
Open-Source Software (F/OSS) and is itself
made under an open-source license.
Used in curricula: Software Carpentry and
ANGUS-based courses, and the MSU NGS
summer course.

● BIG DATA grant reviewers specifically
mentioned the GED Lab’s “[...] long and
successful track-record and experience in
following rigorous but open software
development processes.” -> CTB received
3-year NIH R01 support
● Transparent and public software
development yielded participation from
others.

Personal Acknowledgments
C. Titus Brown for slides, employment

Acknowledgements
Lab members involved
● Adina Howe (w/Tiedje)
● Jason Pell
● Arend Hintze
● Rosangela Canino-Koning
● Qingpeng Zhang
● Elijah Lowe
● Likit Preeyanon
● Jiarong Guo
● Tim Brom
● Kanchan Pavangadkar
● Eric McDonald
● Chris Welcher
Collaborators
● Jim Tiedje, MSU
● Billie Swalla, UW
● Janet Jansson, LBNL
● Susannah Tringe, JGI
Funding
USDA NIFA; NSF IOS; BEACON.

