1. Data-intensive approaches to
investigating non-model
organisms
C. Titus Brown
ctb@msu.edu
Assistant Professor
Microbiology and Molecular Genetics; Computer Science and Engineering;
BEACON; Quantitative Biology Initiative
2. Outline
• My research!
• Opportunities for computational science training
• More unsolicited advice
3. Acknowledgements
Lab members involved
• Adina Howe (w/Tiedje)
• Jason Pell
• Arend Hintze
• Rosangela Canino-Koning
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
Collaborators
• Jim Tiedje, MSU
• Erich Schwarz, Caltech / Cornell
• Paul Sternberg, Caltech
• Robin Gasser, U. Melbourne
• Weiming Li
• Hans Cheng
Funding
USDA NIFA; NSF IOS;
BEACON; NIH.
4. My interests
I work primarily on organisms of agricultural, evolutionary, or
ecological importance, which tend to have poor reference
genomes and transcriptomes. Focus on:
• Improving assembly sensitivity to better recover
genomic/transcriptomic sequence, often from “weird”
samples.
• Scaling sequence assembly approaches so that huge
assemblies are possible and big assemblies are
straightforward.
• “Better science through superior software”
5. There is quite a bit of life left to sequence & assemble.
http://pacelab.colorado.edu/
6. “Weird” biological samples:
• Single genome
• Transcriptome: differential expression!
• High polymorphism data: multiple alleles
• Whole genome amplified: often extreme amplification bias
• Metagenome (mixed microbial community): differential abundance within community
• Hard to sequence DNA (e.g. GC/AT bias)
11. New problem: data analysis &
integration!
• Once you can generate virtually any data set you want…
• …the next problem becomes finding your answer in the data
set!
• Think of it as a gigantic NSA treasure hunt: you know there are
terrorists out there, but to find them you have to hunt through 1 bn
phone calls a day…
12. “Heuristics”
• What do computers do when the answer is either really, really
hard to compute exactly, or actually impossible?
• They approximate! Or guess!
• The term “heuristic” refers to a guess, or shortcut
procedure, that usually returns a pretty good answer.
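A minimal sketch of the exact-versus-heuristic trade-off, using a toy problem chosen here for illustration (it is not taken from the slides): joining overlapping sequence fragments into the shortest string that contains them all. The function names are hypothetical, and the sketch assumes no fragment is fully contained in another. The exhaustive version tries every ordering, so its cost grows factorially; the greedy version repeatedly merges the best-overlapping pair, which is fast and usually gives a pretty good answer, but is not guaranteed to be the shortest.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge_in_order(fragments):
    """Join fragments left to right, reusing the overlap at each junction."""
    result = fragments[0]
    for frag in fragments[1:]:
        result += frag[overlap(result, frag):]
    return result


def exact_superstring(fragments):
    """Exhaustive search: try every ordering (n! of them), keep the shortest.

    Assumes no fragment is a substring of another fragment.
    """
    return min((merge_in_order(p) for p in permutations(fragments)), key=len)


def greedy_superstring(fragments):
    """Heuristic: repeatedly merge the pair with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (-1, 0, 1)  # (overlap length, left index, right index)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        merged = frags[i] + frags[j][n:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags.append(merged)
    return frags[0]


toy_fragments = ["ATTAGAC", "TAGACCA", "GACCATG"]
print(exact_superstring(toy_fragments))
print(greedy_superstring(toy_fragments))
```

On three fragments both approaches are instant; the difference only matters as the number of fragments grows, which is where the shortcut earns its keep.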
14. My actual research focus
What we do is think about ways to get computers to play chess
better, by:
• Identifying better ways to guess;
• Speeding up the guessing process;
• Improving people’s ability to use the chess-playing computer.
Now, replace “play chess” with
“analyze biological data”...
15. My actual research focus…
We build tools that help experimental biologists work efficiently
and correctly with large amounts of data, to help answer their
scientific questions.
This touches on many problems, including:
• Computational and scientific correctness.
• Computational efficiency.
• Cultural divides between experimental biologists and
computational scientists.
• Lack of training (biology and medical curricula devoid of math
and computing).
24. Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
• Is single pass: looks at each read only once;
• Does not “collect” the majority of errors;
• Keeps all low-coverage reads;
• Smooths out coverage of regions.
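The bullets above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the lab's implementation: it counts k-mers exactly in a Python dict (a real tool would need a memory-efficient counter), estimates each read's coverage as the median count of its k-mers seen so far, and uses hypothetical names and default values for `k` and `cutoff`.

```python
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def digital_normalize(reads, k=20, cutoff=20):
    """Single pass over reads; keep a read only while its region is under-covered.

    k and cutoff are illustrative defaults (assumptions, not from the slides).
    """
    counts = {}  # exact k-mer counts; assumption made for simplicity
    kept = []
    for read in reads:  # each read is examined exactly once
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate possible
        # Estimated coverage = median abundance of this read's k-mers so far.
        estimate = median(counts.get(km, 0) for km in read_kmers)
        if estimate < cutoff:
            # Low-coverage read: keep it and record its k-mers.
            kept.append(read)
            for km in read_kmers:
                counts[km] = counts.get(km, 0) + 1
        # Otherwise the region is already saturated: drop the read, and
        # with it any errors it carries.
    return kept


abundant = "ACGTTGCAAG"
rare = "TTTTGGGGCC"
reads = [abundant] * 10 + [rare]
print(len(digital_normalize(reads, k=4, cutoff=3)))
```

In the toy input, the ten copies of the abundant read are capped at the cutoff while the single rare read is retained, which is the coverage-smoothing behaviour described above.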
30. Raw data (~10-100 GB) → Analysis → "Information" (~1 GB) → Database & integration
[Diagram: several "Information" outputs feed into database & integration.]
Restated:
Can we use lossy compression approaches to make
downstream analysis faster and better? (Yes.)
~2 GB – 2 TB of single-chassis RAM
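One generic example of a lossy, fixed-memory data structure is a Count-Min sketch for approximate k-mer counting. It is shown here only to illustrate the trade-off; the slides do not say which structure is used, and the class name, `width`, and `depth` below are assumptions. Memory is set up front by `width * depth`, no matter how many distinct items arrive; the price is that hash collisions can make counts too high (never too low).

```python
import hashlib


class CountMinSketch:
    """Approximate counter with memory fixed at width * depth integers."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        """Yield one (row, column) position per row, via a per-row hash."""
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        """Increment the item's slot in every row."""
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def estimate(self, item):
        """Smallest of the item's slots: an upper bound on its true count."""
        return min(self.tables[row][col] for row, col in self._slots(item))


sketch = CountMinSketch(width=64, depth=3)
for _ in range(5):
    sketch.add("ACGT")
sketch.add("TTTT")
print(sketch.estimate("ACGT"), sketch.estimate("TTTT"))
```

Shrinking `width` saves memory and raises the collision rate, so the error is something you choose rather than something that happens to you.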
31. Soil metagenome assembly
• Observation: 99% of microbes cannot easily be cultured in the
lab. (“The great plate count anomaly”)
• Many reasons why you can’t or don’t want to culture:
• Syntrophic relationships
• Niche-specificity or unknown physiology
• Dormant microbes
• Abundance within communities
Single-cell sequencing & shotgun metagenomics are two common
ways to investigate microbial communities.
32. Investigating soil microbial ecology
• What ecosystem level functions are present, and how do
microbes do them?
• How does agricultural soil differ from native soil?
• How does soil respond to climate perturbation?
• Questions that are not easy to answer without shotgun
sequencing:
• What kind of strain-level heterogeneity is present in the
population?
• What does the phage and viral population look like?
• What species are where?
35. Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp
Assembly results for Iowa corn and prairie (2 x ~300 Gbp soil metagenomes)
Total assembly    Total contigs (> 300 bp)    % reads assembled    Predicted protein coding
2.5 bill          4.5 mill                    19%                  5.3 mill
3.5 bill          5.9 mill                    22%                  6.8 mill
Adina Howe
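A quick check of the perspective figures, assuming (this is not stated on the slide) that the ~1200 bacterial genome equivalents refer to the two assembly totals combined, 2.5 and 3.5 billion bp:

```python
# Assembly totals from the slide, in base pairs.
assembly_totals_bp = [2_500_000_000, 3_500_000_000]
human_genome_bp = 3_000_000_000       # "~3 billion bp" from the slide
bacterial_genome_equivalents = 1200   # "~1200" from the slide

combined_bp = sum(assembly_totals_bp)

# How many human genomes' worth of sequence was assembled?
human_equivalents = combined_bp / human_genome_bp

# Average bacterial genome size implied by the ~1200 figure
# (assumption: the figure covers both assemblies together).
implied_bacterial_genome_bp = combined_bp / bacterial_genome_equivalents

print(combined_bp, human_equivalents, implied_bacterial_genome_bp)
```

Under that assumption the combined assembly is about two human genomes' worth of sequence, and the ~1200 figure implies an average of roughly 5 Mbp per bacterial genome.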
37. Tentative observations from our
soil samples:
• We need 100x as much data…
• Much of our sample may consist of phage.
• Phylogeny varies more than functional predictions.
• We see little to no strain variation within our samples
• Not bulk soil: very small, localized, and low-coverage samples
• We may be able to do selective, really deep sequencing and
then infer the rest from 16S.
• Implications for soil aggregate assembly?
38. I also work on…
• Genome assembly & analysis
• Transcriptome assembly and analysis
• Interpretation of annoyingly large data sets
40. Training opportunities
• PLB/MMG 810 (Shiu; ??)
• CSE 801/Intro BEACON course (Brown; FS ‘13)
“Intro to Computational Science for Evolutionary Biologists”
• CSE 801 bootcamp (late Sep)
• Software Carpentry bootcamp(s) (late Sep)
• Workshops in Applied Bioinformatics (Buell; ‘14?)
• Next-Gen Sequence Analysis Workshop (Brown; summer ‘14)
+ a variety of genomics courses that I can’t keep track of!
Becky Mansel will have these slides.
41. Unsolicited advice
Consider both faculty and non-faculty careers.
• It’s a bad time to be looking for faculty positions, and it’s a bad
time to be looking for funding; maybe this will improve in 10
years, maybe not.
• A PhD qualifies you for many, many more things than we will
(or can) tell you about!
• Specific advice:
• Network with industry folk; think beyond your advisor’s career.
• Write a blog: ivory.idyll.org/blog/advice-to-scientists-on-blogging.html
Editor's Notes
Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
Diginorm is a subsampling approach that may help assemble highly polymorphic sequences. Observed levels of variation are quite low relative to e.g. marine free spawning animals.