Job Talk Iowa State University Ag Bio Engineering
1. RIDING THE BIG DATA TIDAL WAVE IN MODERN MICROBIOLOGY
IOWA STATE UNIVERSITY
MARCH 12, 2014
Adina Howe, PhD
2. Outline of talk
My multi-discipline career
Biological sequencing: a game changer
Research – computational focus:
How to handle “big data” in biology
Research – biological focus:
The gut microbiome’s role in obesity?
Future research:
A flexible toolbox in a big playground
6. Background
Purdue University, BSME,
Mechanical Engineering
Purdue University, MS,
Environmental Engineering
(Sustainability)
University of Iowa, PhD,
Environmental Engineering
(Microbiology/Bioremediation)
Michigan State University
NSF Postdoc Math and Biology Fellow (cross-training)
Microbial Ecology (Jim Tiedje)
Bioinformatics (Titus Brown)
Computational Biologist
Microbiology / Microbial Ecology
7. Our shared challenges
Climate Change
Energy Supply
USGCRP 2009
www.alutiiq.com
http://guardianlv.com/
Human Health
An understanding
of microbial ecology
10. Gene / Genome Sequencing
Collect samples
Extract DNA
Sequence DNA
“Analyze” DNA to identify its content and origin
Taxonomy
(e.g., pathogenic E. coli)
Function
(e.g., degrades cellulose)
11. Cost of Sequencing
Stein, Genome Biology, 2010
E. coli genome 4,500,000 bp ($4.5M, 1992)
[Figure: DNA sequencing, Mbp per $ (log scale, 0.1 to 100,000,000), by year, 1990 to 2012]
12. Rapidly decreasing costs with
NGS Sequencing
Stein, Genome Biology, 2010
Next Generation Sequencing
4,500,000 bp (E. coli, $200, presently)
[Figure: DNA sequencing, Mbp per $ (log scale, 0.1 to 100,000,000), by year, 1990 to 2012]
13. Effects of low cost
sequencing…
First free-living bacterium sequenced
for billions of dollars and years of
analysis
Personal genome can be
mapped in a few days for
hundreds to a few thousand
dollars
15. The era of big data in biology
Stein, Genome Biology, 2010
Computational Hardware
(doubling time 14 months)
Sanger Sequencing
(doubling time 19 months)
NGS (Shotgun) Sequencing
(doubling time 5 months)
[Figure: disk storage (Mb/$) and DNA sequencing (Mbp per $), both on log scales, by year, 1990 to 2012]
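To make the gap between these doubling times concrete, here is a minimal sketch that converts each doubling time on this slide into a fold increase over a fixed period. It assumes steady exponential growth, which is a simplification, and the function and variable names are illustrative only.

```python
# Doubling times (in months) as quoted on the slide.
DOUBLING_TIMES = {
    "Computational hardware": 14,
    "Sanger sequencing": 19,
    "NGS (shotgun) sequencing": 5,
}


def growth_factor(months, doubling_time_months):
    """Fold increase after `months`, assuming a constant doubling time."""
    return 2.0 ** (months / doubling_time_months)


def compare(months=60):
    """Fold increase of each technology over the same period."""
    return {name: growth_factor(months, t) for name, t in DOUBLING_TIMES.items()}


for name, factor in compare(60).items():
    print(f"{name}: about {factor:,.0f}x over 5 years")
```

Under this assumption, five years gives roughly a 20-fold gain for hardware but a 4,096-fold gain for NGS, which is the mismatch the slide highlights.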
16. Postdoc experience with data
2003-2008 Cumulative sequencing in PhD = 2000 bp
2008-2009 Postdoc Year 1 = 50 Gbp
2009-2010 Postdoc Year 2 = 450 Gbp
17. Flexibility towards embracing change.
How to survive a data
deluge?
Experiment Design
Data Generation
Workflow / Tools
Data analysis
Applied Solutions
19. de novo assembly
Compresses dataset size significantly
Improved data quality (longer sequences, gene order)
Reference not necessary (novelty)
Raw sequencing data (“reads”) → Computational algorithms → Informative genes / genome
21. Shotgun sequencing and de novo
assembly
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
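The overlap idea behind this example can be sketched as a toy greedy assembler. This is only an illustration under strong assumptions: reads are error-free (unlike the fragments above, which contain deliberate errors) and the text has no long repeats. Production assemblers use different, more robust approaches, and the function names here are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to merge
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


print(greedy_assemble(["age of w", "of wisd", "wisdom"]))
```

Every read is compared with every other read on each pass, which is one reason assembly becomes so expensive as datasets grow.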
23. Practical Challenges – Intensive
computing
Howe et al, 2014, PNAS
Months of “computer crunching” on a supercomputer
Assembly of 300 Gbp can be done with any assembly program in less than 14 GB of RAM and in less than 24 hours.
27. Natural community characteristics
Diverse
Many organisms
(genomes)
Variable abundance
Most abundant organisms, sampled
more often
Assembly requires a minimum amount
of sampling
More sequencing, more errors
Sample 1x Sample 10x
Overkill
33. Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Scales down datasets for assembly by up to 95%, with the same assembly outputs.
Genomes, mRNA-seq, metagenomes (soils, gut, water)
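The general idea of digital normalization can be sketched as follows: estimate how often a read's k-mers have already been seen, and discard the read if that estimate is already at or above a coverage cutoff. This is a minimal sketch, not the khmer implementation. The exact `Counter` is an assumption standing in for a memory-efficient counting structure, and the values of `k` and `cutoff` are arbitrary.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=4, cutoff=2):
    """Keep a read only while its median k-mer count is below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)  # only kept reads add to the counts
    return kept


# Ten copies of an abundant sequence plus one rare sequence.
reads = ["AGTCAGTT"] * 10 + ["CCGGAATT"]
print(digital_normalize(reads))
```

In this toy input the abundant sequence is trimmed to two copies while the rare one is retained, which is the behavior that lets a dataset shrink without losing low-abundance organisms.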
34. Partitioning (khmer software)
Pell et al, 2012, PNAS
Howe et al., 2014, PNAS
Separates metagenomes by species
Parallel computing possible
Largest known published soil metagenome and assembly
38. The health and stability of the gut
microbiome (in response to diet change)
University of Chicago: Daina Ringus, PhD & Eugene Chang, MD
Experiment Design
Data Generation
Workflow / Tools
Data analysis
Applied Solutions
40. Interactions between the
microbiome and the environment
Source: Zhao, 2013
Obesity
Intestinal inflammation
IBD diseases
Diet has a greater
potential to shape the
structure and function of
gut than host genetics.
Direct influence on health
state
41. How resilient is the microbiome?
In mice, recovery from long term shift to obesity-inducing diet
In humans, microbiome rapidly and reproducibly recovers within 2 days (2013)
In mice, rapid recovery from long term shift to obesity-inducing diet (2012)
42. Is the gut community going viral?
Reyes et al., Nature Reviews Microbiology, 2012
Bacterial cells Bacterial cells infected
with bacteriophage
Viruses (Bacteriophage)
Vary by individual (Minot et al., 2011)
Altered by diet and co-vary with bacteria (Minot et al., 2011)
Long term stable (Minot et al., 2013)
Largely temperate (Reyes et al., 2013)
Prophage
Who is in the gut microbiome?
46. Research Questions
What are the impacts of different diets on gut
microbiome response?
What are the impacts of viruses in the gut
microbiome (rapid alteration and resilient
response?)
Multidisciplinary approach combining
novel experimental targeting of both bacterial and viral
communities
metagenomic-based sequencing to characterize
community
47. Novel experimental design – targeted
sampling of community fractions
I. Total DNA (bacteria + prophage + viruses) TOT
II. Virus-like particles
(free-living viruses)
VLP
III. Induced prophage
IND
Separation
by density
Chemically
separate
Separation
by size
Microbiome through
faecal matter (non
destructive sampling)
48. Two baseline diets (with a
perturbation)
Low-fat (LF) baseline diet
Milk-fat (MF) baseline diet
Age (wk)
4 5 6 7 8 9 10 11 12 13 14
Baseline / Diet Switch / Washout (Return to Baseline)
Total community function: TOT metagenomic sequencing at weeks 8, 11, 14
Virome community function: VLP, IND metagenomic sequencing at weeks 8, 11, 14
Weight of mice and count of VLPs with microscopy
Taxonomy analysis (only 16S rRNA gene) every week from week 8 to 14.
LF / 10% Fat / Complex Carbs
MF / 37% Fat / Simple Sugars
MF
LF MF
LF
Fecal Samples
49. Outcomes?
Low-fat (LF) baseline diet
Milk-fat (MF) baseline diet
Age (wk)
4 5 6 7 8 9 10 11 12 13 14
Baseline / Diet Switch / Washout (Return to Baseline)
LF / 10% Fat / Complex Carbs
MF / 37% Fat / Simple Sugars
MF
LF MF
LF
Qualitative and Quantitative Measurements:
Who is there? What are they doing?
How much?
50. How does the community change
over time?
[Figure: three panels (Altered-Recovery, Altered-Altered, No Change); y-axis: distance from baseline; x-axis: Baseline, Intervention, Washout]
Measurements of gene abundance profile
(200,000+ genes) reduced to a single
distance measurement from the original
community (ordination)
51. Rapid and resilient bacterial gut
response after diet alteration
[Figure: distance from baseline (y-axis) across Baseline, Intervention, Washout; marked ***]
54. Prophages in MF baseline are
significantly altered without
recovery.
[Figure: distance from baseline (y-axis, 0.0 to 0.3) across Baseline Diet, Perturbed, Washout]
55. “Combat Zone” as diets change
Milk-fat baseline (MF) mice have contrasting bacterial and viral responses, in
which there is not a rapid recovery of viral communities
56. Viral functions significantly
changed during the milk fat
baseline diet
Decreases in
Phage-related (p=0.01)
Iron acquisition (p<0.01)
Nucleotide metabolism (p=0.02)
Carbohydrate metabolism (p=0.01)
Motility and chemotaxis (p=0.03)
Virulence and defense (p=0.03)
Phage Iron
Nucleotide Carbs
Baseline - Change -- Washout
Flagella
57.
Bacteroides (Bacteroidetes)
Clostridium (Firmicutes)
Eubacterium (Firmicutes)
Significant decrease in genes
associated with MF baseline viruses
Ratio of Firmicutes and
Bacteroidetes associated with
obesity
Turnbaugh, 2008
Image credits: Bacteroides fragilis, Nutridesk.com; C. difficile, Bioquell.ie; National Geographic
Turnbaugh, 2009
58. Viromes potentially critical in gut
microbiome response.
Members of the gut microbiome community do not
have co-occurring responses.
Loss of viral population and diversity is diet
specific (related to a milkfat to lowfat diet
transition)
59. Ability to redirect structure and function of
microbiome makes them pivotal drivers of health and
disease
Reyes et al., Nature Reviews Microbiology, 2012
60. Virome directly causes host response
Germ Free 11 week old mice (n = 3)
Diet: Standard chow
3 week conventionalization
A “standard control”
Microbiome:
Uniform cecal content
of standard chow
mice
Experimentally
introduced viruses
Mouse Treatment 1: Lowfat baseline VLP
Mouse Treatment 2: Milkfat baseline VLP
Control: Buffer
61. Significant decrease of intestinal
inflammation in LF VLP treatments
Pro-inflammatory cytokines in mucosal scrapings: TNF-α, IFN-γ
[Figure: two bar charts for the proximal colon, TNF-alpha (ng/gl) and IFN-gamma (ng/g), each comparing Control, LF VLPs, and MF VLPs; y-axis ticks 0 to 15 and 0 to 30; one comparison marked *]
62. Conclusions
Gut microbiome has reproducible and distinct
responses to diet.
Viruses have a unique response to diet
perturbations and do not co-occur with bacteria.
Viruses observed to cause inflammation in
infected germ free mice.
Big data workflow enabled strategic sampling
design providing unparalleled access to
viruses of gut microbiome
66. Core research values
Research that matters
Developing scientific frameworks that enable
open-science initiatives (reproducible science)
Computational and experimental integration
Scale and power to multi-disciplinary
approaches
Team value
Flexibility
67. Going viral: The role of the human gut
phageome in inflammatory bowel disease
Objectives:
Define and compare core phageomes
associated with healthy and diseased
gut microbiomes
Determine impact of disease-associated
gut phageomes on development of
disease in knockout mouse models
(predisposed to disease)
NIH, National Institute of Diabetes and Digestive and
Kidney Diseases; National Institute of Allergy and Infectious
Diseases ($3-5M)
Source: Nature.com
What is the role of host-phage
dynamics in the development of
intestinal diseases?
Integration of multiple datasets
Improved model systems and
biomarkers
68. Microbial drivers of carbon metabolism and
warming
DOE Biological and Environmental
Research ($3M/3 years, 40% PI with
ISU Kirsten Hofmockel, 2013-2016)
Source: Oak Ridge National Laboratory
Contributions:
• Omic-based characterization of carbon cycling microorganisms
in the soil
• Novel approaches to target carbon cycling subsets of
community
• Improved soil genomic databases to enable future carbon
studies
Source: Oak Ridge National Laboratory
How do microbes contribute to
carbon cycling models?
Big data scaling
Integration of multiple
datasets
69. Large-scale characterization of global dark
matter proteins in complex biological
environments
NIH – Development of Software and Analysis Methods for Biomedical
Big Data in Targeted Areas of High Need
(~$1M/3 years)
Gordon and Betty Moore – Data Driven Discovery Investigator Awards
($1.5M / 5 years)
Novel extension of current software tools:
• Integration of growing volumes of global public datasets with scalable
data-mining analysis
• Lightweight data architecture to compare abundance and co-
occurrence of sequencing patterns across multiple samples and
associated metadata to elucidate information
How do we access the novelty observed in metagenomic datasets?
Big data scaling
Integration of datasets
70. From field to food: The origin and
fate of our microbiomes
USDA Agriculture and Food Research Initiative ($1-2.5M)
• Identify and characterize under-
researched foodborne microbial hazards
and effective control strategies
• Elucidate fate and dissemination of
foodborne microbial hazards associated
with produce production and processing
Source: aboretum.umn.edu
Where do harmful microbes in our food come
from and how do we protect ourselves from
them?
Integration of multiple datasets
Improved model systems and
71. Acknowledgements
Funding
DOE Microbial Carbon Cycling Grant
NSF Postdoc Fellowship, Great Lakes Bioenergy
Research Center
Microbiome: University of Chicago Digestive Diseases
Research Core Pilot and Feasibility Grant
My Awesome INTER-DISCIPLINARY Team
C. Titus Brown (MSU) + lab (Bioinformatics)
James Tiedje (MSU) + lab (Microbial Ecology)
Daina Ringus (UC) (Microbiology / Mice)
Kirsten Hofmockel, Ryan Williams, Fan Yang (ISU)
Eugene Chang (UC)
Folker Meyer (ANL)
73. Reducing data, not information.
More efficient data storage and mining.
Big data scaling approaches
74. Storage of biological big data
What other sequences are connected to
Sequence X?
Data broken into words of length “k” (k-mers)
Overlap (for assembly) = shared “word”
Pell, PNAS, 2014
Howe, PNAS,
AGTCAGTT
Into its 4-mers:
AGTC
GTCA
TCAG
CAGT
AGTT
AGAAAGTC
Into its 4-mers:
AGAA
GAAA
AAAG
AAGT
AGTC
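As a small illustration of breaking sequences into words and finding a shared word, the following sketch decomposes the two example sequences from this slide into 4-mers and intersects them. The function names are illustrative only.

```python
def kmers(seq, k):
    """All overlapping words of length k in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shared_kmers(a, b, k):
    """Words of length k that occur in both sequences."""
    return set(kmers(a, k)) & set(kmers(b, k))


first = "AGTCAGTT"
second = "AGAAAGTC"
print(kmers(first, 4))
print(kmers(second, 4))
print(shared_kmers(first, second, 4))
```

The two sequences share the word AGTC, which is the kind of overlap an assembler would use to connect them.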
75. Storage of biological big data
What other sequences are connected to
Sequence X?
Data broken into words of length “k” (k-mers)
Overlap (for assembly) = shared “word”
How do we store “big data” words?
Bloom filter data structure
Efficient storage
76. Do I have mail?
What other sequences are connected to
Sequence X?
Data broken into bins of word length “k” (k-mers)
Overlap (for assembly) = shared “word”
How do we store “big data” words?
Bloom filter data structure
Mailbox analogy
A-G H-R S-Z
Pell, PNAS, 2014
Howe, PNAS,
77. Is Sequence A connected to Sequence B?
Data broken into bins of word length “k” (k-mers)
Overlap (for assembly) = shared “word”
How do we store “big data” words?
Bloom filter data structure
Mailbox analogy – Efficient storage of information
A-G H-R S-Z
A-G* H-R S-Z
No mail for Howe, 100% sure.
A-G H-R* S-Z
Possibly mail for Howe.
Pell, PNAS, 2014
Howe, PNAS,
Do I have mail?
78. Is Sequence A connected to Sequence B?
Data broken into bins of word length “k” (k-mers)
Overlap (for assembly) = shared “word”
How do we store “big data” words?
Bloom filter data structure
Mailbox analogy – Efficient storage of information
A-G H-R S-Z
A-G H-R* S-Z
G-N* A-F; O-T U-Z
D-H* A-C; I-O P-Z
Howe mail status:
Mail possibility higher.
Do I have mail?
79. Is Sequence A connected to Sequence B?
Data broken into bins of word length “k” (k-mers)
Overlap (for assembly) = shared “word”
How do we store “big data” words?
Bloom filter data structure
Mailbox analogy – Efficient storage of information
A-G H-R S-Z
A-G H-R* S-Z
G-N* A-F; O-T U-Z
D-H A-C; I-O P-Z
Howe mail status:
No mail, 100% sure.
Do I have mail?
80. Bloom filter data structure
“Probabilistic” data structure
Decrease of false positive rate with multiple
Bloom filters – “More likely I have mail”
No false negatives – “No mail. 100% sure”
For the win: both detects and counts presence
of sequences (k-mers) and their connectivity
efficiently
Is sequence A connected to sequence B?
Pell, PNAS, 2014
Howe, PNAS,
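The mailbox analogy maps onto a Bloom filter roughly as follows. This is a minimal sketch for presence only (it does not count), and it is not the khmer implementation. The class name, the SHA-256 based hashing, and the default sizes are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: membership tests with no false negatives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per slot, for readability

    def _positions(self, item):
        # Derive several slot positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # False means "no mail, 100% sure"; True means "possibly mail".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for word in ["AGTC", "GTCA", "TCAG", "CAGT", "AGTT"]:
    bf.add(word)
print("AGTC" in bf)
```

Each hash position plays the role of one mailbox scheme: a single unmarked slot rules an item out, while every additional marked slot makes a true match more likely.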
Editor's notes
Hi, thanks for inviting me to talk to you today and for taking the time to come learn a little about my research. I’ll admit that this is one of the longer talks I’ve ever given. I’ve had only one other 90-minute talk, and it was to a group of Korean government officials who were interested in what a framework for big data analysis for a community might look like. But since they did not speak English, half of that talk was given by a translator.
Today, I’m going to give you an overview of my research, which is very much interdisciplinary, living on the edge of both computational biology and microbial ecology (the study of natural communities in the environment). I’m going to tell you a bit about my background and what shaped the research I’ve become involved in. Then I’m going to highlight a couple of research projects: the first will have more of a computational focus, where I tell you about research that tackled the “data deluge” that emerged from fast-changing sequencing technologies. Then I’ll tell you a story of how we used these tools on data to investigate how our bodies (which are in themselves a natural ecosystem) respond to dietary changes. Finally, I’ll conclude with a discussion of where I view these efforts going in the future.
Folks often ask me how I went from a Mechanical Engineering degree to microbial ecology, as it’s not the most conventional track. I actually think that if you talk to most people in the field now, many of them arrived here by unconventional paths. But something that I think is shared is that you pursue research that is enabling and that can make a real difference. As an ME at Purdue, I had the opportunity to do two internships where I looked at the environmental impacts of industrial machinery. In particular, in my junior year, I worked at Exxon Mobil evaluating sustainable replacements for outdated compressors on an oil platform, which pump the unrefined oil from the platform to the refinery. This got me very interested in understanding the environmental impacts of economic decisions and how we should evaluate them, and it brought me into a program in Environmental Engineering, then in its first year at Purdue, focusing on sustainability research. It was here that I first learned about microbiology and how it impacts our lives in so many different ways; visiting a wastewater treatment plant was a bit of a life-changing moment for me. I really fell in love with the natural ability of these invisible lifeforms to create the world around us. So I went to grad school at the University of Iowa, where I worked on how to monitor the activity of microbes that degrade pollutants in both groundwater and soil. One of the main challenges I had was that we were always working with “model organisms” which didn’t necessarily match what was in the natural environment. After my PhD, the field of metagenomics was in its infancy but was being touted as a huge opportunity for studying natural environments. Jim Tiedje at MSU needed someone to start working with this sort of data, and though I had no experience in it, I was willing to give it a try, and with the support of the NSF and Titus Brown at MSU, that’s what we did.
I must’ve done a decent job of it because I was recruited by Argonne National Lab a couple years ago to provide support for some projects ongoing locally there.
There are several grand challenges that our society is currently facing which I think are of paramount importance. These are predicting and managing the impacts of climate change, finding sustainable sources of liquid fuels, and understanding the emerging pandemics facing human health in recent years. From carbon emissions from land use (which are magnitudes greater than those from cars), to degrading cellulosic biomass, to pathogens in our bodies, microbes are involved in complex communities that drive the health and productivity of either our natural resources or our own bodies. And it’s building up the expertise to ask
My research explores these complex communities. These microbial communities are all connected: the food we eat contains microbes, which we then “introduce into the environment” (mainly through wastewater treatment), and these and other microbes then impact biogeochemical cycling, which affects the global climate cycle and the flow of nutrients in natural systems. As I talk about my research today, I want to be sure to emphasize that it is broadly applicable.
We’ve known about the importance of these environmental microbes for a long time, and much research has been spent answering three seemingly simple questions. One of the reasons this has historically been so challenging is something known as the great plate count anomaly. We know that there is a diverse world of microbes out there, but when we go into the laboratory and try to study their characteristics, we cannot grow them.
When the first automated DNA sequencing machines came online in the late 80s, microbiologists had a new way to ask questions like: who is there, and what are they doing? If we could access the microbe that we were interested in, we could extract and sequence its DNA. We could then compare this DNA to previously seen DNA, identify the “who” and “what”, and add to the encyclopedia of genes we had information about.
Sequencing opened the door to start building a catalog of some observed key microbial players. It was expensive but effective. Many of the choices of who got sequenced were driven mostly by health and biotech. And this is the same set of reference genes that is still in use today. This graph shows how expensive sequencing was in the early 90s, and how this cost changed over the years with early sequencing technologies.
What changed the field was the invention of next-generation technologies, basically allowing the throughput of these automated sequencers to be much higher and the cost of sequencing much lower. So cheap, in fact, that instead of sequencing only one bacterium you could start sequencing multiple, even bacteria from complex environments.
You may have seen this in the news, and recently highlighted on NPR under the subject of personalized medicine, and how it is getting to the point where we can all have our genomes sequenced as a baseline for our future health.
This sequencing also opens the door to start studying not only single isolates, but all the organisms in a natural system. So then the question is not only who is there and what are they doing, but what are they doing together, and how?
With this growth and opportunity, however, have come other challenges. What we are now dealing with, though, is that sequencing technology is growing more rapidly than the computers used even to store the data, let alone to run the types of analysis that we need to make this data informative.
25 million times… And this is when I started my postdoc. To give some very concrete examples: within the first year of my postdoc, the data I had to analyze grew from the largest known soil metagenome (a collection of environmental DNA sequences) at 50 million reads to about 40x that within literally 9 months. At that time, we were already overwhelmed with this much data. And to put this in the perspective of other datasets that were available at the time.
So whereas I had spent a lot of time during my PhD learning how to grow bacteria and design an experiment, now I was faced with an experiment that was designed to collect a lot of essential data, but with no way to start analyzing the data simply because the available tools wouldn’t work.
One of the most effective ways to reduce genomic sequencing data is to do something referred to as genomic, or in our case metagenomic, assembly.
Assembly is the process of rebuilding the original genome from the fragments of sequence we get from a sequencing machine. Essentially, it’s solving a puzzle where you look for overlaps among shredded fragments to predict a consensus sequence. If you do this without any previous information (without a guiding reference), you would call the process de novo assembly. Assembly has several advantages.
I want to emphasize that the difference between a single genome assembly (like that of a pure culture) and a metagenomic assembly (like that of DNA from a complex environment such as soil) is a huge difference of scale. And with this difference come multiple challenges.
Again, assembly is the process of trying to come up with a consensus sequence based on finding overlaps in small fragments. Here is an example of an “assembly” of a sampling of the novel “A Tale of Two Cities”. You’ll notice that because you have enough sampling of this sentence, you can get a good guess of what the original information looked like. You’ll also notice that there are some obstacles to getting the right solution: there are mistakes in the sampling, analogous to sequencing errors, which you would have to decide on some criteria to resolve.
In this example, we are coming up with a solution of one sentence using 8 fragments. In metagenomic assembly, you are trying to come up with hundreds to thousands to even millions of genomes using billions of fragments. And to do this, you have to compare each fragment to every other one in the dataset, making it very computationally intensive.
To give you an idea of what computationally intensive means, even the smallest dataset that I had at the beginning of my postdoc required several months on a supercomputer, something having over 100 GB of RAM. These were resources I simply didn’t have at the time. And for my larger datasets, there was simply nothing I could do with them; they would essentially crash any assembly program that existed.
So I had to come up with a way to deal with all of this data, or else there were a handful of PIs who had just invested tens of thousands of dollars in a project where we couldn’t tractably handle the datasets.
I’m going to tell you now about how we were able to do this, and there were actually two different strategies we had to combine.
Firstly, let me acknowledge that assembly for single organisms (especially bacterial ones) is relatively mature. So one of the first things we thought about is what makes natural communities different from single organisms, and there are two main factors. One is that natural communities are diverse. There are multiple genomes, and potentially even millions of species, in a sample. This is represented here by the presence of red, blue, and green organisms. The other main difference is that these organisms are present at variable abundance in nature: some are highly abundant, some are not. The strategy we came up with was to ask: can we find the minimal dataset that you need for assembly, discarding the reads from this overkill section?
From a sequencing standpoint then, what we see is that for a given genome (represented here as a dotted line), we start sampling fragments from it.
As we sample more, we will have some sequences which will have errors in them.
And we’ll keep sequencing this genome, randomly sampling different parts of it. We’ll get to a point where we have enough sequences that we can make a good guess at what the original sequence may have looked like.
For example, here we have a total of 6 sequences covering the part highlighted by the black arrow, so we can be confident in saying we know what that part is. From experience, I know that about 6 sequences are needed to get an accurate assembly. So anything beyond these 6 is excessive or redundant information.
So we can discard or set aside this read and not use it for our assembly. And that actually turns out to be a good thing because in discarding this information, we’re actually removing data with errors in it.
In the end, we end up with the minimal dataset needed for an assembly, shown here in pink, and a redundant set of information which we have set aside. In setting aside these reads, here in red, we also get to discard sequencing errors, which actually improves assembly results. So essentially, what we’ve shown is that assembling just the pink data gives at least as good an assembly as assembling all of the data, if not an improved one. By assembling just the pink dataset, though, we’re able to reduce the amount of data we’re working with by up to 95% in some environmental datasets I’m working with.
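The keep-or-set-aside decision can be sketched in a few lines of Python. This is a minimal sketch, not our actual implementation: it assumes that the coverage of a read is estimated as the median count of its k-mers among the reads kept so far, it uses an exact `Counter` where a real tool would need a far more memory-efficient counter, and the names `normalize`, `k`, and `cutoff` are chosen for this example only. The cutoff of 6 mirrors the coverage mentioned above.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """All overlapping words of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize(reads, k=4, cutoff=6):
    """Split reads into a minimal kept set and a redundant set-aside set."""
    counts = Counter()          # k-mer counts among kept reads only
    kept, set_aside = [], []
    for read in reads:
        words = kmers(read, k)
        if not words:           # read shorter than k: nothing to estimate
            set_aside.append(read)
            continue
        # Assumed coverage estimate: median count of the read's k-mers.
        estimated_coverage = median(counts[w] for w in words)
        if estimated_coverage < cutoff:
            kept.append(read)
            counts.update(words)
        else:
            set_aside.append(read)  # already covered enough; redundant
    return kept, set_aside


abundant = "ACGTTGCATGA"   # sampled ten times
rare = "GGGCCCAAATTT"      # sampled once
kept, set_aside = normalize([abundant] * 10 + [rare])
print(len(kept), len(set_aside))
```

In this toy input the highly sampled read is kept only until it reaches the cutoff, while the rare read is kept regardless, which is the behavior we want for communities with variable abundance.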
Another tool we’ve developed to deal with biological big data is a lightweight data structure that can break apart these datasets by connectivity.
These methods were developed for sequencing data from an investigation of soil biodiversity in both managed and natural soil systems. Soil biodiversity is amazing.
Great Prairie: the world’s most fertile soil. It is an important reference site for the biological basis and ecosystems of soil microbial communities. It sequesters the most carbon and produces a large amount of biomass annually, which is key for biofuels and security.
“We know surprisingly little about the identities and functions of the microbes inhabiting soil.” With applications of DNA sequencing, the field was really excited about how we could now gauge this specific ecological niche and its responsiveness to change. Once we came up with how to deal with the data and sift through the gleaned information, it was a sobering reality check on just how hard a challenge these environments will be.
Overall, many functions are shared between corn and prairie soils. Interestingly, prairie soils have many more unique functions (indicated here as blue bars) compared to the unique functions in the corn soils (here in green). This result may reflect the differing management history of these two soils. Unlike the prairie soils, which have never been tilled, the corn soils have been cultivated for more than 100 years and have had annual additions of animal manure that could potentially enrich specific metabolic pathways while decreasing diversity.
I’m fortunate to have worked on many projects in which I feel that this is true.
Ok, let’s talk about the gut microbiome and how it responds to different diets. This study is a collaboration with the UC.
In recent years, there has been a growing appreciation for the fact that, as humans, we are in fact supraorganisms composed of both human and microbial cells, and as such we carry two sets of genes, those encoded in our own genome and those encoded in our microbiota. We genetically inherit only ~1% of our genes from our parents, and the remaining ~99% is mainly acquired from the immediate environment when we are born. Importantly, all the genes in our body, whether human or microbiome encoded, have the potential to have an impact on our health.
The gut is the most densely colonized microbial community in the human body and is also one of the most diverse. The gut functions as a chemostat, a continuous culture system for microorganisms (mostly bacteria) in which fresh nutrients enter the system and cultured microorganisms leave at a relatively constant rate. Approximately 1.5 kg of bacteria are resident in our gut, and 50% of our faecal matter biomass is bacterial cells.
The gut microbiota interact with our genetics and our environment (mainly diet) to influence our health. The gut microbiota release toxins, such as lipopolysaccharides, and beneficial metabolites, such as vitamins and short-chain fatty acids, which damage or nourish us. We know that diet has a greater potential to shape the structure and function of the gut microbiota than host genetics, thus influencing our health state directly.
There are two key efforts that have looked into the response of gut communities to diet changes – one was Zhang et al., who worked in mice.
After reading these studies, a key question I had was whether anyone had looked into the viral components of the human gut. Much of the gut microbiome literature focuses only on the bacterial component of the gut, but we know that viruses are abundant in the gut environment, present at a ratio of 1:1. So beyond bacterial cells, there are…viruses as prophage…
As far as I know, there are only 3 studies that have looked at the gut virome. From these preliminary studies, …
The majority of these viruses are phages, or viruses that only infect bacteria. These viruses are pivotal drivers of gut health and disease, as these phages are able to redirect the structure and function of the entire gut microbial community.
Specifically, phages can alter the fitness and function of bacterial populations through their transfer of genetic material,
skew the abundance of bacterial populations by infection, and drive the evolution of the community with their diversity and modifications of bacterial hosts.
The broad range of subtle and robust effects that phages can exert on the gut microbial community makes them key targets in understanding health and the pathogenesis of gut diseases.
Overlooked
We know that we’ve previously seen in bacterial communities…
Access through faecal matter.
We targeted each of these communities in mice that had been fed a baseline diet, then switched diets, and then returned to their original diet. This allows us to see what was altered by diet, by how much, and whether functions returned to normal once the mice returned to the baseline diet (the washout). We did this with two diets to see if they produced distinct responses. Additionally, we took 16S rRNA gene samples from fecal samples every week of the experiment so we could get better resolution of the changes in community structure. The diets that we are studying here differ mainly in their fat content. There is a low fat (LF) diet and a milk fat (MF) diet, with about a four-fold difference in fat content between them. But another thing to note is that these diets also differ in the amount of corn starch and sugar: the LF diet has more corn starch, so complex carbs, and the MF diet has more simple sugars, like sucrose. Finally, also note that I designed this experiment to really access the virome part of the microbiome, much more so than has previously been looked at.
A key result I would like to discuss is how different communities (both bacterial and viral) change over the course of this experiment. To talk about this result, I’m going to show you the change in abundance of over 200,000 genes in a way that you can visually interpret. These analysis products are actually a whole other talk about the challenges of presenting biological big data. Basically, to see how communities change over time, I’m going to estimate how different they are as a “distance” from their original baseline communities. You can imagine three possible ideal situations.
The data allows me to look at not only “if communities change” but “how communities change”
One of the strongest signals in the viruses in the MF baseline over time is the decrease of phage-related functions, which is accompanied by a decrease in the richness and diversity of the viral communities. (BLUE)
The strongest signal is that phage functions decrease significantly in free-living virus communities, and that there is a loss of both the abundance of the free-living viral community membership and its diversity during the MF diet.
This corresponds to significant decreases in functions encoded within genes of this community.
Reduced availability for the total community, through viral infection.
When we looked at significant changes in sequences associated with different organisms in the MF diet, the phyla Bacteroidetes and Firmicutes showed significant decreases, especially in viruses related to these hosts. This is consistent with previous reports.
We know that diet can be a cause of obesity and that we all respond to different diets. Yet we find that, in general, the bacteria in our guts are resilient to change. So my thought is that there must be something else directing our bodies’ response to diet change…and I would suggest that it could be viral populations…
These viruses are likely to be a pivotal driver of gut health and disease as these phages are able to redirect the structure and function of the entire gut microbial community.
Potential consequences of a temperate phage life cycle in the human gut. Metagenomic studies of viruses suggest that a temperate lifestyle is dominant in the distal human gut, in contrast to the lytic lifestyle observed in open oceans. This temperate lifestyle can have benefits for the phage and the bacterial host, and can alter phage–host dynamics. Integration as a prophage (part a) protects the host from superinfection, effectively ‘immunizing’ the bacterial host against infection from the same or a closely related phage. Furthermore, the genes encoded by the phage genome may expand the niche of the bacterial host by enabling metabolism of new nutrient sources (for example, carbohydrates), providing antibiotic resistance, conveying virulence factors or altering host gene expression. This temperate (lysogenic) life cycle allows phage expansion in a 1:1 ratio with the bacterial host. If the prophage conveys increased fitness to its bacterial host, there will be an increased prevalence of the host and phage in the microbiota. Induction of a lytic cycle (part b) can follow a lysogenic state and can be triggered by environmental stress. As a consequence, bacterial turnover is accelerated and energy utilization is optimized through a ‘phage shunt’, in which the debris remaining after lysis is used as a nutrient source by the surviving bacterial population. Furthermore, a bacterial subpopulation that undergoes lytic induction sweeps away other sensitive species and increases the niche for survivors (that is, bacteria that already have the specific phage integrated into their genome). Periodic induction of prophages can also lead to a constant-diversity dynamic [139], which helps maintain community structure and functional efficiency. Novel infections or infections of novel bacterial hosts by phages (part c) bring the benefit of horizontally transferred genes and create selective pressure on the hosts for diversification of their phage receptors, which are often involved in carbohydrate utilization. HGT, horizontal gene transfer. (Nature Reviews | Microbiology)
For the final part of my talk today, I wanted to present some of my ideas for future work, especially work that I feel would be successful here at Iowa State with your expertise and resources.
I hope I’ve presented to you how I’ve been able to leverage next generation sequencing data and big data in biology to start enabling data-driven discovery in this field. Iowa State has done what I think is a really smart thing in identifying this as a great opportunity for research in the future. This parallels a heavy investment at both government and private funding agencies.
Philip E. Bourne, Ph.D., as the first permanent Associate Director for Data Science (ADDS). Dr. Bourne is expected to join the NIH in early 2014.
“Phil will lead an NIH-wide priority initiative to take better advantage of the exponential growth of biomedical research datasets, which is an area of critical importance to biomedical research. The era of ‘Big Data’ has arrived, and it is vital that the NIH play a major role in coordinating access to and analysis of many different data types that make up this revolution in biological information,” said Collins.
My research is broadly applicable in multiple environments within a continuum. I’m also very multidisciplinary, and my research occurs on a continuum connecting high-throughput data discovery frameworks, big data scaling approaches, and the search for the best model systems or biomarkers to investigate complex natural communities.
In the next few slides, I’m going to go through some ideas I’ve had for grants that I will be writing in the next 5 years. But before I do that, I wanted to emphasize what I think are my core research values.
Building on viruses and their impacts on human health, a natural extension of the research I presented on the gut microbiome and diet change is to extend it to the human system. The NIH is heavily invested in questions like this.
The grant here is actually unique among the next slides in that it is already funded for the next few years. This is a grant with Kirsten Hofmockel here at ISU, where we’ve been funded to look at microbial drivers of carbon metabolism and warming. Our key goal here is to understand…
Unknowns in soil….A large opportunity for data driven discovery…
Finally, another area of interest I have is the origin and fate of our microbiomes.
So we’ve talked about digital normalization as an effective way of compressing the data or reducing it for assembly. Another strategy we used to deal with very large and complex metagenomes was to come up with a way to break them into pieces that could be analyzed separately. If you’re working on a jigsaw puzzle, this would be akin to separating out the different colored pieces to work on separately. To do this, we first needed a more efficient way to both store and query our data. And next I’ll tell you about how we did this.
The main challenge in dealing with our datasets is that we could not figure out whether sequence A was connected to sequence B within practical resources, in both time and computation. The first trick we could use is to break each sequencing read into words of length k. This way, rather than comparing long sequences, aligning them, and checking for overlaps, we could just look for the presence of the same “word” in two sequences to say that they should be connected together. This trick is a common one in assembly but isn’t enough for our dataset volume. We simply had too many words to store efficiently.
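As a rough illustration of the “words of length k” trick, the sketch below breaks reads into k-mers, calls two reads connected if they share at least one k-mer, and then groups reads into connected pieces, like sorting puzzle pieces by color. It is a toy under stated assumptions: the function names are hypothetical, it stores every k-mer exactly in a Python dictionary (which is precisely what does not scale to our data volumes), and it ignores sequencing errors and reverse complements.

```python
def kmers(seq, k):
    """The set of words of length k contained in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def connected(a, b, k):
    """Two sequences are connected if they share at least one k-mer."""
    return not kmers(a, k).isdisjoint(kmers(b, k))


def partition(reads, k):
    """Group reads into sets that are linked by shared k-mers."""
    parent = list(range(len(reads)))  # union-find over read indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for word in kmers(read, k):
            if word in first_seen:
                root_a, root_b = find(idx), find(first_seen[word])
                if root_a != root_b:
                    parent[root_a] = root_b
            else:
                first_seen[word] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return list(groups.values())


reads = ["ACGTTGCA", "TTGCAGGA", "GGGGCCCC", "GCCCCTTT"]
print(connected(reads[0], reads[1], 4))
print(len(partition(reads, 4)))
```

Each resulting group could then be handed to an assembler separately. The `first_seen` dictionary is the part that grows with the number of distinct words, which is why a more compact structure is needed.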
So we came up with the application of a data structure called a Bloom filter. And it’s this data structure that allowed us to start storing our huge datasets. A good way to think about Bloom filters is what I call the mailbox analogy.
Say you have to store the mail of everyone in this room in 3 slots. A solution would be to divide last names among these three slots and use them to let people know if they had mail. For example, if there is a marker in the slot that covers your last name, you would go check for mail.
So if an asterisk here represents that there is mail in that slot, and I wanted to know if I had mail, there are a couple of outcomes. One is that only a slot that doesn’t cover my last name has mail, so my own slot is empty. In that case, I know for sure that I do not have mail. Another possibility is that my own slot has mail. I could then possibly have mail or not have mail, because the mail could belong to someone else who shares the slot.
So you can see how we can use these smaller mailboxes to query whether or not we have mail for a large number of people hopefully.
If I wanted to make this even more efficient, I could add a couple more mailbox setups with varying divisions of last names. So if I came into this mailroom and looked for mail, I could see that mailboxes 1 and 2 say I could have mail, but since mailbox 3 is empty, I know I don’t have mail. It’s likely that someone else has mail in the two other mailboxes.
False negatives = sequences that are connected being identified as not connected; with this structure, this never happens
False positives = sequences may be identified as connected when they are not
And you can see how this sort of storage system allows us to have no false negatives. So we apply this same strategy to check for the presence or absence of a sequence, or more specifically a word of length k. And this then allows us to ask: is Sequence A connected to Sequence B?
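The mailbox analogy maps onto code fairly directly. Below is a minimal Bloom filter sketch in Python, not the structure we actually built: the class name, the default sizes, and the use of seeded SHA-256 as the hash functions are all assumptions made for illustration. Each hash function plays the role of one mailbox setup, and a word is reported as present only if every setup has a marker in the word’s slot.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per slot for readability; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, word):
        # One slot per hash function (one "mailbox setup" each).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos] = 1

    def __contains__(self, word):
        # An empty slot in any setup means the word was definitely never added.
        return all(self.bits[pos] for pos in self._positions(word))


k = 4
read = "ACGTTGCATGA"
seen = BloomFilter()
for i in range(len(read) - k + 1):
    seen.add(read[i:i + k])

print("ACGT" in seen)  # a word we stored is always found
```

A word that was never added can still be reported as present if other words happened to mark all of its slots, which is the false positive case. The memory used is fixed by `num_bits` no matter how many words are added; the false positive rate rises as the filter fills up.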