Job Talk Iowa State University Ag Bio Engineering

RIDING THE BIG DATA TIDAL WAVE IN MODERN MICROBIOLOGY
IOWA STATE UNIVERSITY
MARCH 12, 2014
Adina Howe, PhD

Outline of talk
My multi-discipline career
Biological sequencing: a game changer
Research – computational focus:
How to handle “big data” in biology
Research – biological focus:
The gut microbiome’s role in obesity?
Future research:
A flexible toolbox in a big playground

Background
Purdue University, BSME,
Mechanical Engineering
Purdue University, MS,
Environmental Engineering
(Sustainability)

Background
University of Iowa, PhD,
(Microbiology/Bioremediation)

Background
Michigan State University
NSF Postdoc Math and Biology Fellow (cross-training)
Microbial Ecology (Jim Tiedje)
Bioinformatics (Titus Brown)

Background
Computational Biologist
Microbiology / Microbial Ecology

Our shared challenges
Climate Change
Energy Supply
Human Health
An understanding of microbial ecology
Image sources: USGCRP 2009; www.alutiiq.com; http://guardianlv.com/

Environmental continuum
MICROBES IN ECOSYSTEMS
NATURE: AIR, WATER, SOIL
MICROBIOMES: HUMANS/ANIMAL
ENGINEERED: BIOREACTORS, WASTEWATER

Understanding community dynamics
• Who is there?
• What are they doing?
• How are they doing it?
Kim Lewis, 2010

Gene / Genome Sequencing
• Collect samples
• Extract DNA
• Sequence DNA
• “Analyze” DNA to identify its content and origin
Taxonomy (e.g., pathogenic E. coli)
Function (e.g., degrades cellulose)

Cost of Sequencing
Stein, Genome Biology, 2010
E. coli genome 4,500,000 bp ($4.5M, 1992)
[Figure: DNA sequencing, Mbp per $ (log scale), by year, 1990 to 2012]

Rapidly decreasing costs with Next Generation Sequencing (NGS)
E. coli genome 4,500,000 bp ($200, presently)
[Figure: sequencing output per dollar (log scale) by year, 1990 to 2012]
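The two price points quoted for the E. coli genome imply a per-base cost and a fold change that can be worked out directly. A minimal sketch using only the slide figures (4,500,000 bp; $4.5M in 1992; $200 at the time of the talk); the variable names are illustrative.

```python
# Figures taken from the slides; variable names are illustrative.
GENOME_BP = 4_500_000   # E. coli genome size in base pairs
COST_1992 = 4_500_000   # dollars to sequence the genome in 1992
COST_NOW = 200          # dollars at the time of the talk (2014)

cost_per_bp_1992 = COST_1992 / GENOME_BP
cost_per_bp_now = COST_NOW / GENOME_BP
fold_cheaper = COST_1992 / COST_NOW

print(f"1992: ${cost_per_bp_1992:.2f} per bp")
print(f"2014: ${cost_per_bp_now:.7f} per bp")
print(f"Fold reduction in cost: {fold_cheaper:,.0f}x")
```

On these numbers the cost drops from one dollar per base to a small fraction of a cent per base.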

Effects of low cost
sequencing…
First free-living bacterium sequenced
for billions of dollars and years of
analysis
A personal genome can be
mapped in a few days for
hundreds to a few thousand
dollars

The experimental continuum
Single Isolate
Pure Culture
Enrichment
Mixed Cultures
Natural systems

The era of big data in biology
[Figure: disk storage (Mb/$) and sequencing output per dollar (log scales) by year, 1990 to 2012. Series: Computational Hardware (doubling time 14 months), Sanger Sequencing, NGS (Shotgun) Sequencing]

Postdoc experience with data
2003-2008 Cumulative sequencing in PhD = 2000 bp
2008-2009 Postdoc Year 1 = 50 Gbp
2009-2010 Postdoc Year 2 = 450 Gbp

Flexibility towards embracing change.
How to survive a data deluge?
Experiment Design
Data Generation
Workflow / Tools
Data analysis
Applied Solutions

Reducing data volume:
Assembly of Metagenomic
Sequences
MSU: C. Titus Brown and James Tiedje

de novo assembly
Compresses dataset size significantly
Improved data quality (longer sequences, gene order)
Reference not necessary (novelty)
Raw sequencing data (“reads”) -> Computational algorithms -> Informative genes / genome

Metagenome assembly…a scaling
problem.

Shotgun sequencing and de novo
assembly
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
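The idea of rebuilding a text from overlapping fragments can be sketched with a toy greedy overlap assembler. This is a simplified illustration, not the assembler used in the work described here: it assumes error-free reads (unlike the fragments above, which contain deliberate errors), ignores repeats, and uses hypothetical function names and a short three-read example.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


# Toy, error-free reads drawn from the opening words of the sentence.
reads = ["It was the b", "s the best o", "best of times"]
contigs = greedy_assemble(reads)
print(contigs)
```

The all-against-all comparison in the inner loops is what makes naive assembly scale poorly as the number of reads grows.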

Practical Challenges – Intensive
computing
Howe et al, 2014, PNAS
Months of
“computer
crunching” on a
super computer

Practical Challenges – Intensive
computing
Assembly of 300 Gbp can be
done with any assembly program
in less than 14 GB RAM and less
than 24 hours.

Natural community characteristics
• Diverse
• Many organisms (genomes)
• Variable abundance
• Most abundant organisms are sampled more often
• Assembly requires a minimum amount of sampling
• More sequencing, more errors
[Figure: sampling depth illustration, Sample 1x vs. Sample 10x, with the label Overkill]

Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
• Reduces datasets for assembly by up to 95% with the same assembly outputs.
• Genomes, mRNA-seq, metagenomes (soils, gut, water)
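The core idea of digital normalization can be sketched as follows: a read is kept only if the median count of its k-mers, among reads kept so far, is still below a cutoff. This is a minimal sketch under stated assumptions: it uses an exact in-memory counter rather than the memory-efficient counting structure a real implementation would need, and the function names, `k=4`, and `cutoff=2` are toy choices for illustration.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only while its k-mers are still under-sampled."""
    counts = Counter()  # exact counts; an assumption made for readability
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)
    return kept


# Ten copies of an abundant read plus one rare read.
reads = ["AGTCAGTT"] * 10 + ["AGAAAGTC"]
kept = digital_normalization(reads)
print(len(reads), "reads in,", len(kept), "reads kept")
```

The abundant read is capped at the cutoff while the rare read survives, which is how redundant data can be discarded without losing low-abundance sequence.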

Partitioning (khmer software)
Pell et al., 2012, PNAS
Howe et al., 2014, PNAS
• Separates metagenomes by species
• Parallel computing possible
• Largest known published soil metagenome and assembly
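Partitioning can be pictured as grouping reads into connected sets, where two reads are connected if they share a k-mer; each set can then be assembled independently and in parallel. The sketch below is a simplified stand-in, assuming exact k-mer lookups and a union-find structure; it is not the khmer implementation, and the function name and reads are hypothetical.

```python
def partition_reads(reads, k=4):
    """Group reads that are connected through shared k-mers."""
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the group representative, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for start in range(len(read) - k + 1):
            word = read[start:start + k]
            if word in first_seen:
                parent[find(idx)] = find(first_seen[word])  # union the two groups
            else:
                first_seen[word] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return sorted(groups.values())


reads = ["AGTCAGTT", "AGAAAGTC", "CCGGCCTA", "GGCCTAAC"]
partitions = partition_reads(reads)
print(partitions)
```

Here the first two reads share the word AGTC and the last two share GGCC, so the four reads fall into two independent partitions.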

Tackling Soil Biodiversity
Source: Chuck Hane

Tackling Soil Biodiversity
• Grand Challenge effort – 10% of soil biodiversity sampled
• Incredible soil biodiversity (estimated 10 Tbp/sample required)
• “To boldly go where no man has gone before”: >60% unknown
[Figure: total KO count (axis 0 to 400) by functional category for three groups: corn and prairie, corn only, prairie only. Categories: amino acid metabolism; carbohydrate metabolism; membrane transport; signal transduction; translation; folding, sorting and degradation; metabolism of cofactors and vitamins; energy metabolism; transport and catabolism; lipid metabolism; transcription; cell growth and death; replication and repair; xenobiotics biodegradation and metabolism; nucleotide metabolism; glycan biosynthesis and metabolism; metabolism of terpenoids and polyketides; cell motility]

Big data combined with microbiology will
change lives.

The health and stability of the gut
microbiome (in response to diet change)
University of Chicago: Daina Ringus, PhD & Eugene Chang, MD
Experiment Design
Data Generation
Workflow / Tools
Data analysis
Applied Solutions

Interactions between the
microbiome and the environment
Source: Zhao, 2013
Obesity
Intestinal inflammation
IBD
Diet has a greater potential to shape the structure and function of the gut microbiome than host genetics.
Direct influence on health state

How resilient is the microbiome?
In humans, microbiome rapidly and reproducibly recovers within 2 days (2013)
In mice, rapid recovery from long term shift to obesity-inducing diet (2012)

Is the gut community going viral?
Reyes et al., Nature Reviews Microbiology, 2012
Who is in the gut microbiome?
[Figure labels: bacterial cells; bacterial cells infected with bacteriophage; prophage]
Viruses (Bacteriophage)
• Vary by individual (Minot et al., 2011)
• Altered by diet and co-vary with bacteria (Minot et al., 2011)
• Long term stable (Minot et al., 2013)
• Largely temperate (Reyes et al., 2013)


Research Questions
• What are the impacts of different diets on gut microbiome response?
• What are the impacts of viruses in the gut microbiome (rapid alteration and resilient response)?
• Multidisciplinary approach combining
  • novel experimental targeting of both bacterial and viral communities
  • metagenomic-based sequencing to characterize community

Novel experimental design – targeted sampling of community fractions
Microbiome through faecal matter (non-destructive sampling)
I. Total DNA (bacteria + prophage + viruses): TOT
II. Virus-like particles (free-living viruses): VLP
III. Induced prophage: IND
Separation steps shown: separation by density; chemical separation; separation by size

Two baseline diets (with a perturbation)
Low-fat (LF) baseline diet: LF / 10% Fat / Complex Carbs
Milk-fat (MF) baseline diet: MF / 37% Fat / Simple Sugars
[Figure: timeline by age (wk), 4 to 14, with phases Baseline, Diet Switch, and Washout (Return to Baseline); each group is switched to the other diet and then returned to its baseline diet; fecal samples collected]
Total community function: TOT metagenomic sequencing at weeks 8, 11, 14
Virome community function: VLP, IND metagenomic sequencing at weeks 8, 11, 14
Weight of mice and count of VLPs with microscopy
Taxonomy analysis (only 16S rRNA gene) every week from week 8 to 14.

Outcomes?
Qualitative and Quantitative Measurements:
Who is there? What are they doing?
How much?

How does the community change over time?
Measurements of gene abundance profile (200,000+ genes) reduced to a single distance measurement from the original community (ordination)
[Figure: distance from baseline across Baseline, Intervention, and Washout phases; possible trajectories: No Change, Altered-Recovery, Altered-Altered]
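Reducing a gene abundance profile to one distance from baseline can be sketched as below. The slides do not name the distance measure, so Bray-Curtis dissimilarity is an assumption here, and the four-gene profiles are invented toy counts for illustration only, not data from the study.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles (0 = identical)."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


# Invented toy counts for four genes; a real profile has 200,000+ genes.
profiles = {
    "baseline": [10, 20, 30, 40],
    "intervention": [40, 30, 20, 10],
    "washout": [12, 18, 30, 40],
}

distances = {
    phase: bray_curtis(profiles["baseline"], profile)
    for phase, profile in profiles.items()
}
for phase, dist in distances.items():
    print(f"{phase}: {dist:.2f}")
```

In this toy case the distance rises during the intervention and falls back near zero at washout, the pattern the slide calls Altered-Recovery.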

Rapid and resilient bacterial gut
response after diet alteration
[Figure: *** significance marker]

Diet-specific functional total community recovery (mostly bacterial)
[Figure: values (axis 0.00 to 0.10) at Baseline, Diet Perturbed, and Washout; *** significance marker]

Free living viruses in MF baseline are significantly altered without recovery.
[Figure: axis 0.0 to 0.3; *** significance marker]

Prophages in MF baseline are significantly altered without recovery.
[Figure: axis 0.0 to 0.3]

“Combat Zone” as diets change
Milk-fat baseline (MF) mice have contrasting bacterial and viral responses, in
which there is not a rapid recovery of viral communities

Viral functions significantly changed during the milk fat baseline diet
Decreases in
Phage-related (p=0.01)
Iron acquisition (p<0.01)
Nucleotide metabolism (p=0.02)
Carbohydrate metabolism (p=0.01)
Motility and chemotaxis (p=0.03)
Virulence and defense (p=0.03)
[Figure: panels for Phage, Iron, Nucleotide, Carbs, and Flagella across Baseline, Change, and Washout]

Significant decrease in genes associated with MF baseline viruses
• Bacteroides (Bacteroidetes)
• Clostridium (Firmicutes)
• Eubacterium (Firmicutes)
Ratio of Firmicutes and Bacteroidetes associated with obesity
Turnbaugh, 2008; Turnbaugh, 2009
Image sources: Bacteroides fragilis, Nutridesk.com; C. difficile, Bioquell.ie; National Geographic

Viromes potentially critical in gut microbiome response.
• Members of the gut microbiome community do not have co-occurring responses.
• Loss of viral population and diversity is diet specific (related to a milk-fat to low-fat diet transition)

The ability of viruses to redirect the structure and function of the
microbiome makes them pivotal drivers of health and
disease

Virome directly causes host response
Germ-free 11-week-old mice (n = 3)
Diet: Standard chow
3-week conventionalization
A “standard control” microbiome: uniform cecal content of standard chow mice
Experimentally introduced viruses
Mouse Treatment 1: Low-fat baseline VLP
Mouse Treatment 2: Milk-fat baseline VLP
Control: Buffer

Significant decrease of intestinal inflammation in LF VLP treatments
Pro-inflammatory cytokines in mucosal scrapings: TNF-α, IFN-γ
[Figure: two proximal colon panels comparing Control, LF VLPs, and MF VLPs. TNF-alpha (axis labeled ng/gl, 0 to 15) and IFN-gamma (ng/g, 0 to 30); * marks a significant difference]

Conclusions
• Gut microbiome has reproducible and distinct responses to diet.
• Viruses have a unique response to diet perturbations and do not co-occur with bacteria.
• Viruses observed to cause inflammation in infected germ-free mice.
• Big data workflow enabled strategic sampling design providing unparalleled access to viruses of the gut microbiome

Data-discovery is a national
investment.

Data-driven biological investigations
MICROBES IN ECOSYSTEMS
NATURE: WATER, SOIL
MICROBIOMES: HUMANS/ANIMAL
ENGINEERED: WASTEWATER
High Throughput Frameworks: Metagenomic, Metatranscriptomic, Metaproteomic
More relevant model systems
Improved biomarkers
Scaling approaches
Big data computation
Data driven discovery

Core research values
• Research that matters
• Developing scientific frameworks that enable open-science initiatives (reproducible science)
• Computational and experimental integration
• Scale and power to multi-disciplinary approaches
• Team value
• Flexibility

Going viral: The role of the human gut phageome in inflammatory bowel disease
What is the role of host-phage dynamics in the development of intestinal diseases?
Objectives:
• Define and compare core phageomes associated with healthy and diseased gut microbiomes
• Determine impact of disease-associated gut phageomes on development of disease in knockout mouse models (predisposed to disease)
NIH, National Institute of Diabetes and Digestive and Kidney Diseases; National Institute of Allergy and Infectious Diseases ($3-5M)
Integration of multiple datasets
Improved model systems and biomarkers
Source: Nature.com

Microbial drivers of carbon metabolism and warming
How do microbes contribute to carbon cycling models?
DOE Biological and Environmental Research ($3M/3 years, 40% PI with ISU Kirsten Hofmockel, 2013-2016)
Contributions:
• Omic-based characterization of carbon cycling microorganisms in the soil
• Novel approaches to target carbon cycling subsets of community
• Improved soil genomic databases to enable future carbon studies
Big data scaling
Integration of multiple datasets
Source: Oak Ridge National Laboratory

Large-scale characterization of global dark matter proteins in complex biological environments
How do we access the novelty observed in metagenomic datasets?
NIH – Development of Software and Analysis Methods for Biomedical Big Data in Targeted Areas of High Need (~$1M/3 years)
Gordon and Betty Moore Foundation – Data Driven Discovery Investigator Awards ($1.5M / 5 years)
Novel extension of current software tools:
• Integration of growing volumes of global public datasets with scalable data-mining analysis
• Lightweight data architecture to compare abundance and co-occurrence of sequencing patterns across multiple samples and associated metadata to elucidate information
Big data scaling
Integration of datasets

From field to food: The origin and fate of our microbiomes
Where do harmful microbes in our food come from and how do we protect ourselves from them?
USDA Agriculture and Food Research Initiative ($1-2.5M)
• Identify and characterize under-researched foodborne microbial hazards and effective control strategies
• Elucidate fate and dissemination of foodborne microbial hazards associated with produce production and processing
Source: aboretum.umn.edu
Integration of multiple datasets
Improved model systems and biomarkers

Acknowledgements
• Funding
  • DOE Microbial Carbon Cycling Grant
  • NSF Postdoc Fellowship, Great Lakes Bioenergy Research Center
  • Microbiome: University of Chicago Digestive Diseases Research Core Pilot and Feasibility Grant
• My Awesome INTER-DISCIPLINARY Team
  • C. Titus Brown (MSU) + lab (Bioinformatics)
  • James Tiedje (MSU) + lab (Microbial Ecology)
  • Daina Ringus (UC) (Microbiology / Mice)
  • Kirsten Hofmockel, Ryan Williams, Fan Yang (ISU)
  • Eugene Chang (UC)
  • Folker Meyer (ANL)

Reducing data, not information.
More efficient data storage and mining.
Big data scaling approaches

Storage of biological big data
• What other sequences are connected to Sequence X?
• Data broken into words of length “k” (k-mers)
• Overlap (for assembly) = shared “word”
Pell et al., 2012, PNAS
Howe et al., 2014, PNAS
AGTCAGTT
Into its 4-mers:
AGTC
GTCA
TCAG
CAGT
AGTT
AGAAAGTC
Into its 4-mers:
AGAA
GAAA
AAAG
AAGT
AGTC
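Breaking a sequence into k-mers and finding the shared word can be sketched in a few lines. The function name is illustrative; the two sequences are the ones from the slide.

```python
def kmers(sequence, k):
    """All overlapping words of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


first = kmers("AGTCAGTT", 4)
second = kmers("AGAAAGTC", 4)
shared = sorted(set(first) & set(second))

print(first)   # ['AGTC', 'GTCA', 'TCAG', 'CAGT', 'AGTT']
print(second)  # ['AGAA', 'GAAA', 'AAAG', 'AAGT', 'AGTC']
print(shared)  # ['AGTC']
```

The single shared word AGTC is the overlap that connects the two sequences for assembly.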

Storage of biological big data
• Data broken into words of length “k” (k-mers)
• How do we store “big data” words?
• Bloom filter data structure
• Efficient storage

Do I have mail?
• Data broken into bins of word length “k” (k-mers)
• Mailbox analogy
A-G H-R S-Z
Pell et al., 2012, PNAS
Howe et al., 2014, PNAS

Do I have mail?
• Is Sequence A connected to Sequence B?
• Mailbox analogy – Efficient storage of information
A-G H-R S-Z
A-G* H-R S-Z
No mail for Howe, 100% sure.
A-G H-R* S-Z
Possibly mail for Howe.
Pell et al., 2012, PNAS
Howe et al., 2014, PNAS

Do I have mail?
A-G H-R S-Z
A-G H-R* S-Z
A-F G-N* O-T U-Z
A-C D-H* I-O P-Z
Howe mail status:
Mail possibility higher.

Do I have mail?
A-G H-R S-Z
A-G H-R* S-Z
A-F G-N* O-T U-Z
A-C D-H I-O P-Z
Howe mail status:
No mail, 100% sure.

Bloom filter data structure
• “Probabilistic” data structure
• Decrease of false positive rate with multiple Bloom filters – “More likely I have mail”
• No false negatives – “No mail. 100% sure”
• For the win: efficiently detects and counts the presence of sequences (k-mers) and their connectivity
• Is sequence A connected to sequence B?
Pell et al., 2012, PNAS
Howe et al., 2014, PNAS
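A Bloom filter along these lines can be sketched as follows. This is a minimal illustration, not the khmer data structure: the class name, the table size, the number of hash functions, and the use of seeded SHA-256 hashes are all assumptions, and this version only records presence (it does not count).

```python
import hashlib


class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per slot, for readability

    def _positions(self, item):
        # Each seed acts as a separate row of mailboxes in the analogy.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Every slot must be set; a single empty slot means "100% sure: no".
        return all(self.bits[pos] for pos in self._positions(item))


words = ["AGTC", "GTCA", "TCAG", "CAGT", "AGTT"]
bloom = BloomFilter()
for word in words:
    bloom.add(word)

print("AGTC" in bloom)  # stored words are always found
print("GGGG" in bloom)  # an unseen word is almost always rejected
```

Memory use is fixed by the table size rather than by the number of k-mers stored, which is the efficiency the slide refers to; the price is a small, tunable false positive rate.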
