RIDING THE BIG DATA TIDAL WAVE IN MODERN MICROBIOLOGY
IOWA STATE UNIVERSITY
MARCH 12, 2014
Adina Howe, PhD
Outline of talk
My multi-discipline career
Biological sequencing: a game changer
Research – computational focus:
How to handle “big data” in biology
Research – biological focus:
The gut microbiome’s role in obesity?
Future research:
A flexible toolbox in a big playground
Background
Purdue University, BSME,
Mechanical Engineering
Purdue University, MS,
Environmental Engineering
(Sustainability)
University of Iowa, PhD,
Environmental Engineering
(Microbiology/Bioremediation)
Michigan State University
NSF Postdoc Math and Biology Fellow (cross-training)
Microbial Ecology (Jim Tiedje)
Bioinformatics (Titus Brown)
Computational Biologist
Microbiology / Microbial Ecology
Our shared challenges
Climate Change
Energy Supply
USGCRP 2009
www.alutiiq.com
http://guardianlv.com/
Human Health
An understanding
of microbial ecology
Environmental continuum
MICROBES
IN
ECOSYSTEMS
NATURE
AIR
WATER
SOIL
MICROBIOMES
HUMANS/ANIMAL
ENGINEERED
BIOREACTORS
WASTEWATER
Understanding community dynamics
• Who is there?
• What are they doing?
• How are they doing it?
Kim Lewis, 2010
Gene / Genome Sequencing
• Collect samples
• Extract DNA
• Sequence DNA
• “Analyze” DNA to identify its content and origin
Taxonomy (e.g., pathogenic E. coli)
Function (e.g., degrades cellulose)
Cost of Sequencing
Stein, Genome Biology, 2010
E. coli genome 4,500,000 bp ($4.5M, 1992)
[Figure: DNA sequencing, Mbp per $ (log scale, 0.1 to 100,000,000), by year, 1990 to 2012]
Rapidly decreasing costs with NGS
Stein, Genome Biology, 2010
Next Generation Sequencing
4,500,000 bp (E. coli, $200, presently)
[Figure: DNA sequencing, Mbp per $ (log scale, 0.1 to 100,000,000), by year, 1990 to 2012]
Effects of low cost sequencing…
First free-living bacterium sequenced for billions of dollars and years of analysis
Personal genome can be mapped in a few days for a few hundred to a few thousand dollars
The experimental continuum
Single Isolate
Pure Culture
Enrichment
Mixed Cultures
Natural systems
The era of big data in biology
Stein, Genome Biology, 2010
Computational Hardware (doubling time 14 months)
Sanger Sequencing (doubling time 19 months)
NGS (Shotgun) Sequencing (doubling time 5 months)
[Figure: disk storage, Mb/$, and DNA sequencing, Mbp per $ (both log scale), by year, 1990 to 2012]
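To make the doubling times quoted on this slide concrete, the short sketch below converts a doubling time into a fold increase over a fixed period. The 60-month window and the helper name `growth_factor` are assumptions chosen for illustration; they do not come from the talk.

```python
def growth_factor(months_elapsed, doubling_time_months):
    """Fold increase after months_elapsed, assuming a constant doubling time."""
    return 2.0 ** (months_elapsed / doubling_time_months)

# Doubling times as quoted on the slide (Stein, Genome Biology, 2010).
doubling_times = {
    "Computational hardware": 14,
    "Sanger sequencing": 19,
    "NGS (shotgun) sequencing": 5,
}

WINDOW_MONTHS = 60  # assumption: an illustrative five-year window

for name, months in doubling_times.items():
    factor = growth_factor(WINDOW_MONTHS, months)
    print(f"{name}: about {factor:,.1f}-fold in {WINDOW_MONTHS} months")
```

Under these assumptions, NGS output per dollar grows 2^12 = 4,096-fold in five years while computational hardware improves roughly 20-fold, which is the gap the rest of the talk addresses.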
Postdoc experience with data
2003-2008 Cumulative sequencing in PhD = 2000 bp
2008-2009 Postdoc Year 1 = 50 Gbp
2009-2010 Postdoc Year 2 = 450 Gbp
Flexibility towards embracing change.
How to survive a data deluge?
Experiment Design
Data Generation
Workflow / Tools
Data analysis
Applied Solutions
Reducing data volume:
Assembly of Metagenomic
Sequences
MSU: C. Titus Brown and James Tiedje
de novo assembly
Compresses dataset size significantly
Improved data quality (longer sequences, gene order)
Reference not necessary (novelty)
Raw sequencing data (“reads”) → Computational algorithms → Informative genes / genome
Metagenome assembly…a scaling
problem.
Shotgun sequencing and de novo
assembly
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
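As a rough illustration of the idea on this slide, the sketch below rebuilds a sentence from overlapping fragments by repeatedly merging the two pieces with the longest suffix-to-prefix overlap. It is a toy under loud assumptions: the three reads are hypothetical and error-free (the reads on the slide contain errors, which real assemblers resolve using coverage), the function names `overlap` and `greedy_assemble` are invented for this example, and production metagenome assemblers use graph-based methods rather than this quadratic greedy loop.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlap of at least min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing left to merge; several contigs remain
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)] + [merged]
    return contigs

# Hypothetical error-free reads, given in shuffled order.
reads = ["s the best of", "best of times", "It was the be"]
print(greedy_assemble(reads))
```

The `min_len` threshold is a design choice: without it, a one-character coincidence (the final "s" of one read matching the first "s" of another) would count as an overlap, which hints at why repeats make real assembly hard.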
Practical Challenges – Intensive computing
Howe et al, 2014, PNAS
Months of “computer crunching” on a supercomputer
Assembly of 300 Gbp can be
done with any assembly program
in less than 14 GB RAM and less
than 24 hours.
Natural community characteristics
• Diverse
  • Many organisms (genomes)
• Variable abundance
  • Most abundant organisms, sampled more often
  • Assembly requires a minimum amount of sampling
  • More sequencing, more errors
[Figure panels: Sample 1x, Sample 10x, Overkill]
Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
• Reduces dataset size for assembly by up to 95%, with the same assembly outputs.
• Genomes, mRNA-seq, metagenomes (soils, gut, water)
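A minimal sketch of the idea behind digital normalization, assuming the streaming rule described in Brown et al., 2012: a read is kept only while the median count of its k-mers, among reads kept so far, is below a coverage cutoff. The values `k=4` and `cutoff=2` and the example reads are toy assumptions, and an exact `Counter` stands in for the memory-efficient probabilistic counting that the khmer software uses.

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    """All overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only if its estimated coverage so far is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k
        # Median k-mer count is used as the read's coverage estimate.
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)
    return kept

# Hypothetical sample: one abundant sequence seen five times, one rare sequence seen once.
reads = ["AGTCAGTT"] * 5 + ["AGAAAGTC"]
print(digital_normalization(reads))
```

Here the redundant copies of the abundant read are discarded once its coverage reaches the cutoff, while the rare read is retained even though it arrives last.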
Partitioning (khmer software)
Pell et al, 2012, PNAS
Howe et al., 2014, PNAS
• Separates metagenomes by species
• Parallel computing possible
• Largest known published soil metagenome and assembly
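The partitioning idea can be sketched as grouping reads that are connected, directly or transitively, through shared k-mers, so that each group can be assembled independently and in parallel. This is a simplified stand-in, not the khmer implementation: it uses an exact dictionary and a union-find structure, and the function name `partition_reads`, the value `k=4` and the reads are hypothetical.

```python
def partition_reads(reads, k=4):
    """Group reads into connected components linked by shared k-mers."""
    parent = list(range(len(reads)))

    def find(x):
        # Find the representative of x's group, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            if word in first_seen:
                parent[find(idx)] = find(first_seen[word])  # join the two groups
            else:
                first_seen[word] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return list(groups.values())

# Hypothetical reads: two pairs that share no k-mers with each other.
reads = ["AGTCAGTT", "AGAAAGTC", "CCCCGGGG", "GGGGTTTT"]
for group in partition_reads(reads):
    print(group)
```

Each resulting group shares no k-mer with any other group, which is what makes separate, parallel assembly possible.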
Tackling Soil Biodiversity
Source: Chuck Hane
• Grand Challenge effort – 10% of soil biodiversity sampled
• Incredible soil biodiversity (estimated 10 Tbp required per sample)
• “To boldly go where no man has gone before”: >60% Unknown
[Figure: total count (0 to 400) of KO by functional category, for three groups: corn and prairie, corn only, prairie only. Categories: amino acid metabolism; carbohydrate metabolism; membrane transport; signal transduction; translation; folding, sorting and degradation; metabolism of cofactors and vitamins; energy metabolism; transport and catabolism; lipid metabolism; transcription; cell growth and death; replication and repair; xenobiotics biodegradation and metabolism; nucleotide metabolism; glycan biosynthesis and metabolism; metabolism of terpenoids and polyketides; cell motility]
Howe et al, 2014, PNAS
Big data combined with microbiology will change lives.
The health and stability of the gut
microbiome (in response to diet change)
University of Chicago: Daina Ringus, PhD & Eugene Chang, MD
Experiment Design
Data Generation
Workflow / Tools
Data analysis
Applied Solutions
We are supraorganisms
Interactions between the
microbiome and the environment
Source: Zhao, 2013
Obesity
Intestinal inflammation
IBD diseases
Diet has a greater potential to shape the structure and function of the gut microbiome than host genetics.
Direct influence on health state
How resilient is the microbiome?
In humans, microbiome rapidly and reproducibly recovers within 2 days (2013)
In mice, rapid recovery from long term shift to obesity-inducing diet (2012)
Is the gut community going viral?
Reyes et al, Nature Reviews Microbiology, 2012
[Figure labels: Bacterial cells; Bacterial cells infected with bacteriophage; Prophage]
Viruses (Bacteriophage)
• Vary by individual (Minot et al., 2011)
• Altered by diet and co-vary with bacteria (Minot et al., 2011)
• Long term stable (Minot et al., 2013)
• Largely temperate (Reyes et al., 2013)
Who is in the gut microbiome?
Research Questions
• What are the impacts of different diets on gut microbiome response?
• What are the impacts of viruses in the gut microbiome (rapid alteration and resilient response)?
• Multidisciplinary approach combining
  • novel experimental targeting of both bacterial and viral communities
  • metagenomic-based sequencing to characterize community
Novel experimental design – targeted sampling of community fractions
I. Total DNA (bacteria + prophage + viruses): TOT
II. Virus-like particles (free-living viruses): VLP
III. Induced prophage: IND
Separation by density
Chemically separate
Separation by size
Microbiome through faecal matter (non-destructive sampling)
Two baseline diets (with a perturbation)
Low-fat (LF) baseline diet: LF / 10% Fat / Complex Carbs
Milk-fat (MF) baseline diet: MF / 37% Fat / Simple Sugars
Age (wk): 4 5 6 7 8 9 10 11 12 13 14
Phases: Baseline, Diet Switch, Washout (Return to Baseline)
Diet by phase: LF baseline mice: LF, MF, LF; MF baseline mice: MF, LF, MF
Fecal Samples
Total community function: TOT metagenomic sequencing at weeks 8, 11, 14
Virome community function: VLP, IND metagenomic sequencing at weeks 8, 11, 14
Weight of mice and count of VLPs with microscopy
Taxonomy analysis (only 16S rRNA gene) every week from week 8 to 14.
Outcomes?
Qualitative and Quantitative Measurements:
Who is there? What are they doing?
How much?
How does the community change over time?
Measurements of gene abundance profile (200,000+ genes) reduced to a single distance measurement from the original community (ordination)
[Figure: distance from baseline across Baseline, Intervention and Washout, for three possible outcomes: Altered-Recovery, Altered-Altered, No Change]
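The reduction of a whole gene abundance profile to one distance from the baseline community can be sketched as follows. The slides do not name the distance metric, so Bray-Curtis dissimilarity is used here purely as an assumption (it is a common choice in community ecology), and the four-gene profiles are hypothetical stand-ins for the 200,000+ genes measured.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles (0 = identical, 1 = disjoint)."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0

# Hypothetical gene abundance profiles at the three sampling phases.
profiles = {
    "Baseline": [10, 20, 30, 40],
    "Intervention": [40, 30, 20, 10],
    "Washout": [12, 18, 33, 37],
}

baseline = profiles["Baseline"]
for phase, profile in profiles.items():
    print(f"{phase}: distance from baseline = {bray_curtis(baseline, profile):.2f}")
```

With these made-up numbers the distance rises during the intervention and falls back near zero at washout, which corresponds to the "Altered-Recovery" outcome; a distance that stays high at washout would correspond to "Altered-Altered".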
Rapid and resilient bacterial gut response after diet alteration
[Figure: distance from baseline across Baseline, Intervention, Washout; marked ***]
Diet-specific functional total community recovery (mostly bacterial)
[Figure: distance from baseline (axis 0.00 to 0.10) across Baseline, Diet Perturbed, Washout; marked ***]
Free living viruses in MF baseline are significantly altered without recovery.
[Figure: distance from baseline (axis 0.0 to 0.3) across Baseline, Diet Perturbed, Washout; marked ***]
Prophages in MF baseline are significantly altered without recovery.
[Figure: distance from baseline (axis 0.0 to 0.3) across Baseline, Diet Perturbed, Washout]
“Combat Zone” as diets change
Milk-fat baseline (MF) mice have contrasting bacterial and viral responses, in
which there is not a rapid recovery of viral communities
Viral functions significantly changed during the milk fat baseline diet
Decreases in
Phage-related (p=0.01)
Iron acquisition (p<0.01)
Nucleotide metabolism (p=0.02)
Carbohydrate metabolism (p=0.01)
Motility and chemotaxis (p=0.03)
Virulence and defense (p=0.03)
[Figure panels: Phage, Iron, Nucleotide, Carbs, Flagella; x-axis: Baseline, Change, Washout]
Significant decrease in genes associated with MF baseline viruses
• Bacteroides (Bacteroidetes)
• Clostridium (Firmicutes)
• Eubacterium (Firmicutes)
Ratio of Firmicutes and Bacteroidetes associated with obesity
Turnbaugh, 2008
Turnbaugh, 2009
Image sources: Bacteroides fragilis, Nutridesk.com; C. difficile, Bioquell.ie; National Geographic
Viromes potentially critical in gut microbiome response.
• Members of gut microbiome community do not have co-occurring responses.
• Loss of viral population and diversity is diet specific (related to a milk-fat to low-fat diet transition)
Ability to redirect structure and function of microbiome makes them pivotal drivers of health and disease
Reyes et al, Nature Reviews Microbiology, 2012
Virome directly causes host response
Germ Free 11 week old mice (n = 3)
Diet: Standard chow
3 week conventionalization
A “standard control” Microbiome: Uniform cecal content of standard chow mice
Experimentally introduced viruses
Mouse Treatment 1: Low-fat baseline VLP
Mouse Treatment 2: Milk-fat baseline VLP
Control: Buffer
Significant decrease of intestinal inflammation in LF VLP treatments
Pro-inflammatory cytokines in mucosal scrapings: TNF-α, IFN-γ
[Figure: two bar charts for the proximal colon, TNF-alpha (ng/gl, axis 0 to 15) and IFN-gamma (ng/g, axis 0 to 30), each comparing Control, LF VLPs and MF VLPs; one comparison marked *]
Conclusions
• Gut microbiome has reproducible and distinct responses to diet.
• Viruses have a unique response to diet perturbations and do not co-occur with bacteria.
• Viruses observed to cause inflammation in infected germ free mice.
• Big data workflow enabled strategic sampling design providing unparalleled access to viruses of gut microbiome
Future work
Data-discovery is a national
investment.
Data-driven biological
investigations
MICROBES
IN
ECOSYSTEMS
NATURE
WATER
SOIL
MICROBIOMES
HUMANS/ANIMAL
ENGINEERED
WASTEWATER
High Throughput Frameworks:
Metagenomic
Metatranscriptomic
Metaproteomic
More relevant model
systems
Improved biomarkers
Scaling approaches
Big data computation
Data driven discovery
Core research values
• Research that matters
• Developing scientific frameworks that enable open-science initiatives (reproducible science)
• Computational and experimental integration
• Scale and power to multi-disciplinary approaches
• Team value
• Flexibility
Going viral: The role of the human gut phageome in inflammatory bowel disease
Objectives:
• Define and compare core phageomes associated with healthy and diseased gut microbiomes
• Determine impact of disease-associated gut phageomes on development of disease in knockout mouse models (predisposed to disease)
NIH, National Institute of Diabetes and Digestive and Kidney Diseases; National Institute of Allergy and Infectious Diseases ($3-5M)
Source: Nature.com
What is the role of host-phage
dynamics in the development of
intestinal diseases?
Integration of multiple datasets
Improved model systems and
biomarkers
Microbial drivers of carbon metabolism and
warming
DOE Biological and Environmental
Research ($3M/3 years, 40% PI with
ISU Kirsten Hofmockel, 2013-2016)
Source: Oak Ridge National Laboratory
Contributions:
• Omic-based characterization of carbon cycling microorganisms in the soil
• Novel approaches to target carbon cycling subsets of community
• Improved soil genomic databases to enable future carbon studies
How do microbes contribute to carbon cycling models?
Big data scaling
Integration of multiple
datasets
Large-scale characterization of global dark
matter proteins in complex biological
environments
NIH – Development of Software and Analysis Methods for Biomedical
Big Data in Targeted Areas of High Need
(~$1M/3 years)
Gordon and Betty Moore – Data Driven Discovery Investigator Awards
($1.5M / 5 years)
Novel extension of current software tools:
• Integration of growing volumes of global public datasets with scalable
data-mining analysis
• Lightweight data architecture to compare abundance and co-
occurrence of sequencing patterns across multiple samples and
associated metadata to elucidate information
How do we access the novelty observed in metagenomic datasets?
Big data scaling
Integration of datasets
From field to food: The origin and
fate of our microbiomes
USDA Agriculture and Food Research Initiative ($1-2.5M)
• Identify and characterize under-researched foodborne microbial hazards and effective control strategies
• Elucidate fate and dissemination of foodborne microbial hazards associated with produce production and processing
Source: aboretum.umn.edu
Where do harmful microbes in our food come from and how do we protect ourselves from them?
Integration of multiple datasets
Improved model systems and biomarkers
Acknowledgements
• Funding
  • DOE Microbial Carbon Cycling Grant
  • NSF Postdoc Fellowship, Great Lakes Bioenergy Research Center
  • Microbiome: University of Chicago Digestive Diseases Research Core Pilot and Feasibility Grant
• My Awesome INTER-DISCIPLINARY Team
  • C. Titus Brown (MSU) + lab (Bioinformatics)
  • James Tiedje (MSU) + lab (Microbial Ecology)
  • Daina Ringus (UC) (Microbiology / Mice)
  • Kirsten Hofmockel, Ryan Williams, Fan Yang (ISU)
  • Eugene Chang (UC)
  • Folker Meyer (ANL)
Questions?
Reducing data, not information.
More efficient data storage and mining.
Big data scaling approaches
Storage of biological big data
• What other sequences are connected to Sequence X?
• Data broken into words of length “k” (k-mers)
• Overlap (for assembly) = shared “word”
Pell et al., 2012, PNAS
Howe et al., 2014, PNAS
AGTCAGTT into its 4-mers: AGTC, GTCA, TCAG, CAGT, AGTT
AGAAAGTC into its 4-mers: AGAA, GAAA, AAAG, AAGT, AGTC
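The k-mer decomposition on this slide can be reproduced in a few lines. The sketch below splits the two example sequences into 4-mers and reports the shared "word" that connects them; the helper name `kmers` is an assumption introduced for this example.

```python
def kmers(seq, k):
    """All overlapping words of length k in seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

read_a = kmers("AGTCAGTT", 4)
read_b = kmers("AGAAAGTC", 4)
shared = set(read_a) & set(read_b)

print("AGTCAGTT:", read_a)
print("AGAAAGTC:", read_b)
print("shared words:", shared)
```

A sequence of length n yields n - k + 1 k-mers, so each 8-base example gives five 4-mers, and the two sequences are connected through the single shared word AGTC.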
Storage of biological big data
• How do we store “big data” words?
  • Bloom filter data structure
  • Efficient storage
Do I have mail?
• Is Sequence A connected to Sequence B?
• Data broken into bins of word length “k” (k-mers)
• Overlap (for assembly) = shared “word”
• How do we store “big data” words?
  • Bloom filter data structure
  • Mailbox analogy – Efficient storage of information
Pell et al., 2012, PNAS
Howe et al., 2014, PNAS
Mailboxes (* marks a box that holds mail): A-G | H-R | S-Z
A-G* | H-R | S-Z: No mail for Howe, 100% sure.
A-G | H-R* | S-Z: Possibly mail for Howe.
With several sets of mailboxes:
A-G | H-R* | S-Z
G-N* | A-F; O-T | U-Z
D-H* | A-C; I-O | P-Z
Howe mail status: Mail possibility higher.
A-G | H-R* | S-Z
G-N* | A-F; O-T | U-Z
D-H | A-C; I-O | P-Z
Howe mail status: No mail, 100% sure.
Bloom filter data structure
• “Probabilistic” data structure
• Decrease of false positive rate with multiple bloom filters – “More likely I have mail”
• No false negatives – “No mail. 100% sure”
• For the win: both detects and counts presence of sequences (k-mers) and their connectivity efficiently
• Is sequence A connected to sequence B?
Pell et al., 2012, PNAS
Howe et al., 2014, PNAS
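The mailbox analogy maps onto a Bloom filter roughly as sketched below: each hash function plays the role of one set of mailboxes, a k-mer is "possibly present" only if every set says so, and a single empty box proves absence. This is a minimal membership-only sketch, not the khmer implementation; the class name, the default sizes and the use of SHA-256 as the hash are assumptions made for illustration, and it does not count occurrences.

```python
import hashlib

class BloomFilter:
    """Membership-only Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity rather than compactness

    def _positions(self, item):
        # One position per hash function; the seed makes the hashes differ.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for word in ["AGTC", "GTCA", "TCAG", "CAGT", "AGTT"]:  # 4-mers of AGTCAGTT
    bf.add(word)

print("AGTC possibly present:", "AGTC" in bf)
print("bits set:", sum(bf.bits), "of", bf.num_bits)
```

Memory use is fixed by `num_bits` no matter how many k-mers are added; the trade-off is that the false positive rate rises as the filter fills, which is why adding more hash functions ("more sets of mailboxes") or more bits improves confidence in a positive answer.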
Weitere ähnliche Inhalte

Was ist angesagt?

さらば!データサイエンティスト
さらば!データサイエンティストさらば!データサイエンティスト
さらば!データサイエンティストShohei Hido
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceNiko Vuokko
 
最近のRのランダムフォレストパッケージ -ranger/Rborist-
最近のRのランダムフォレストパッケージ -ranger/Rborist-最近のRのランダムフォレストパッケージ -ranger/Rborist-
最近のRのランダムフォレストパッケージ -ranger/Rborist-Shintaro Fukushima
 
내가 대학원에 들어왔을 때 알았더라면 좋았을 연구 노하우 (개정증보판) (UST 대학원 신입생 OT 강연)
내가 대학원에 들어왔을 때 알았더라면 좋았을 연구 노하우 (개정증보판) (UST 대학원 신입생 OT 강연)내가 대학원에 들어왔을 때 알았더라면 좋았을 연구 노하우 (개정증보판) (UST 대학원 신입생 OT 강연)
내가 대학원에 들어왔을 때 알았더라면 좋았을 연구 노하우 (개정증보판) (UST 대학원 신입생 OT 강연)Yoon Sup Choi
 
論文の考察の書き方
論文の考察の書き方論文の考察の書き方
論文の考察の書き方Yosuke Uozumi
 
文献調査をどのように行うべきか?
文献調査をどのように行うべきか?文献調査をどのように行うべきか?
文献調査をどのように行うべきか?Yuichi Goto
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data AnalyticsS P Sajjan
 
Amazonでのレコメンド生成における深層学習とAWS利用について
Amazonでのレコメンド生成における深層学習とAWS利用についてAmazonでのレコメンド生成における深層学習とAWS利用について
Amazonでのレコメンド生成における深層学習とAWS利用についてAmazon Web Services Japan
 
Solr から使う OpenNLP の日本語固有表現抽出
Solr から使う OpenNLP の日本語固有表現抽出Solr から使う OpenNLP の日本語固有表現抽出
Solr から使う OpenNLP の日本語固有表現抽出Koji Sekiguchi
 
匿名加工情報を使えないものか?(改訂版)
匿名加工情報を使えないものか?(改訂版)匿名加工情報を使えないものか?(改訂版)
匿名加工情報を使えないものか?(改訂版)Hiroshi Nakagawa
 
企業における自然言語処理技術利用の最先端
企業における自然言語処理技術利用の最先端企業における自然言語処理技術利用の最先端
企業における自然言語処理技術利用の最先端Yuya Unno
 
そろそろRStudioの話
そろそろRStudioの話そろそろRStudioの話
そろそろRStudioの話Kazuya Wada
 
Big data introduction
Big data introductionBig data introduction
Big data introductionChirag Ahuja
 
내가 대학원에 들어왔을 때 알았더라면 좋았을 연구 노하우
내가 대학원에 들어왔을 때 알았더라면 좋았을 연구 노하우 내가 대학원에 들어왔을 때 알았더라면 좋았을 연구 노하우
내가 대학원에 들어왔을 때 알았더라면 좋았을 연구 노하우 Yoon Sup Choi
 
Linked Open Data勉強会2020 前編:LODの基礎・作成・公開
Linked Open Data勉強会2020 前編:LODの基礎・作成・公開Linked Open Data勉強会2020 前編:LODの基礎・作成・公開
Linked Open Data勉強会2020 前編:LODの基礎・作成・公開KnowledgeGraph
 
Presentation About Big Data (DBMS)
Presentation About Big Data (DBMS)Presentation About Big Data (DBMS)
Presentation About Big Data (DBMS)SiamAhmed16
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
グラフデータベース:Neo4j、そしてRDBからの移行手順の紹介
グラフデータベース:Neo4j、そしてRDBからの移行手順の紹介グラフデータベース:Neo4j、そしてRDBからの移行手順の紹介
グラフデータベース:Neo4j、そしてRDBからの移行手順の紹介ippei_suzuki
 

Was ist angesagt? (20)

さらば!データサイエンティスト
さらば!データサイエンティストさらば!データサイエンティスト
さらば!データサイエンティスト
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
最近のRのランダムフォレストパッケージ -ranger/Rborist-
最近のRのランダムフォレストパッケージ -ranger/Rborist-最近のRのランダムフォレストパッケージ -ranger/Rborist-
最近のRのランダムフォレストパッケージ -ranger/Rborist-
 
내가 대학원에 들어왔을 때 알았더라면 좋았을 연구 노하우 (개정증보판) (UST 대학원 신입생 OT 강연)
내가 대학원에 들어왔을 때 알았더라면 좋았을 연구 노하우 (개정증보판) (UST 대학원 신입생 OT 강연)내가 대학원에 들어왔을 때 알았더라면 좋았을 연구 노하우 (개정증보판) (UST 대학원 신입생 OT 강연)
내가 대학원에 들어왔을 때 알았더라면 좋았을 연구 노하우 (개정증보판) (UST 대학원 신입생 OT 강연)
 
論文の考察の書き方
論文の考察の書き方論文の考察の書き方
論文の考察の書き方
 
文献調査をどのように行うべきか?
文献調査をどのように行うべきか?文献調査をどのように行うべきか?
文献調査をどのように行うべきか?
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data Analytics
 
Amazonでのレコメンド生成における深層学習とAWS利用について
Amazonでのレコメンド生成における深層学習とAWS利用についてAmazonでのレコメンド生成における深層学習とAWS利用について
Amazonでのレコメンド生成における深層学習とAWS利用について
 
Solr から使う OpenNLP の日本語固有表現抽出
Solr から使う OpenNLP の日本語固有表現抽出Solr から使う OpenNLP の日本語固有表現抽出
Solr から使う OpenNLP の日本語固有表現抽出
 
匿名加工情報を使えないものか?(改訂版)
匿名加工情報を使えないものか?(改訂版)匿名加工情報を使えないものか?(改訂版)
匿名加工情報を使えないものか?(改訂版)
 
Data analytics
Data analyticsData analytics
Data analytics
 
企業における自然言語処理技術利用の最先端
企業における自然言語処理技術利用の最先端企業における自然言語処理技術利用の最先端
企業における自然言語処理技術利用の最先端
 
そろそろRStudioの話
そろそろRStudioの話そろそろRStudioの話
そろそろRStudioの話
 
Big data introduction
Big data introductionBig data introduction
Big data introduction
 
論文の書き方入門 2017
論文の書き方入門 2017論文の書き方入門 2017
論文の書き方入門 2017
 
내가 대학원에 들어왔을 때 알았더라면 좋았을 연구 노하우
내가 대학원에 들어왔을 때 알았더라면 좋았을 연구 노하우 내가 대학원에 들어왔을 때 알았더라면 좋았을 연구 노하우
내가 대학원에 들어왔을 때 알았더라면 좋았을 연구 노하우
 
Linked Open Data勉強会2020 前編:LODの基礎・作成・公開
Linked Open Data勉強会2020 前編:LODの基礎・作成・公開Linked Open Data勉強会2020 前編:LODの基礎・作成・公開
Linked Open Data勉強会2020 前編:LODの基礎・作成・公開
 
Presentation About Big Data (DBMS)
Presentation About Big Data (DBMS)Presentation About Big Data (DBMS)
Presentation About Big Data (DBMS)
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
グラフデータベース:Neo4j、そしてRDBからの移行手順の紹介
グラフデータベース:Neo4j、そしてRDBからの移行手順の紹介グラフデータベース:Neo4j、そしてRDBからの移行手順の紹介
グラフデータベース:Neo4j、そしてRDBからの移行手順の紹介
 

Andere mochten auch

Job Talk (2012): University of Western Ontario
Job Talk (2012): University of Western OntarioJob Talk (2012): University of Western Ontario
Job Talk (2012): University of Western OntarioMichael Barbour
 
Job Talk: Research (2013) - Kennesaw State University
Job Talk: Research (2013) - Kennesaw State UniversityJob Talk: Research (2013) - Kennesaw State University
Job Talk: Research (2013) - Kennesaw State UniversityMichael Barbour
 
2008 Osu Job Talk 12 05
2008 Osu Job Talk 12 052008 Osu Job Talk 12 05
2008 Osu Job Talk 12 05johnybaek
 
JHU Job Talk
JHU Job TalkJHU Job Talk
JHU Job Talkjtleek
 
How to talk about your job
How to talk about your jobHow to talk about your job
How to talk about your jobIsadown
 
Sarkari naukri for assistant professor job
Sarkari naukri for assistant professor jobSarkari naukri for assistant professor job
Sarkari naukri for assistant professor jobsarkari naukri
 
Self Sustainable Plants: the contribution of soil-borne beneficial microbes
Self Sustainable Plants: the contribution of soil-borne beneficial microbesSelf Sustainable Plants: the contribution of soil-borne beneficial microbes
Self Sustainable Plants: the contribution of soil-borne beneficial microbesFood and Feed for Wellbeing
 
BIO 130 Physical and Chemical Properties of Soil Lab Quiz
BIO 130 Physical and Chemical Properties of Soil Lab QuizBIO 130 Physical and Chemical Properties of Soil Lab Quiz
BIO 130 Physical and Chemical Properties of Soil Lab QuizJukKols
 
行動互聯網時代的新經濟與創新思維
行動互聯網時代的新經濟與創新思維行動互聯網時代的新經濟與創新思維
行動互聯網時代的新經濟與創新思維Danny Lin
 
Research, strategy, inspiration.
Research, strategy, inspiration.Research, strategy, inspiration.
Research, strategy, inspiration.Semio srl
 
Roehe_Microbiology_Society_2016_Edinburgh_v3a
Roehe_Microbiology_Society_2016_Edinburgh_v3aRoehe_Microbiology_Society_2016_Edinburgh_v3a
Roehe_Microbiology_Society_2016_Edinburgh_v3aRainer Roehe
 
DOAS Online Outreach Strategy, Phase 1
DOAS Online Outreach Strategy, Phase 1DOAS Online Outreach Strategy, Phase 1
DOAS Online Outreach Strategy, Phase 1Seth Stuck
 
Strategy and Collaboration: The Keys to BHL Outreach Success
Strategy and Collaboration: The Keys to BHL Outreach SuccessStrategy and Collaboration: The Keys to BHL Outreach Success
Strategy and Collaboration: The Keys to BHL Outreach Successgduke599
 
ThinkVis 2013 - Content Outreach & Engagement - 02.03.2013
ThinkVis 2013 - Content Outreach & Engagement - 02.03.2013ThinkVis 2013 - Content Outreach & Engagement - 02.03.2013
ThinkVis 2013 - Content Outreach & Engagement - 02.03.2013Pak Hou Cheung
 
Microbial Pathogenesis and Host Immune Response
Microbial Pathogenesis and Host Immune ResponseMicrobial Pathogenesis and Host Immune Response
Microbial Pathogenesis and Host Immune ResponseQIAGEN
 
Stakeholder Outreach and Engagement - Encouraging Use of New Scientific Data
Stakeholder Outreach and Engagement - Encouraging Use of New Scientific DataStakeholder Outreach and Engagement - Encouraging Use of New Scientific Data
Stakeholder Outreach and Engagement - Encouraging Use of New Scientific DataMonica Linnenbrink
 
Content and Outreach strategy
Content and Outreach strategyContent and Outreach strategy
Content and Outreach strategyJenneva Vargas
 

Andere mochten auch (20)

Job Talk (2012): University of Western Ontario
Job Talk (2012): University of Western OntarioJob Talk (2012): University of Western Ontario
Job Talk (2012): University of Western Ontario
 
Job Talk: Research (2013) - Kennesaw State University
Job Talk: Research (2013) - Kennesaw State UniversityJob Talk: Research (2013) - Kennesaw State University
Job Talk: Research (2013) - Kennesaw State University
 
Job Talk
Job TalkJob Talk
Job Talk
 
Oxford Job Talk
Oxford Job TalkOxford Job Talk
Oxford Job Talk
 
Mock Job Talk
Mock Job TalkMock Job Talk
Mock Job Talk
 
2008 Osu Job Talk 12 05
2008 Osu Job Talk 12 052008 Osu Job Talk 12 05
2008 Osu Job Talk 12 05
 
JHU Job Talk
JHU Job TalkJHU Job Talk
JHU Job Talk
 
How to talk about your job
How to talk about your jobHow to talk about your job
How to talk about your job
 
Sarkari naukri for assistant professor job
Sarkari naukri for assistant professor jobSarkari naukri for assistant professor job
Sarkari naukri for assistant professor job
 
Self Sustainable Plants: the contribution of soil-borne beneficial microbes
Self Sustainable Plants: the contribution of soil-borne beneficial microbesSelf Sustainable Plants: the contribution of soil-borne beneficial microbes
Self Sustainable Plants: the contribution of soil-borne beneficial microbes
 
BIO 130 Physical and Chemical Properties of Soil Lab Quiz
BIO 130 Physical and Chemical Properties of Soil Lab QuizBIO 130 Physical and Chemical Properties of Soil Lab Quiz
BIO 130 Physical and Chemical Properties of Soil Lab Quiz
 
行動互聯網時代的新經濟與創新思維
行動互聯網時代的新經濟與創新思維行動互聯網時代的新經濟與創新思維
行動互聯網時代的新經濟與創新思維
 
Research, strategy, inspiration.
Research, strategy, inspiration.Research, strategy, inspiration.
Research, strategy, inspiration.
 
Roehe_Microbiology_Society_2016_Edinburgh_v3a
Roehe_Microbiology_Society_2016_Edinburgh_v3aRoehe_Microbiology_Society_2016_Edinburgh_v3a
Roehe_Microbiology_Society_2016_Edinburgh_v3a
 
DOAS Online Outreach Strategy, Phase 1
DOAS Online Outreach Strategy, Phase 1DOAS Online Outreach Strategy, Phase 1
DOAS Online Outreach Strategy, Phase 1
 
Strategy and Collaboration: The Keys to BHL Outreach Success
Strategy and Collaboration: The Keys to BHL Outreach SuccessStrategy and Collaboration: The Keys to BHL Outreach Success
Strategy and Collaboration: The Keys to BHL Outreach Success
 
ThinkVis 2013 - Content Outreach & Engagement - 02.03.2013
ThinkVis 2013 - Content Outreach & Engagement - 02.03.2013ThinkVis 2013 - Content Outreach & Engagement - 02.03.2013
ThinkVis 2013 - Content Outreach & Engagement - 02.03.2013
 
Microbial Pathogenesis and Host Immune Response
Microbial Pathogenesis and Host Immune ResponseMicrobial Pathogenesis and Host Immune Response
Microbial Pathogenesis and Host Immune Response
 
Stakeholder Outreach and Engagement - Encouraging Use of New Scientific Data
Stakeholder Outreach and Engagement - Encouraging Use of New Scientific DataStakeholder Outreach and Engagement - Encouraging Use of New Scientific Data
Stakeholder Outreach and Engagement - Encouraging Use of New Scientific Data
 
Content and Outreach strategy
Content and Outreach strategyContent and Outreach strategy
Content and Outreach strategy
 

Job Talk Iowa State University Ag Bio Engineering

  • 1. RIDING THE BIG DATA TIDAL WAVE IN MODERN MICROBIOLOGY IOWA STATE UNIVERSITY MARCH 12, 2014 Adina Howe, PhD
  • 2. Outline of talk My multi-discipline career Biological sequencing: a game changer Research – computational focus: How to handle “big data” in biology Research – biological focus: The gut microbiome’s role in obesity? Future research: A flexible toolbox in a big playground
  • 3. Background Purdue University, BSME, Mechanical Engineering Purdue University, MS, Environmental Engineering (Sustainability)
  • 4. Background Purdue University, BSME, Mechanical Engineering Purdue University, MS, Environmental Engineering (Sustainability) University of Iowa, PhD, Environmental Engineering (Microbiology/Bioremediation)
  • 5. Background Purdue University, BSME, Mechanical Engineering Purdue University, MS, Environmental Engineering (Sustainability) University of Iowa, PhD, Environmental Engineering (Microbiology/Bioremediation) Michigan State University NSF Postdoc Math and Biology Fellow (cross-training) Microbial Ecology (Jim Tiedje) Bioinformatics (Titus Brown)
  • 6. Background Purdue University, BSME, Mechanical Engineering Purdue University, MS, Environmental Engineering (Sustainability) University of Iowa, PhD, Environmental Engineering (Microbiology/Bioremediation) Michigan State University NSF Postdoc Math and Biology Fellow (cross-training) Microbial Ecology (Jim Tiedje) Bioinformatics (Titus Brown) Computational Biologist Microbiology / Microbial Ecology
  • 7. Our shared challenges Climate Change Energy Supply USGCRP 2009 www.alutiiq.com http://guardianlv.com/ Human Health An understanding of microbial ecology
  • 9. Understanding community dynamics  Who is there?  What are they doing?  How are they doing it? Kim Lewis, 2010
  • 10. Gene / Genome Sequencing  Collect samples  Extract DNA  Sequence DNA  “Analyze” DNA to identify its content and origin Taxonomy (e.g., pathogenic E. coli) Function (e.g., degrades cellulose)
  • 11. Cost of Sequencing Stein, Genome Biology, 2010 E. coli genome 4,500,000 bp ($4.5M, 1992) [Figure: DNA sequencing, Mbp per $ (log scale), by year, 1990-2012]
  • 12. Rapidly decreasing costs with NGS Sequencing Stein, Genome Biology, 2010 Next Generation Sequencing 4,500,000 bp (E. coli, $200, presently) [Figure: DNA sequencing, Mbp per $ (log scale), by year, 1990-2012]
  • 13. Effects of low cost sequencing… First free-living bacterium sequenced for billions of dollars and years of analysis Personal genome can be mapped in a few days for hundreds to a few thousand dollars
  • 14. The experimental continuum Single Isolate Pure Culture Enrichment Mixed Cultures Natural systems
  • 15. The era of big data in biology Stein, Genome Biology, 2010 Computational Hardware (doubling time 14 months) Sanger Sequencing (doubling time 19 months) NGS (Shotgun) Sequencing (doubling time 5 months) [Figure: disk storage (Mb/$) and DNA sequencing (Mbp per $), log scale, by year, 1990-2012]
  • 16. Postdoc experience with data 2003-2008 Cumulative sequencing in PhD = 2000 bp 2008-2009 Postdoc Year 1 = 50 Gbp 2009-2010 Postdoc Year 2 = 450 Gbp
  • 17. Flexibility towards embracing change. How to survive a data deluge? Experiment Design, Data Generation, Workflow / Tools, Data analysis, Applied Solutions
  • 18. Reducing data volume: Assembly of Metagenomic Sequences MSU: C. Titus Brown and James Tiedje
  • 19. de novo assembly Compresses dataset size significantly Improved data quality (longer sequences, gene order) Reference not necessary (novelty) Raw sequencing data (“reads”) Computational algorithms Informative genes / genome
  • 21. Shotgun sequencing and de novo assembly It was the Gest of times, it was the wor , it was the worst of timZs, it was the isdom, it was the age of foolisXness , it was the worVt of times, it was the mes, it was Ahe age of wisdom, it was th It was the best of times, it Gas the wor mes, it was the age of witdom, it was th isdom, it was tIe age of foolishness It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
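
The reassembly idea on the slide above can be sketched as a toy greedy overlap assembler. This is only an illustration under strong assumptions: reads are error-free (unlike the slide's reads, which contain deliberate errors), and the function names `overlap` and `greedy_assemble` are hypothetical. It is not the assembler used in the talk.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest suffix/prefix overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


contigs = greedy_assemble([
    "It was the best",
    "the best of times",
    "of times, it was the worst",
])
```

Repeated phrases (such as "it was the" in the full sentence) can mislead a greedy merge, which is one reason real assemblers use more careful graph-based methods.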
  • 22. Practical Challenges – Intensive computing Howe et al, 2014, PNAS Months of “computer crunching” on a supercomputer
  • 23. Practical Challenges – Intensive computing Howe et al, 2014, PNAS Months of “computer crunching” on a supercomputer Assembly of 300 Gbp can be done with any assembly program in less than 14 GB RAM and less than 24 hours.
  • 24. Natural community characteristics  Diverse  Many organisms (genomes)
  • 25. Natural community characteristics  Diverse  Many organisms (genomes)  Variable abundance  Most abundant organisms, sampled more often  Assembly requires a minimum amount of sampling  More sequencing, more errors Sample 1x
  • 26. Natural community characteristics  Diverse  Many organisms (genomes)  Variable abundance  Most abundant organisms, sampled more often  Assembly requires a minimum amount of sampling  More sequencing, more errors Sample 1x Sample 10x
  • 27. Natural community characteristics  Diverse  Many organisms (genomes)  Variable abundance  Most abundant organisms, sampled more often  Assembly requires a minimum amount of sampling  More sequencing, more errors Sample 1x Sample 10x Overkill
  • 28. Digital normalization Brown et al., 2012, arXiv Howe et al., PNAS, 2014
  • 29. Digital normalization Brown et al., 2012, arXiv Howe et al., PNAS, 2014
  • 30. Digital normalization Brown et al., 2012, arXiv Howe et al., PNAS, 2014
  • 31. Digital normalization Brown et al., 2012, arXiv Howe et al., PNAS, 2014
  • 32. Digital normalization Brown et al., 2012, arXiv Howe et al., PNAS, 2014
  • 33. Digital normalization Brown et al., 2012, arXiv Howe et al., 2014, PNAS  Reduces dataset size for assembly by up to 95% with the same assembly outputs.  Genomes, mRNA-seq, metagenomes (soils, gut, water)
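
A minimal sketch of the digital normalization idea, as described in Brown et al. (2012): a read is kept only if the median count of its k-mers seen so far is below a coverage cutoff. The names `digital_normalization`, `k`, and `cutoff` are illustrative assumptions, and an exact `Counter` stands in for the memory-efficient probabilistic counting used in practice.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only while its estimated coverage is still below the cutoff."""
    counts = Counter()  # exact counts; an assumption made for clarity
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k
        # Median k-mer count approximates how often this region was already sampled.
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)
    return kept


# Ten copies of an abundant read and one rare read.
reads = ["AGTCAGTT"] * 10 + ["CCGGAATT"]
kept = digital_normalization(reads, k=4, cutoff=2)
```

In this toy input the abundant read is kept twice and then discarded, while the rare read is retained, which is the "overkill" reduction the preceding slides describe.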
  • 34. Partitioning (khmer software) Pell et al, 2012, PNAS Howe et al., 2014, PNAS  Separates metagenomes by species  Parallel computing possible  Largest known published soil metagenome and assembly
  • 36. Tackling Soil Biodiversity  Grand Challenge effort – 10% of soil biodiversity sampled  Incredible soil biodiversity (estimated to require 10 Tbp/sample)  “To boldly go where no man has gone before”: >60% Unknown [Figure: total KO count by functional category (amino acid metabolism, carbohydrate metabolism, membrane transport, signal transduction, translation, folding, sorting and degradation, metabolism of cofactors and vitamins, energy metabolism, transport and catabolism, lipid metabolism, transcription, cell growth and death, replication and repair, xenobiotics biodegradation and metabolism, nucleotide metabolism, glycan biosynthesis and metabolism, metabolism of terpenoids and polyketides, cell motility) for corn and prairie, corn only, and prairie only] Howe et al, 2014, PNAS
  • 37. Big data combined with microbiology will change lives.
  • 38. The health and stability of the gut microbiome (in response to diet change) University of Chicago: Daina Ringus, PhD & Eugene Chang, MD Experiment Design, Data Generation, Workflow / Tools, Data analysis, Applied Solutions
  • 40. Interactions between the microbiome and the environment Source: Zhao, 2013 Obesity Intestinal inflammation IBD diseases Diet has a greater potential to shape the structure and function of the gut than host genetics. Direct influence on health state
  • 41. How resilient is the microbiome? In mice, recovery from long term shift to obesity-inducing diet In humans, microbiome rapidly and reproducibly recovers within 2 days (2013) In mice, rapid recovery from long term shift to obesity-inducing diet (2012)
  • 42. Is the gut community going viral? Reyes et al, Nature Reviews Microbiology, 2012 Bacterial cells Bacterial cells infected with bacteriophage Viruses (Bacteriophage)  Vary by individual (Minot et al., 2011)  Altered by diet and co-vary with bacteria (Minot et al., 2011)  Long term stable (Minot et al., 2013)  Largely temperate (Reyes et al., 2013) Prophage Who is in the gut microbiome?
  • 43. Is the gut community going viral? Reyes et al, Nature Reviews Microbiology, 2012
  • 44. Is the gut community going viral? Reyes et al, Nature Reviews Microbiology, 2012
  • 45. Is the gut community going viral? Reyes et al, Nature Reviews Microbiology, 2012
  • 46. Research Questions  What are the impacts of different diets on gut microbiome response?  What are the impacts of viruses in the gut microbiome (rapid alteration and resilient response?)  Multidisciplinary approach combining  novel experimental targeting of both bacterial and viral communities  metagenomic-based sequencing to characterize community
  • 47. Novel experimental design – targeted sampling of community fractions I. Total DNA (bacteria + prophage + viruses) TOT II. Virus-like particles (free-living viruses) VLP III. Induced prophage IND Separation by density Chemically separate Separation by size Microbiome through faecal matter (non destructive sampling)
  • 48. Two baseline diets (with a perturbation) Low-fat (LF) baseline diet Milk-fat (MF) baseline diet Age (wk) 4 5 6 7 8 9 10 11 12 13 14 Baseline, Diet Switch, Washout (Return to Baseline) Total community function: TOT metagenomic sequencing at weeks 8, 11, 14 Virome community function: VLP, IND metagenomic sequencing at weeks 8, 11, 14 Weight of mice and count of VLPs with microscopy Taxonomy analysis (only 16S rRNA gene) every week from week 8 – 14. LF / 10% Fat / Complex Carbs MF / 37% Fat / Simple Sugars MF LF MF LF Fecal Samples
  • 49. Outcomes? Low-fat (LF) baseline diet Milk-fat (MF) baseline diet Age (wk) 4 5 6 7 8 9 10 11 12 13 14 Baseline, Diet Switch, Washout (Return to Baseline) LF / 10% Fat / Complex Carbs MF / 37% Fat / Simple Sugars MF LF MF LF Qualitative and Quantitative Measurements: Who is there? What are they doing? How much?
  • 50. How does the community change over time? Measurements of gene abundance profile (200,000+ genes) reduced to a single distance measurement from the original community (ordination) [Figure: three panels, No Change, Altered-Recovery, and Altered-Altered, each plotting distance from baseline across Baseline, Intervention, and Washout]
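
As a rough illustration of reducing a gene abundance profile to one distance from baseline, the sketch below uses Bray-Curtis dissimilarity. The choice of metric is an assumption (the slide does not name the distance used), and the three-gene profiles are made-up toy values, not data from the study.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors (0 = identical)."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def distance_from_baseline(profiles, baseline="Baseline"):
    """Map each stage to its distance from the baseline profile."""
    ref = profiles[baseline]
    return {stage: bray_curtis(ref, p) for stage, p in profiles.items()}


# Toy gene abundance profiles (hypothetical values) for one community.
profiles = {
    "Baseline": [10, 0, 5],
    "Intervention": [0, 10, 5],
    "Washout": [10, 0, 5],
}
distances = distance_from_baseline(profiles)
```

This toy community follows the "Altered-Recovery" pattern: the distance rises during the intervention and returns to zero after washout.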
  • 51. Rapid and resilient bacterial gut response after diet alteration [Figure: distance from baseline at Baseline, Intervention, and Washout; *** marked]
  • 52. Diet-specific functional total community recovery (mostly bacterial) [Figure: distance from baseline (axis 0.00 to 0.10) at Baseline, Diet Perturbed, and Washout; *** marked]
  • 53. Free living viruses in MF baseline are significantly altered without recovery. [Figure: distance from baseline (axis 0.0 to 0.3) at Baseline, Diet Perturbed, and Washout; *** marked]
  • 54. Prophages in MF baseline are significantly altered without recovery. [Figure: distance from baseline (axis 0.0 to 0.3) at Baseline, Diet Perturbed, and Washout]
  • 55. “Combat Zone” as diets change Milk-fat baseline (MF) mice have contrasting bacterial and viral responses, in which there is not a rapid recovery of viral communities
  • 56. Viral functions significantly changed during the milk fat baseline diet Decreases in Phage-related (p=0.01) Iron acquisition (p<0.01) Nucleotide metabolism (p=0.02) Carbohydrate metabolism (p=0.01) Motility and chemotaxis (p=0.03) Virulence and defense (p=0.03) [Figure: panels for Phage, Iron, Nucleotide, Carbs, and Flagella across Baseline, Change, and Washout]
  • 57.  Bacteroides (Bacteroidetes)  Clostridium (Firmicutes)  Eubacterium (Firmicutes) Significant decrease in genes associated with MF baseline viruses Ratio of Firmicutes and Bacteroidetes associated with obesity Turnbaugh, 2008 Bacteroides fragilis, Nutridesk.com C. difficile, Bioquell.ie National Geographic Turnbaugh, 2009
  • 58. Viromes potentially critical in gut microbiome response.  Members of gut microbiome community do not have co-occurring responses.  Loss of viral population and diversity is diet specific (related to a milkfat to lowfat diet transition)
  • 59. Ability to redirect structure and function of microbiome makes them pivotal drivers of health and disease Reyes et al, Nature Reviews Microbiology, 2012
  • 60. Virome directly causes host response Germ Free 11 week old mice (n = 3) Diet: Standard chow 3 week conventionalization A “standard control” Microbiome: Uniform cecal content of standard chow mice Experimentally introduced viruses Mouse Treatment 1: Lowfat baseline VLP Mouse Treatment 2: Milkfat baseline VLP Control: Buffer
  • 61. Significant decrease of intestinal inflammation in LF VLP treatments Pro-inflammatory cytokines in mucosal scrapings TNF-α IFN-γ [Figure: proximal colon TNF-alpha (ng/gl) and IFN-gamma (ng/g) for Control, LF VLPs, and MF VLPs; * marked]
  • 62. Conclusions  Gut microbiome has reproducible and distinct responses to diet.  Viruses have a unique response to diet perturbations and do not co-occur with bacteria.  Viruses observed to cause inflammation in infected germ free mice.  Big data workflow enabled strategic sampling design providing unparalleled access to viruses of gut microbiome
  • 64. Data-discovery is a national investment.
  • 65. Data-driven biological investigations MICROBES IN ECOSYSTEMS NATURE WATER SOIL MICROBIOMES HUMANS/ANIMAL ENGINEERED WASTEWATER High Throughput Frameworks: Metagenomic Metatranscriptomic Metaproteomic More relevant model systems Improved biomarkers Scaling approaches Big data computation Data driven discovery
  • 66. Core research values  Research that matters  Developing scientific frameworks that enable open-science initiatives (reproducible science)  Computational and experimental integration  Scale and power to multi-disciplinary approaches  Team value  Flexibility
  • 67. Going viral: The role of the human gut phageome in inflammatory bowel disease Objectives:  Define and compare core phageomes associated with healthy and diseased gut microbiomes  Determine impact of disease-associated gut phageomes on development of disease in knockout mouse models (predisposed to disease) NIH, National Institute of Diabetes and Digestive and Kidney Diseases; National Institute of Allergy and Infectious Diseases ($3-5M) Source: Nature.com What is the role of host-phage dynamics in the development of intestinal diseases? Integration of multiple datasets Improved model systems and biomarkers
  • 68. Microbial drivers of carbon metabolism and warming DOE Biological and Environmental Research ($3M/3 years, 40% PI with ISU Kirsten Hofmockel, 2013-2016) Source: Oak Ridge National Laboratory Contributions: • Omic-based characterization of carbon cycling microorganisms in the soil • Novel approaches to target carbon cycling subsets of community • Improved soil genomic databases to enable future carbon studies How do microbes contribute to carbon cycling models? Big data scaling Integration of multiple datasets
  • 69. Large-scale characterization of global dark matter proteins in complex biological environments NIH – Development of Software and Analysis Methods for Biomedical Big Data in Targeted Areas of High Need (~$1M/3 years) Gordon and Betty Moore – Data Driven Discovery Investigator Awards ($1.5M / 5 years) Novel extension of current software tools: • Integration of growing volumes of global public datasets with scalable data-mining analysis • Lightweight data architecture to compare abundance and co-occurrence of sequencing patterns across multiple samples and associated metadata to elucidate information How do we access the novelty observed in metagenomic datasets? Big data scaling Integration of datasets
  • 70. From field to food: The origin and fate of our microbiomes USDA Agriculture and Food Research Initiative ($1-2.5M) • Identify and characterize under-researched foodborne microbial hazards and effective control strategies • Elucidate fate and dissemination of foodborne microbial hazards associated with produce production and processing Source: aboretum.umn.edu Where do harmful microbes in our food come from and how do we protect ourselves from them? Integration of multiple datasets Improved model systems and
  • 71. Acknowledgements  Funding  DOE Microbial Carbon Cycling Grant  NSF Postdoc Fellowship, Great Lakes Bioenergy Research Center  Microbiome: University of Chicago Digestive Diseases Research Core Pilot and Feasibility Grant  My Awesome INTER-DISCIPLINARY Team  C. Titus Brown (MSU) + lab (Bioinformatics)  James Tiedje (MSU) + lab (Microbial Ecology)  Daina Ringus (UC) (Microbiology / Mice)  Kirsten Hofmockel, Ryan Williams, Fan Yang (ISU)  Eugene Chang (UC)  Folker Meyer (ANL)
  • 73. Reducing data, not information. More efficient data storage and mining. Big data scaling approaches
  • 74. Storage of biological big data  What other sequences are connected to Sequence X?  Data broken into words of length “k” (k-mers)  Overlap (for assembly) = shared “word” Pell, PNAS, 2014 Howe, PNAS, AGTCAGTT Into its 4-mers: AGTC GTCA TCAG CAGT AGTT AGAAAGTC Into its 4-mers: AGAA GAAA AAAG AAGT AGTC
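The k-mer decomposition on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the software used in the talk; the function names `kmers` and `shared_kmers` are hypothetical.

```python
def kmers(sequence, k):
    """Return every overlapping word of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def shared_kmers(seq_a, seq_b, k):
    """Return the k-mers two sequences have in common (their 'overlap')."""
    return set(kmers(seq_a, k)) & set(kmers(seq_b, k))


print(kmers("AGTCAGTT", 4))  # ['AGTC', 'GTCA', 'TCAG', 'CAGT', 'AGTT']
print(kmers("AGAAAGTC", 4))  # ['AGAA', 'GAAA', 'AAAG', 'AAGT', 'AGTC']
print(shared_kmers("AGTCAGTT", "AGAAAGTC", 4))  # {'AGTC'}
```

The two example sequences share the word AGTC, which is the kind of link an assembler would treat as a possible overlap.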
  • 75. Storage of biological big data  What other sequences are connected to Sequence X?  Data broken into words of length “k” (k-mers)  Overlap (for assembly) = shared “word”  How do we store “big data” words?  Bloom filter data structure  Efficient storage
  • 76. Do I have mail?  What other sequences are connected to Sequence X?  Data broken into bins of word length “k” (k-mers)  Overlap (for assembly) = shared “word”  How do we store “big data” words?  Bloom filter data structure  Mailbox analogy A-G H-R S-Z Pell, PNAS, 2014 Howe, PNAS,
  • 77.  Is Sequence A connected to Sequence B?  Data broken into bins of word length “k” (k-mers)  Overlap (for assembly) = shared “word”  How do we store “big data” words?  Bloom filter data structure  Mailbox analogy – Efficient storage of information A-G H-R S-Z A-G* H-R S-Z No mail for Howe, 100% sure. A-G H-R* S-Z Possibly mail for Howe. Pell, PNAS, 2014 Howe, PNAS, Do I have mail?
  • 78.  Is Sequence A connected to Sequence B?  Data broken into bins of word length “k” (k-mers)  Overlap (for assembly) = shared “word”  How do we store “big data” words?  Bloom filter data structure  Mailbox analogy – Efficient storage of information A-G H-R S-Z A-G H-R* S-Z G-N* A-F; O-T U-Z D-H* A-C; I-O P-Z Howe mail status: Mail possibility higher. Do I have mail?
  • 79.  Is Sequence A connected to Sequence B?  Data broken into bins of word length “k” (k-mers)  Overlap (for assembly) = shared “word”  How do we store “big data” words?  Bloom filter data structure  Mailbox analogy – Efficient storage of information A-G H-R S-Z A-G H-R* S-Z G-N* A-F; O-T U-Z D-H A-C; I-O P-Z Howe mail status: No mail, 100% sure. Do I have mail?
  • 80. Bloom filter data structure  “Probabilistic” data structure  Decrease of false positive rate with multiple Bloom filters – “More likely I have mail”  No false negatives – “No mail. 100% sure”  For the win: both detects and counts presence of sequences (k-mers) and their connectivity efficiently  Is sequence A connected to sequence B? Pell, PNAS, 2014 Howe, PNAS,
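The mailbox analogy can be made concrete with a minimal Bloom filter sketch in Python. It is an illustration under stated assumptions, not the implementation behind the talk: the class name, the bit-array size, the number of hashes, and the use of salted SHA-256 are all choices made for this example. Each hash plays the role of one set of mailbox bins. The sketch tests presence only; it does not count k-mers.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array queried through several hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits      # illustrative size, not tuned
        self.num_hashes = num_hashes  # each hash is one "set of mailboxes"
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by salting a single hash with an index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # False means "no mail, 100% sure"; True means only "possibly mail".
        return all(self.bits[pos] for pos in self._positions(item))


bloom = BloomFilter()
for word in ["AGTC", "GTCA", "TCAG", "CAGT", "AGTT"]:
    bloom.add(word)

print("AGTC" in bloom)  # True: a stored k-mer is never reported as missing
print("AAAG" in bloom)  # usually False; a True here would be a false positive
```

Memory use is fixed by `num_bits` regardless of how many k-mers are added, which is what makes the structure attractive for large datasets; the cost is a false positive rate that rises as the array fills.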

Editor's notes

  1. Hi, thanks for inviting me to talk to you today and for taking the time to come learn a little bit about my research. I’ll admit that this is one of the longer talks I’ve ever given. I’ve had only one other 90-minute talk, and it was to a group of Korean government officials who were interested in what a framework for big data analysis for a community might look like. But since they did not speak English, half of that talk was given by a translator. 
  2. Today, I’m going to give you an overview of my research, which is very much interdisciplinary, living on the edge of both computational biology and microbial ecology (the study of natural communities in the environment). I’m going to tell you a bit about my background and what shaped the research I’ve become involved in. Then I’m going to highlight a couple of research projects. The first will have more of a computational focus, where I tell you about research that tackled the “data deluge” that emerged from fast-changing sequencing technologies. Then I’ll tell you a story of how we used these tools on data to investigate how our bodies (which are in themselves a natural ecosystem) respond to dietary changes. Then finally, I’ll conclude with a discussion on where I view these efforts going in the future.
  3. Folks often ask me how I went from a Mechanical Engineering degree to microbial ecology, as it’s not the most conventional track. And I actually think if you talk to most people in the field now, many of them have arrived here by unconventional paths. As an ME at Purdue, I had the opportunity to do two internships where I looked at the environmental impacts of industrial machinery. In particular, in my junior year, I worked at Exxon Mobil evaluating sustainable replacements for outdated compressors on an oil platform, which pump the unrefined oil from the platform to the refinery. This got me very interested in understanding the environmental impacts of economic decisions and how we should evaluate them, and this brought me into a program that was in its first year at Purdue in Environmental Engineering, focusing on sustainability research. It was here where I first learned about microbiology and how it impacts our lives in so many different ways; visiting a wastewater treatment plant was a bit of a life-changing moment for me. I really fell in love with the natural ability of these invisible lifeforms in creating the world around us. So I went to grad school at the University of Iowa, where I worked on how to monitor the activity of microbes that degrade pollutants in both groundwater and soil. One of the main challenges I had was that we were always working with “model organisms” which didn’t necessarily match what was in the natural environment. After my PhD, the field of metagenomics was in its infancy but was being touted as a huge opportunity for studying natural environments. Jim Tiedje at MSU needed someone to start working with this sort of data, and though I had no experience in it, I was willing to give it a try, and with the support of the NSF and Titus Brown at MSU, that’s what we did. I must’ve done a decent job of it because I was recruited by Argonne National Lab a couple years ago to provide support for some projects ongoing locally there.
  4. Folks often ask me how I went from a Mechanical Engineering degree to microbial ecology, as it’s not the most conventional track. And I actually think if you talk to most people in the field now, many of them have arrived here by unconventional paths. But something that I think is shared is that you pursue research that is enabling and that can make a real difference. As an ME at Purdue, I had the opportunity to do two internships where I looked at the environmental impacts of industrial machinery. In particular, in my junior year, I worked at Exxon Mobil evaluating sustainable replacements for outdated compressors on an oil platform, which pump the unrefined oil from the platform to the refinery. This got me very interested in understanding the environmental impacts of economic decisions and how we should evaluate them, and this brought me into a program that was in its first year at Purdue in Environmental Engineering, focusing on sustainability research. It was here where I first learned about microbiology and how it impacts our lives in so many different ways; visiting a wastewater treatment plant was a bit of a life-changing moment for me. I really fell in love with the natural ability of these invisible lifeforms in creating the world around us. So I went to grad school at the University of Iowa, where I worked on how to monitor the activity of microbes that degrade pollutants in both groundwater and soil. One of the main challenges I had was that we were always working with “model organisms” which didn’t necessarily match what was in the natural environment. After my PhD, the field of metagenomics was in its infancy but was being touted as a huge opportunity for studying natural environments. Jim Tiedje at MSU needed someone to start working with this sort of data, and though I had no experience in it, I was willing to give it a try, and with the support of the NSF and Titus Brown at MSU, that’s what we did. 
I must’ve done a decent job of it because I was recruited by Argonne National Lab a couple years ago to provide support for some projects ongoing locally there.
  7. There are several grand challenges that our society is currently facing which I think are of paramount importance. These are predicting and managing the impacts of climate change, finding sustainable sources of liquid fuels, and understanding the emerging pandemics facing human health in recent years. From carbon emissions from land use (which are magnitudes more than those of car emissions), to degrading cellulosic biomass, to pathogens in our bodies, microbes are involved in complex communities that drive the health and productivity of either our natural resources or our own bodies. And it’s building up the expertise to ask
  8. My research explores these complex communities. These microbial communities are all connected: the food we eat contains microbes which we then “introduce into the environment” (mainly through wastewater treatment), and then these and other microbes impact biogeochemical cycling, which affects the global climate cycle and the flow of nutrients in natural systems. As I talk about my research today, I want to be sure to emphasize that it is broadly applicable.
  9. We’ve known about the importance of these environmental microbes for a long time, and much research has been spent answering three seemingly simple questions. One of the reasons this has been historically so challenging is something known as the great plate count anomaly. We know that there is a diverse world of microbes out there, but when we go into the laboratory and try to study their characteristics, we cannot grow them.
  10. When the first automated DNA sequencing machines came online in the late 80s, microbiologists had a new way to ask questions like who is there and what are they doing? If we could access the microbe that we were interested in, we could extract and sequence its DNA. We could then compare this DNA to previously seen DNA, identify the “who” and “what”, and add to the encyclopedia of genes we had information about.
  11. Sequencing opened up the door to start building a catalog of some observed key microbial players. It was expensive but effective. Many of the choices of who got sequenced were driven mostly by health and biotech. And this is the same set of reference genes that is still in use today. This graph here shows how expensive sequencing was in the early 90s, and how with early sequencing technologies this cost has changed over the years.
  12. What changed the field was the invention of next-generation technologies, basically allowing the throughput of these automated sequencers to be much higher and the cost of sequencing much cheaper. So cheap, in fact, that instead of sequencing only one bacterium you could start sequencing multiple, even bacteria from complex environments.
  13. You may have seen this in the news and recently highlighted on NPR under the subject of personalized medicine, and how it is getting to the point where we can all have our genomes sequenced as a baseline for our future health.
  14. This sequencing also opens up the door to start studying not only single isolates, but all the organisms in a natural system. So then the question is not only who is there and what are they doing, but what are they doing together and how?
  15. With this growth and opportunity, however, have come other challenges. What we are now dealing with is that sequencing technology is growing more rapidly than the computers used to even store the data, let alone to run the types of analysis that we need to make this data informative.
  16. 25x million times….And this is when I started my postdoc. To give a very concrete example: within the first year of my postdoc, the data I had to analyze grew from the largest known soil metagenome (a collection of environmental DNA sequences) at 50 million reads to about 40x that within literally 9 months. At that time, we were already overwhelmed with this much data. And to put this in the perspective of other datasets that were available at the time.
  17. So whereas I had spent a lot of time learning about how to grow bacteria and design an experiment during my PhD, now I was faced with an experiment that was designed to collect a lot of essential data but no way to start analyzing the data, simply because available tools wouldn’t work.
  18. One of the most effective ways to reduce genomic sequencing data is to do something which is referred to as genomic, or in our case metagenomic, assembly.
  19. Assembly is the process of rebuilding the original genome from the fragments of sequences we get from a sequencing machine. Essentially, it’s solving a puzzle where you look for overlaps of sequences among shredded information to predict a consensus sequence. If you do this process without any previous information (without a guiding reference), you would call this process de novo assembly. Assembly has several advantages.
  20. I want to emphasize that the difference between a single genome assembly (like that of a pure culture) vs metagenomic assembly (like that of DNA from a complex environment of soil) is a huge difference of scale. And with this difference, come multiple challenges.
  21. Again, assembly is the process of trying to come up with a consensus sequence based on finding overlaps in small fragments. Here is an example of an “assembly” of a sampling from the novel “A Tale of Two Cities”. You’ll notice here that because you have enough sampling of this sentence, you can get a good guess of what the original information would look like. You’ll also notice that there are some obstacles to getting the right solution: there are mistakes in the sampling, analogous to sequencing errors, and you would have to decide on some criteria to handle them. In this example, we are coming up with a solution of one sentence using 8 fragments. In metagenomic assembly, you are trying to come up with hundreds to thousands to even millions of genomes using billions of fragments. And to do this, you have to compare each fragment to every other one in the dataset, making it very computationally intensive.
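The overlap idea in this note can be sketched as a toy greedy assembler in Python. It is a deliberately naive illustration, not how production assemblers work: it ignores sequencing errors and repeated phrases, the function names are hypothetical, and the three text fragments are made up for the example rather than taken from the slide. The nested loop over all fragment pairs shows why the all-against-all comparison becomes so expensive at scale.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    while True:
        # Drop fragments fully contained in another fragment.
        frags = [f for f in frags if not any(f != g and f in g for g in frags)]
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):        # compare every fragment...
            for j, b in enumerate(frags):    # ...with every other fragment
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return frags  # nothing left to merge; remaining pieces are contigs
        merged = frags[best_i] + frags[best_j][best_size:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)


print(greedy_assemble(["it was the be", "the best of t", "t of times"]))
# ['it was the best of times']
```

With n fragments, each pass compares on the order of n squared pairs, which is manageable for three fragments and hopeless for billions.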
  22. To give you an idea of what computationally intensive means, even the smallest dataset that I had at the beginning of my postdoc required several months on a supercomputer, something having over 100 GB of RAM. These were resources I simply didn’t have at this time. And for my larger datasets, there was simply nothing I could do with them; they would essentially crash any available assembly program that existed. So I had to come up with a way to deal with all of this data, or essentially there were a handful of PIs that had just invested tens of thousands of dollars in a project where we couldn’t tractably handle the datasets.
  23. To give you an idea of what computationally intensive means, even the smallest dataset that I had at the beginning of my postdoc required several months on a supercomputer, something having over 100 GB of RAM. These were resources I simply didn’t have at this time. And for my larger datasets, there was simply nothing I could do with them; they would essentially crash any available assembly program that existed. So I had to come up with a way to deal with all of this data, or essentially there were a handful of PIs that had just invested tens of thousands of dollars in a project where we couldn’t tractably handle the datasets. I’m going to tell you now about how we were able to do this, and there are actually two different strategies we had to combine.
  24. So one of the first things we thought about is what makes natural communities different than single organisms and there are two main factors. One is that natural communities are diverse. There are multiple genomes, and even potentially millions of species, in a sample. And this is represented here by the presence of red, blue, and green organisms. Another main difference is that these organisms are present at a variable abundance in nature, some are highly abundant some are not.
  25. Firstly, let me acknowledge that assembly for single organisms (especially bacterial) is relatively mature. So one of the first things we thought about is what makes natural communities different than single organisms and there are two main factors. One is that natural communities are diverse. There are multiple genomes, and even potentially millions of species, in a sample. And this is represented here by the presence of red, blue, and green organisms. Another main difference is that these organisms are present at a variable abundance in nature, some are highly abundant some are not.
  27. Firstly, let me acknowledge that assembly for single organisms (especially bacterial) is relatively mature. So one of the first things we thought about is what makes natural communities different from single organisms, and there are two main factors. One is that natural communities are diverse. There are multiple genomes, and even potentially millions of species, in a sample. And this is represented here by the presence of red, blue, and green organisms. Another main difference is that these organisms are present at variable abundance in nature; some are highly abundant, some are not. The strategy we came up with asks: can we find the minimal dataset that you need for assembly, discarding the reads from this overkill section?
  28. From a sequencing standpoint then, what we see is that for a given genome (represented here as a dotted line), we start sampling fragments from it.
  29. As we sample more, we will have some sequences which will have errors in them.
  30. And we’ll keep sequencing this genome, randomly sampling different parts of it. We’ll get to a point, where we’ll have enough sequences where we can make a good guess at what the original sequence may have looked like.
  31. For example, here we have a total of 6 sequences covering this particular part highlighted by the black arrow, so we can be confident in saying we know what that is. From experience, I know this number should be about 6 sequences to get an accurate assembly. So anything beyond this 6 is excessive or redundant information.
  32. So we can discard or set aside this read and not use it for our assembly. And that actually turns out to be a good thing because in discarding this information, we’re actually removing data with errors in it.
  33. In the end, we end up with a minimal dataset needed for an assembly, shown here in pink, and a redundant set of information which we have set aside. In setting aside these reads here in red, we actually get to discard sequencing errors, which ends up improving the assembly results. So essentially, what we’ve shown is that assembling just the pink data gives at least the same assembly as assembling all of the data, if not an improved one. In assembling just the pink dataset, though, we’re able to reduce the amount of data we’re working with by up to 95% in some environmental datasets I’m working with.
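The read-reduction idea in notes 28 to 33 can be sketched in Python as a single pass over the reads. This is a minimal sketch under stated assumptions, not the published tool: estimating a read's coverage as the median count of its k-mers is an assumption made for this example, the cutoff of 6 comes from the "about 6 sequences" figure in the notes, and the function names are hypothetical. An exact `Counter` is used for clarity; at real scale a memory-efficient probabilistic counter would be needed.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return every overlapping word of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize(reads, k=4, cutoff=6):
    """Keep a read only while its estimated coverage is below the cutoff."""
    counts = Counter()        # k-mer counts from the reads kept so far
    kept, set_aside = [], []
    for read in reads:
        words = kmers(read, k)
        # Assumed estimator: median count of the read's k-mers.
        coverage = median(counts[w] for w in words) if words else 0
        if coverage < cutoff:
            kept.append(read)
            counts.update(words)
        else:
            set_aside.append(read)  # redundant: region already well sampled
    return kept, set_aside


reads = ["AGTCAGTT"] * 10 + ["CCCCGGGG"]
kept, set_aside = normalize(reads)
print(len(kept), len(set_aside))  # 7 4
```

Six copies of the abundant read are kept, the remaining four are set aside, and the rare read survives because nothing similar has been seen yet.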
  34. Another tool we’ve developed to deal with biological big data is a lightweight data structure that can break apart these datasets by connectivity.
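Breaking a dataset apart by connectivity can be illustrated with a small Python sketch that groups reads into connected components whenever they share a k-mer. This is an assumption-laden toy, not the data structure from the talk: it stores k-mers in an ordinary dictionary rather than a Bloom filter, and the union-find bookkeeping and the name `partition_reads` are choices made for this example.

```python
def kmers(read, k):
    """Return every overlapping word of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def partition_reads(reads, k=4):
    """Group reads into connected components linked by shared k-mers."""
    parent = list(range(len(reads)))  # union-find forest over read indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for word in kmers(read, k):
            if word in first_seen:
                parent[find(idx)] = find(first_seen[word])  # link the reads
            else:
                first_seen[word] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return list(groups.values())


reads = ["AGTCAGTT", "AGAAAGTC", "CCCCGGGG", "GGGGTTTT"]
for group in partition_reads(reads):
    print(group)
```

Each resulting group can then be processed independently, which is what makes partitioning useful for datasets too large to handle in one piece.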
  35. The system that these methods were developed for was sequencing data investigating soil biodiversity in both managed and natural soil systems. Soil biodiversity is amazing. Great Prairie – world’s most fertile. Important reference site for the biological basis and ecosystems of soil microbial communities. It sequesters most carbon, produces large amounts of biomass annually, key for biofuels and security. We know surprisingly little about the identities and functions of the microbes inhabiting soil. With applications of DNA sequencing, the field was really excited about how we could now gauge this specific ecological niche and its responsiveness to change. Once we came up with how to deal with the data and sift through the gleaned information, it was a sobering reality check on just how hard a challenge these environments will be.
  36. Overall, many functions are shared between corn and prairie soils. Interestingly, prairie soils have many more unique functions (indicated here as blue bars) compared to unique functions in the corn (here green). This result may reflect the varying management history of these two soils. Unlike the prairie soils, which have never been tilled, the corn soils have been cultivated for more than 100 years and have had annual additions of animal manure that potentially could enrich specific metabolic pathways with decreased diversity.
  37. I’m fortunate to have worked on many projects in which I feel that this is true.
  38. Ok, let’s talk about the gut microbiome and how it responds to different diets. This study is a collaboration with the UC.
  39. In recent years, there has been a growing appreciation for the fact that, as humans, we are in fact supraorganisms composed of both human and microbial cells, and as such we carry two sets of genes, those encoded in our own genome and those encoded in our microbiota. We genetically inherit only ~1% of our genes from our parents, and the remaining ~99% is mainly acquired from the immediate environment when we are born. Importantly, all the genes in our body, whether human or microbiome encoded, have the potential to have an impact on our health. The gut is the most densely colonized microbial community in the human body and is also one of the most diverse. The gut functions as a chemostat, a continuous culture system for microorganisms (mostly bacteria) in which fresh nutrients enter the system and cultured microorganisms leave at a relatively constant rate. Approximately 1.5 kg of bacteria are resident in our gut, and 50% of our faecal matter biomass is bacterial cells.
  40. These gut microbiota interact with our genetics and our environment (mainly diet) to influence our health. The gut microbiota releases toxins, such as lipopolysaccharides, and beneficial metabolites, such as vitamins and short-chain fatty acids, to damage or nourish humans. We know that diet has a greater potential to shape the structure and function of the gut microbiota than host genetics, thus influencing our health state directly.
  41. There are two key efforts that have looked into the response of gut communities to diet changes – one was Zhang et al which worked in mice.
  42. After reading these studies, a key question I had was has anyone looked into the viral components of the human gut. Much of the gut microbiome literature focuses only on the bacterial component of the gut, but we know that viruses are abundant in the gut environment, present at a ratio of 1:1. So beyond bacterial cells, there are…viruses as prophage… As far as I know, there are only 3 studies that have looked at the gut virome. From these preliminary studies, …
  43. The majority of these viruses are phages, or viruses that only infect bacteria. These viruses are a pivotal driver of gut health and disease, as these phages are able to redirect the structure and function of the entire gut microbial community.
  44. Specifically, phages can alter the fitness and function of bacterial populations through their transfer of genetic material,
  45. skew the abundance of bacterial population by infection, and drive the evolution of the community with their diversity and modifications of bacterial hosts. The broad range of subtle and robust effects that phages can exert on the gut microbial community makes them key targets in understanding health and the pathogenesis of gut diseases. Overlooked
  46. We know that we’ve previously seen in bacterial communities…
  47. Access through faecal matter.
  48. We targeted each of these communities in mice that had been fed a baseline diet, then switched diets, and then returned to their original diet. This would allow us to see what was being altered by diet, by how much, and whether functions returned to normal after the return to the baseline diet (the washout). We did this on two diets to see if they had distinct responses. Additionally, we took 16S rRNA DNA samples from fecal samples every week of the experiment so we could get a better resolution of the changes in community structure. The diets that we are studying here are basically different in their fat content. There is a low fat (LF) and a milk fat (MF) diet, with about a fourfold difference in fat content. But another thing to note is that these diets also differ in the amount of corn starch and sugar: the LF diet has more corn starch, so complex carbs, and the MF diet has more simple sugars, like sucrose. Finally, also note that I designed this experiment to really access the virome part of the microbiome, much more so than has previously been looked at.
  49. A key result I would like to discuss is how different communities (both bacterial and viral) change over the course of this experiment. To talk about this result, I’m going to show you the change in abundances of over 200,000 genes in a way that you can visually interpret. These analysis products are actually another whole talk about the challenges of presenting biological big data. Basically, to see how communities change over time, I’m going to estimate how different they are as a “distance” from their original baseline communities. You can imagine three possible ideal situations.
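A "distance from baseline" can be computed in many ways, and the notes do not say which measure was used. As one hedged illustration, the Python sketch below uses Bray-Curtis dissimilarity, a common choice in microbial ecology; the metric, the function name, and the gene abundances are all assumptions made for this example.

```python
def bray_curtis(sample, baseline):
    """Bray-Curtis dissimilarity between two gene-abundance profiles.

    0.0 means identical profiles; 1.0 means no genes in common.
    """
    genes = set(sample) | set(baseline)
    shared = sum(min(sample.get(g, 0), baseline.get(g, 0)) for g in genes)
    total = sum(sample.values()) + sum(baseline.values())
    return 1 - 2 * shared / total if total else 0.0


# Made-up abundances for three time points of one community.
baseline = {"geneA": 10, "geneB": 5}
during_diet = {"geneA": 5, "geneB": 5}
after_washout = {"geneA": 10, "geneB": 5}

print(round(bray_curtis(during_diet, baseline), 3))    # 0.2
print(round(bray_curtis(after_washout, baseline), 3))  # 0.0
```

In this made-up case the community moves away from baseline during the diet switch and returns fully after washout, which is one of the idealized outcomes the note alludes to.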
  50. The data allows me to look at not only “if communities change” but “how communities change”. One of the strongest signals in the viruses in the MF baseline over time is the decrease of phage-related functions, which is accompanied by a decrease in the richness and diversity of the viral communities. (BLUE) The strongest signal is that phage functions decrease significantly in free-living virus communities, and that there is a loss of both the abundance of the free-living viral community membership and diversity during the MF diet. This corresponds to significant decreases in functions encoded within genes of this community. Reduced availability for the total community, through viral infection.
  51. When we looked at significant changes of sequences associated with different organisms in the MF diet, the phyla Bacteroidetes and Firmicutes showed significant decreases, especially in viruses related to these hosts. This is consistent with previous reports.
  52. We know that diet can be a cause of obesity and that we all respond to different diets. Yet, we find that in general, bacteria in our guts are resilient to change. Then, my thought is that there must be something else that is directing our bodies’ response to diet change…and I would suggest that it could be viral populations…
  53. These viruses are likely to be a pivotal driver of gut health and disease as these phages are able to redirect the structure and function of the entire gut microbial community.
  54. Potential consequences of a temperate phage life cycle in the human gut. Metagenomic studies of viruses suggest that a temperate lifestyle is dominant in the distal human gut, in contrast to the lytic lifestyle observed in open oceans. This temperate lifestyle can have benefits for the phage and the bacterial host, and can alter phage–host dynamics. Integration as a prophage (part a) protects the host from superinfection, effectively ‘immunizing’ the bacterial host against infection from the same or a closely related phage. Furthermore, the genes encoded by the phage genome may expand the niche of the bacterial host by enabling metabolism of new nutrient sources (for example, carbohydrates), providing antibiotic resistance, conveying virulence factors or altering host gene expression. This temperate (lysogenic) life cycle allows phage expansion in a 1:1 ratio with the bacterial host. If the prophage conveys increased fitness to its bacterial host, there will be an increased prevalence of the host and phage in the microbiota. Induction of a lytic cycle (part b) can follow a lysogenic state and can be triggered by environmental stress. As a consequence, bacterial turnover is accelerated and energy utilization is optimized through a ‘phage shunt’, in which the debris remaining after lysis is used as a nutrient source by the surviving bacterial population. Furthermore, a bacterial subpopulation that undergoes lytic induction sweeps away other sensitive species and increases the niche for survivors (that is, bacteria that already have the specific phage integrated into their genome). Periodic induction of prophages can also lead to a constant-diversity dynamic, which helps maintain community structure and functional efficiency. 
Novel infections or infections of novel bacterial hosts by phages (part c) bring the benefit of horizontally transferred genes and create selective pressure on the hosts for diversification of their phage receptors, which are often involved in carbohydrate utilization. HGT, horizontal gene transfer.
  55. For the final part of my talk today, I wanted to present some of my ideas for future work, especially work that I feel would be successful here at Iowa State with your expertise and resources.
  56. I hope I’ve presented to you how I’ve been able to leverage next-generation sequencing data and big data in biology to start enabling data-driven discovery in this field. Iowa State has done what I think is a really smart thing in identifying it as a great opportunity for research in the future. This parallels a heavy investment at both government and private funding agencies. For example, the NIH has named Philip E. Bourne, Ph.D., as the first permanent Associate Director for Data Science (ADDS). Dr. Bourne is expected to join the NIH in early 2014. “Phil will lead an NIH-wide priority initiative to take better advantage of the exponential growth of biomedical research datasets, which is an area of critical importance to biomedical research. The era of ‘Big Data’ has arrived, and it is vital that the NIH play a major role in coordinating access to and analysis of many different data types that make up this revolution in biological information,” said Collins.
  57. My research is broadly applicable across multiple environments within a continuum. I’m also very multidisciplinary, and my research occurs on a continuum connecting high-throughput data discovery frameworks, big data scaling approaches, and efforts to identify the best model systems or biomarkers for investigating complex natural communities.
  58. In the next few slides, I’m going to go through some ideas I’ve had for grants that I will be writing in the next 5 years. But before I do that, I wanted to emphasize what I think are my core research values.
  59. Building on viruses and their impacts on human health, a natural extension of the research I presented on the gut microbiome and diet change is to extend to the human system. The NIH is heavily invested in questions like this.
  60. The grant here is actually unique among the next slides in that it is already funded for the next few years. This is a grant with Kirsten Hofmockel here at ISU, where we’ve been funded to look at microbial drivers of carbon metabolism and warming. Our key goal here is to understand…
  61. Unknowns in soil… A large opportunity for data-driven discovery…
  62. Finally, another area of interest I have is the origin and fate of our microbiomes.
  63. So we’ve talked about digital normalization as an effective way of compressing the data or reducing it for assembly. Another strategy we used to deal with very large and complex metagenomes was to come up with a way to break them into pieces that could be analyzed separately. If you’re working on a jigsaw puzzle, this would be akin to separating out different colored pieces to work on separately. To do this, we first needed a more efficient way to both store and query our data. And next I’ll tell you about how we did this.
  64. The main challenge in dealing with our datasets is that we could not figure out whether Sequence A was connected to Sequence B within practical resources, both time and computation. The first trick we could use is to break each sequencing read into words of length k. This way, rather than comparing long sequences, aligning them, and checking for overlaps, we could just look for the presence of the same “word” in two sequences to say that they should be connected together. This trick is a common one in assembly, but it isn’t enough for our dataset volume. We simply had too many words to store efficiently.
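  To make the word trick concrete, here is a minimal Python sketch. The function names (kmers, share_word), the example reads, and the choice of k = 8 are all made up for illustration and are not taken from the software described in the talk. It stores every word in an ordinary set, which is exactly the part that does not scale to very large datasets.

```python
def kmers(read, k):
    """Return the set of all words of length k found in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def share_word(read_a, read_b, k):
    """True if the two reads have at least one word of length k in common."""
    return not kmers(read_a, k).isdisjoint(kmers(read_b, k))


# Hypothetical reads, chosen only to illustrate the idea.
seq_a = "ATGGCGTACGTTAGC"
seq_b = "CGTACGTTAGCAATG"   # overlaps the end of seq_a
seq_c = "TTTTTTTTTTTTTTT"   # unrelated

print(share_word(seq_a, seq_b, 8))  # the two reads share a word
print(share_word(seq_a, seq_c, 8))  # no shared word
```

  No alignment is needed: a single shared word is treated as evidence that two reads belong together.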
  65. So we came up with the application of a data structure called a Bloom filter. And it’s this data structure that allowed us to start storing our huge datasets. A good way to think about Bloom filters is to think about what I call the mailbox analogy.
  66. Say you have to store the mail of everyone in this room in 3 slots. A solution would be to divide these three slots by last name and use them to let people know if they had mail. For example, if there is a marker in the slot that contains your last name, you would go check for mail.
  67. So if an asterisk here represents that there is mail in that slot, and I wanted to know if I had mail, there are a couple of outcomes. One is that only a slot that doesn’t represent my last name has mail, so my own slot is empty. In that case, I know for sure that I do not have mail. Another possibility is that my representative slot has mail. I could then possibly have mail or not have mail.
  68. So you can see how we can use these smaller mailboxes to query whether or not we have mail for a large number of people hopefully.
  69. If I wanted to make this even more efficient, I could add a couple more mailbox setups with varying divisions of last names. So if I came into this mailroom and looked for mail, I could see that mailboxes 1 and 2 say I could have mail, but since mailbox 3 is empty, I know I don’t have mail. And it’s likely someone else who has mail in the two other mailboxes.
  70. No false negatives = sequences that are connected will never be identified as not connected. False positives = sequences may be identified as connected when they are not. And you can see how this sort of storage system allows us to have no false negatives. So we apply this same strategy to check for the presence or absence of a sequence, or more specifically a word of length k. And this allows us then to ask: is Sequence A connected to Sequence B?
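  The mailbox idea can be sketched as a small Bloom filter in Python. This is an illustrative toy, not the implementation used in the work described here: the class name, the table size, the number of hash functions, and the use of SHA-256 to pick slots are all assumptions made for the example. Each hash function plays the role of one mailbox setup, and a word is reported as possibly present only if every one of its slots is marked.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: several 'mailbox setups' sharing one table of slots."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per slot, for simplicity

    def _positions(self, word):
        # One slot per hash function, derived by salting the word with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos] = 1

    def __contains__(self, word):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(word))


def kmers(read, k):
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def possibly_connected(read, bloom, k):
    """True if any word of the read may already be stored in the filter."""
    return any(word in bloom for word in kmers(read, k))


# Hypothetical reads, for illustration only.
bloom = BloomFilter()
for word in kmers("ATGGCGTACGTTAGC", 8):
    bloom.add(word)

print(possibly_connected("CGTACGTTAGCAATG", bloom, 8))
```

  A stored word is always found again, so there are no false negatives. An unstored word can occasionally land on slots that other words have already marked, which is where false positives come from; a larger table or a different number of hash functions changes how often that happens.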