C. Titus Brown
Assistant Professor
MMG, CSE, BEACON
Michigan State University
May 2014
ctb@msu.edu
Large-scale transcriptome sequencing of non-model
organisms: coping mechanisms
We practice open science!
Everything discussed here:
• Code: github.com/ged-lab/ ; BSD license
• Blog: http://ivory.idyll.org/blog (‘titus brown blog’)
• Twitter: @ctitusbrown
• Grants on Lab Web site: http://ged.msu.edu/research.html
• Preprints available.
Everything is > 80% reproducible by you.
The challenges of non-model transcriptomics
• Missing or low-quality genome reference.
• Evolutionarily distant.
• Most extant computational tools focus on model organisms:
• Assume low polymorphism (internal variation)
• Assume reference genome
• Assume somewhat reliable functional annotation
• More significant compute infrastructure
…and cannot easily or directly be used on critters of interest.
Outline
1. Challenges of non-model transcriptomics.
2. Lamprey: too much data, not enough genome
3. Digital normalization as a coping mechanism
4. …applied to Molgulid ascidians…
5. …and back to lamprey.
6. More transcriptome challenges
7. What’s next? (Implications of free data + free data analysis.)
Sea lamprey in the Great Lakes
• Non-native
• Parasite of medium to large fishes
• Caused populations of host fishes to crash
Li Lab / Y-W C-D
The problem of lamprey:
• Diverged at base of vertebrates; evolutionarily distant from model organisms.
• Large, complicated genome (~2 Gb)
• Relatively little existing sequence.
• We sequenced the liver genome…
Lamprey has incomplete genomic sequence
J. Smith et al., PNAS 2009
Evidence of somatic recombination; 100s of Mb of sequence eliminated from genome during development.
More recent evidence (unpub., J. Smith et al.) suggests that this loss is developmentally regulated, results in changes in gene expression (due to loss of genes!), and is tissue specific.
Liver genome is not the entire genome.
Lamprey tissues for which we have mRNAseq
• embryo stages (late blastula, gastrula, neurula, 22b, neural-crest migration, 24c1, 24c2)
• larval (gill, kidney, liver, intestine)
• metamorphosis 1 (intestine, kidney)
• metamorphosis 2 (liver, intestine, …)
• metamorphosis 3 (intestine, kidney)
• metamorphosis 4 (intestine, kidney)
• metamorphosis 5 (liver, intestine, kidney)
• metamorphosis 6 (intestine, kidney)
• metamorphosis 7 (intestine, kidney)
• juvenile (intestine, liver, kidney)
• small parasite distal intestine, kidney, proximal intestine
• freshwater (gill, intestine, kidney)
• salt water (gill, intestine)
• adult intestine
• adult kidney
• brain paired
• brain (0, 3, 21 dpi)
• spinal cord (0, 3, 21 dpi)
• lips
• supraneural tissue
• monocytes
• preovulatory female eye
• preovulatory female tail skin
• ovulatory female head skin
• prespermiating male gill
• spermiating male gill
• spermiating male head skin
• spermiating male muscle
• mature adult male rope tissue
Assembly
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
It was the best of times, it was the worst of times, it was the age of
wisdom, it was the age of foolishness
…but for lots and lots of fragments!
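The fragment puzzle on this slide can be tried out in a few lines. What follows is a minimal sketch of a greedy overlap merge, not the method used by the assemblers discussed later in the talk; the function names and the `min_len` parameter are made up for illustration. Greedy merging works on this toy example but is easily confused by repeats in real data.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(fragments)
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left that overlaps; keep separate contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# The four fragments shown on the slide.
fragments = [
    "It was the best of times, it was the wor",
    ", it was the worst of times, it was the",
    "isdom, it was the age of foolishness",
    "mes, it was the age of wisdom, it was th",
]
contigs = greedy_assemble(fragments)
print(contigs)
```

With these four fragments the merge ends in a single contig, the full sentence shown on the slide.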
Shared low-level transcripts may not reach the threshold for assembly.
Main problem (4 years ago):
We have a massive amount of data that
challenges existing computers when we try to
assemble it all together.
Solution: Digital normalization
(a computational version of library normalization)
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!
This 100x will consume disk space and, because of errors, memory.
We can discard it for you…
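The arithmetic behind the A/B example can be written out as a short sketch. The function name and arguments are illustrative only; the numbers are the ones on the slide (A is 10 times as abundant as B, and the goal is 10x coverage of B).

```python
def coverage_per_species(relative_abundance, rare, target):
    """Coverage each molecule gets when `rare` is sequenced to `target`x."""
    scale = target / relative_abundance[rare]
    return {name: scale * level for name, level in relative_abundance.items()}


coverage = coverage_per_species({"A": 10, "B": 1}, rare="B", target=10)
# Coverage beyond the 10x target is what normalization aims to discard.
excess = {name: c - 10 for name, c in coverage.items()}
print(coverage)
print(excess)
```

Under these assumptions A ends up at 100x, so 90x of its coverage is surplus to the 10x target.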
Digital normalization
Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
• Is single pass: looks at each read only once;
• Does not “collect” the majority of errors;
• Keeps all low-coverage reads;
• Smooths out coverage of sequencing.
=> Enables analyses that are otherwise completely impossible.
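The properties listed above can be illustrated with a small single-pass filter. This is a minimal sketch, assuming the median k-mer count rule (keep a read only if the median count of its k-mers, among reads kept so far, is below a cutoff). It uses an exact Python `Counter` rather than the fixed-memory probabilistic counting that khmer implements, ignores reverse complements, and the `k` and `cutoff` values are illustrative.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Single pass: keep a read only while its estimated coverage is low."""
    counts = Counter()
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read shorter than k
        # Median k-mer count so far serves as the coverage estimate.
        if median(counts[x] for x in ks) < cutoff:
            kept.append(read)
            counts.update(ks)  # only kept reads add k-mers (and errors)
    return kept


# Toy data: one very abundant read and one rare read (made-up sequences).
abundant = "ACGTTGCAAG"
rare = "TTTTGGGGCC"
reads = [abundant] * 50 + [rare]
kept = digital_normalization(reads, k=4, cutoff=5)
print(len(reads), "->", len(kept))
```

In this toy run the abundant read is kept only until its coverage estimate reaches the cutoff, while the rare read is always kept. Because discarded reads never update the counts, their sequencing errors are not collected either.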
Evaluating diginorm: how?
• Can’t assemble lamprey w/o diginorm; are results any good & how would we know?
• Need comparative data set
• …ascidians!
Looking at the Molgula…
Putnam et al., 2008, Nature. Modified from Swalla 2001
Sea squirts!
Molgula oculata
Molgula occulta
Molgula oculata Ciona intestinalis
Elijah Lowe; collaboration w/Billie Swalla
Tail loss and notochord genes
a) M. oculata b) hybrid (occulta egg x oculata sperm) c) M. occulta
Notochord cells in orange
Swalla, B. et al. Science, Vol 274, Issue 5290, 1205-1208, 15 November 1996
Diginorm applied to Molgula embryonic
mRNAseq
Substantial time savings (3-5x); much less RAM.
Elijah Lowe
Question: does it matter what assembly pipeline you use? (No)
[Venn diagram of the four assemblies: Diginorm V/O, Raw V/O, Diginorm Trinity, Raw Trinity. The largest region holds 13563 orthologs; the other 14 regions hold between 1 and 70 each.]
Numbers are putative orthologs (reciprocal best hits) w/ Ciona intestinalis, calculated for each assembly.
Elijah Lowe
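For readers unfamiliar with the reciprocal-best-hit criterion used for the ortholog counts, here is a minimal sketch. It assumes the best hit in each direction has already been computed (for example from similarity searches) and stored in dictionaries; the function name and the sequence IDs are hypothetical.

```python
def reciprocal_best_hits(best_in_b, best_in_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    pairs = set()
    for a, b in best_in_b.items():
        if best_in_a.get(b) == a:
            pairs.add((a, b))
    return pairs


# Hypothetical best-hit tables: assembly -> Ciona, and Ciona -> assembly.
molgula_to_ciona = {"mol_1": "cio_7", "mol_2": "cio_7", "mol_3": "cio_9"}
ciona_to_molgula = {"cio_7": "mol_1", "cio_9": "mol_4"}
rbh = reciprocal_best_hits(molgula_to_ciona, ciona_to_molgula)
print(rbh)
```

Only `mol_1` and `cio_7` pick each other, so they form the single putative ortholog pair in this toy input.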
Why Trinity vs Oases?
Trinity is slightly better at picking out isoforms.
Elijah Lowe
How complete are these
transcriptomes?
Elijah Lowe
Transcriptome assembly thoughts
• We can (now) assemble really big data sets, and get pretty good results.
• We have lots of evidence (some presented here :) that some assemblies are not strongly affected by digital normalization.
(Note: the normalization algorithm is now a standard part of the Trinity mRNAseq pipeline.)
Transcriptome results - lamprey
• Started with 5.1 billion reads from 50 different tissues.
(4 years of computational research, and about 1 month of compute time, GO HERE)
Ended with:
Lamprey transcriptome basic stats
• 616,000 transcripts (!)
• 263,000 transcript families (!)
(This seems like a lot.)
• Only 20,436 transcript families have transcripts > 1 kb
(compare with mouse: 17,331 of 29,769 genes are > 1 kb)
So a rule-of-thumb estimate is not that far off, for long transcripts.
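The comparison with mouse can be made explicit with a little arithmetic. This sketch uses only the counts quoted on the slide; the variable names are illustrative.

```python
lamprey_families = 263_000
lamprey_long = 20_436      # families with a transcript > 1 kb
mouse_genes = 29_769
mouse_long = 17_331        # genes > 1 kb

lamprey_fraction = lamprey_long / lamprey_families
mouse_fraction = mouse_long / mouse_genes
long_ratio = lamprey_long / mouse_long

print(f"lamprey families > 1 kb: {lamprey_fraction:.1%}")
print(f"mouse genes > 1 kb:      {mouse_fraction:.1%}")
print(f"lamprey/mouse long-transcript ratio: {long_ratio:.2f}")
```

Only about 8% of lamprey transcript families are long, against about 58% of mouse genes, yet the absolute counts of long transcripts differ by a factor of only about 1.18. That is the sense in which the lamprey number is plausible for long transcripts.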
Common vs rare genes
[Plot: number of transcripts vs. number of samples]
Camille Scott
Can look at transcripts by tissue --
Camille Scott
Too… many… samples…
Camille Scott
Presence/absence clustering
Expression-based clustering
Some known biology recapitulated; and… ???
Camille Scott
Next challenges
OK, we can deal with volume of data, make pretty
pictures, and ... Now what?
Contamination!
Both experimental and “real” contaminants are big problems.
Camille Scott
Pathway predictions vary dramatically
depending on data set, annotation
Likit Preeyanon
KEGG pathway comparison across several different gene annotation sets for chicken
The problem of lopsided gene characterization is
pervasive: e.g., the brain "ignorome"
"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression
networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains.
The major distinguishing characteristic between these sets of genes is date of discovery, early
discovery being associated with greater research momentum—a genomic bandwagon effect."
Ref.: Pandey et al. (2014), PLoS ONE 9, e88889.
Slide courtesy Erich Schwarz
Practical implications of diginorm
• Data is (essentially) free;
• For some problems, analysis is now cheaper than data gathering (i.e. essentially free);
• …plus, we can run most of our approaches in the cloud (per-hour rental compute resources).
1. khmer-protocols
• Effort to provide standard “cheap” assembly protocols for the cloud.
• Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis.
• Open, versioned, forkable, citable.
(“Don’t bother me unless it doesn’t work.”)
Read cleaning → Diginorm → Assembly → Annotation → RSEM differential expression
CC0; BSD; on github; in reStructuredText.
A few thoughts on our approach…
• Explicitly a “protocol”: explicit steps, copy-paste, customizable.
• No requirement for computational expertise or significant computational hardware.
• ~1-5 days to teach a bench biologist to use.
• $100-150 of rental compute (“cloud computing”)…
• …for a $1000 data set.
• Adding in quality control and internal validation steps.
Can we crowdsource bioinformatics?
We already are! Bioinformatics is already a tremendously open and
collaborative endeavor. (Let’s take advantage of it!)
“It’s as if somewhere, out there, is a collection of totally free software
that can do a far better job than ours can, with open, published
methods, great support networks and fantastic tutorials. But that’s
madness – who on Earth would create such an amazing resource?”
- http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/
2. Data availability is important for
annotating distant sequences
[Figure legend, best-hit categories: Cephalopod; Mollusc; Anything else; no similarity]
Can we incentivize data sharing?
• ~$100-$150/transcriptome in the cloud
• Offer to analyze people’s existing data for free, IFF they open it up within a year.
See:
• CephSeq white paper.
• “Dead Sea Scrolls & Open Marine Transcriptome Project” blog post.
First results: Loligo
genomic/transcriptome resources
Putting other people’s sequences where my mouth is:
w/ Josh Rosenthal and Brenton Graveley
“Research singularity”
The data a researcher generates in their lab constitute an increasingly small component of the data used to reach a conclusion.
Corollary: The true value of the data an individual investigator generates should be considered in the context of aggregate data.
Even if we overcome the social barriers and incentivize sharing,
we are, needless to say, not remotely prepared for sharing all
the data.
Acknowledgements
Lab members involved
• Adina Howe (w/ Tiedje)
• Jason Pell
• Arend Hintze
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
• Camille Scott
• Jordan Fish
• Michael Crusoe
• Leigh Sheneman
Collaborators
• Billie Swalla (UW)
• Josh Rosenthal (UPR)
• Weiming Li, MSU
• Ona Bloom (Feinstein), Jen Morgan (MBL), Joe Buxbaum (MSSM)
Funding
USDA NIFA; NSF IOS; NIH; BEACON.
Current research (khmer software)
• Efficient online counting of k-mers
• Trimming reads on abundance
• Efficient De Bruijn graph representations
• Read abundance normalization
• Streaming algorithms for assembly, variant calling, and error correction
• Cloud assembly protocols
• Efficient graph labeling & exploration
• Data set partitioning approaches
• Assembly-free comparison of data sets
• HMM-guided assembly
• Efficient search for target genes

Weitere ähnliche Inhalte

Was ist angesagt?

2014 khmer protocols
2014 khmer protocols2014 khmer protocols
2014 khmer protocols
c.titus.brown
 
2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assembly
c.titus.brown
 
2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-research
c.titus.brown
 
2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assembly
c.titus.brown
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011
c.titus.brown
 
2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptx2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptx
c.titus.brown
 

Was ist angesagt? (20)

2014 khmer protocols
2014 khmer protocols2014 khmer protocols
2014 khmer protocols
 
2012 oslo-talk
2012 oslo-talk2012 oslo-talk
2012 oslo-talk
 
2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assembly
 
Genome assembly: then and now — v1.1
Genome assembly: then and now — v1.1Genome assembly: then and now — v1.1
Genome assembly: then and now — v1.1
 
Genome assembly: then and now — v1.2
Genome assembly: then and now — v1.2Genome assembly: then and now — v1.2
Genome assembly: then and now — v1.2
 
Genome assembly: then and now — with notes — v1.1
Genome assembly: then and now — with notes — v1.1Genome assembly: then and now — with notes — v1.1
Genome assembly: then and now — with notes — v1.1
 
Genome Assembly 2018
Genome Assembly 2018Genome Assembly 2018
Genome Assembly 2018
 
2013 duke-talk
2013 duke-talk2013 duke-talk
2013 duke-talk
 
2014 whitney-research
2014 whitney-research2014 whitney-research
2014 whitney-research
 
2013 stamps-intro-assembly
2013 stamps-intro-assembly2013 stamps-intro-assembly
2013 stamps-intro-assembly
 
2013 alumni-webinar
2013 alumni-webinar2013 alumni-webinar
2013 alumni-webinar
 
How to sequence a large eukaryotic genome
How to sequence a large eukaryotic genomeHow to sequence a large eukaryotic genome
How to sequence a large eukaryotic genome
 
Future of metagenomics
Future of metagenomicsFuture of metagenomics
Future of metagenomics
 
2014 nyu-bio-talk
2014 nyu-bio-talk2014 nyu-bio-talk
2014 nyu-bio-talk
 
Aug2015 analysis team spiral genetics
Aug2015 analysis team spiral geneticsAug2015 analysis team spiral genetics
Aug2015 analysis team spiral genetics
 
U Florida / Gainesville talk, apr 13 2011
U Florida / Gainesville  talk, apr 13 2011U Florida / Gainesville  talk, apr 13 2011
U Florida / Gainesville talk, apr 13 2011
 
Lets Make a Mammoth
Lets Make a Mammoth  Lets Make a Mammoth
Lets Make a Mammoth
 
2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptx2013 stamps-assembly-methods.pptx
2013 stamps-assembly-methods.pptx
 
Jan2016 bio nano han cao
Jan2016 bio nano han caoJan2016 bio nano han cao
Jan2016 bio nano han cao
 
Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015
 

Andere mochten auch

Raving fans hofstra 11 30-10
Raving fans hofstra 11 30-10Raving fans hofstra 11 30-10
Raving fans hofstra 11 30-10
John Doyle
 
VAFF 2014 sponsorship & partnership
VAFF 2014 sponsorship & partnershipVAFF 2014 sponsorship & partnership
VAFF 2014 sponsorship & partnership
reginal97
 
電子商務溝通 – 期末考
電子商務溝通 – 期末考電子商務溝通 – 期末考
電子商務溝通 – 期末考
guestaff5e9
 
Homework, Term 3 & 4
Homework, Term 3 & 4Homework, Term 3 & 4
Homework, Term 3 & 4
Takahe One
 
2014 manchester-reproducibility
2014 manchester-reproducibility2014 manchester-reproducibility
2014 manchester-reproducibility
c.titus.brown
 
Hohmann liber2006text
Hohmann liber2006textHohmann liber2006text
Hohmann liber2006text
Tina Hohmann
 
Ashleigh and Sarah's: Killer Whales
Ashleigh and Sarah's: Killer WhalesAshleigh and Sarah's: Killer Whales
Ashleigh and Sarah's: Killer Whales
Takahe One
 
Everest - Everything is a resource
Everest - Everything is a resourceEverest - Everything is a resource
Everest - Everything is a resource
Clément Escoffier
 

Andere mochten auch (20)

Raving fans hofstra 11 30-10
Raving fans hofstra 11 30-10Raving fans hofstra 11 30-10
Raving fans hofstra 11 30-10
 
Healthcare Costs And Performance in the OECD
Healthcare Costs And Performance in the OECDHealthcare Costs And Performance in the OECD
Healthcare Costs And Performance in the OECD
 
Homework, Term 3 & 4
Homework, Term 3 & 4Homework, Term 3 & 4
Homework, Term 3 & 4
 
Peixoto e Cury Advogados
Peixoto e Cury AdvogadosPeixoto e Cury Advogados
Peixoto e Cury Advogados
 
VAFF 2014 sponsorship & partnership
VAFF 2014 sponsorship & partnershipVAFF 2014 sponsorship & partnership
VAFF 2014 sponsorship & partnership
 
電子商務溝通 – 期末考
電子商務溝通 – 期末考電子商務溝通 – 期末考
電子商務溝通 – 期末考
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streaming
 
2015 pag-chicken
2015 pag-chicken2015 pag-chicken
2015 pag-chicken
 
2014 moore-ddd
2014 moore-ddd2014 moore-ddd
2014 moore-ddd
 
Netiquette
NetiquetteNetiquette
Netiquette
 
Homework, Term 3 & 4
Homework, Term 3 & 4Homework, Term 3 & 4
Homework, Term 3 & 4
 
2014 manchester-reproducibility
2014 manchester-reproducibility2014 manchester-reproducibility
2014 manchester-reproducibility
 
Hohmann liber2006text
Hohmann liber2006textHohmann liber2006text
Hohmann liber2006text
 
18 Di Concetta
18 Di Concetta18 Di Concetta
18 Di Concetta
 
2012 wellcome-talk
2012 wellcome-talk2012 wellcome-talk
2012 wellcome-talk
 
Ashleigh and Sarah's: Killer Whales
Ashleigh and Sarah's: Killer WhalesAshleigh and Sarah's: Killer Whales
Ashleigh and Sarah's: Killer Whales
 
Everest - Everything is a resource
Everest - Everything is a resourceEverest - Everything is a resource
Everest - Everything is a resource
 
Professional responsibility seminar in cleveland
Professional responsibility seminar in clevelandProfessional responsibility seminar in cleveland
Professional responsibility seminar in cleveland
 
Pdi Southern California Slide Show
Pdi Southern California Slide ShowPdi Southern California Slide Show
Pdi Southern California Slide Show
 
ITP Instance Management Process V2
ITP Instance Management Process V2ITP Instance Management Process V2
ITP Instance Management Process V2
 

Ähnlich wie 2014 ucl

2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
c.titus.brown
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
c.titus.brown
 
BEACON 101: Sequencing tech
BEACON 101: Sequencing techBEACON 101: Sequencing tech
BEACON 101: Sequencing tech
c.titus.brown
 
2013 bms-retreat-talk
2013 bms-retreat-talk2013 bms-retreat-talk
2013 bms-retreat-talk
c.titus.brown
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinar
c.titus.brown
 

Ähnlich wie 2014 ucl (20)

2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
 
2014 sage-talk
2014 sage-talk2014 sage-talk
2014 sage-talk
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talk
 
2014 davis-talk
2014 davis-talk2014 davis-talk
2014 davis-talk
 
Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
BEACON 101: Sequencing tech
BEACON 101: Sequencing techBEACON 101: Sequencing tech
BEACON 101: Sequencing tech
 
2013 bms-retreat-talk
2013 bms-retreat-talk2013 bms-retreat-talk
2013 bms-retreat-talk
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinar
 
2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
 
Emerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsEmerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomics
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
Apolo Taller en BIOS
Apolo Taller en BIOS Apolo Taller en BIOS
Apolo Taller en BIOS
 
B.sc biochem i bobi u 2 database
B.sc biochem i bobi u 2 databaseB.sc biochem i bobi u 2 database
B.sc biochem i bobi u 2 database
 
Introduction to 16S Microbiome Analysis
Introduction to 16S Microbiome AnalysisIntroduction to 16S Microbiome Analysis
Introduction to 16S Microbiome Analysis
 
2014 mmg-talk
2014 mmg-talk2014 mmg-talk
2014 mmg-talk
 
Data analysis & integration challenges in genomics
Data analysis & integration challenges in genomicsData analysis & integration challenges in genomics
Data analysis & integration challenges in genomics
 

Mehr von c.titus.brown

2014 ismb-extra-slides
2014 ismb-extra-slides2014 ismb-extra-slides
2014 ismb-extra-slides
c.titus.brown
 

Mehr von c.titus.brown (18)

2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynote
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-review
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbug
 
2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
2014 abic-talk
2014 abic-talk2014 abic-talk
2014 abic-talk
 
2014 nci-edrn
2014 nci-edrn2014 nci-edrn
2014 nci-edrn
 
2014 wcgalp
2014 wcgalp2014 wcgalp
2014 wcgalp
 
2014 ismb-extra-slides
2014 ismb-extra-slides2014 ismb-extra-slides
2014 ismb-extra-slides
 
2014 bosc-keynote
2014 bosc-keynote2014 bosc-keynote
2014 bosc-keynote
 

Kürzlich hochgeladen

Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
Sérgio Sacani
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Sérgio Sacani
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
RohitNehra6
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
gindu3009
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
AlMamun560346
 

Kürzlich hochgeladen (20)

Creating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening DesignsCreating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening Designs
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 

2014 ucl

  • 1. C.Titus Brown Assistant Professor MMG, CSE, BEACON Michigan State University May 2014 ctb@msu.edu Large-scale transcriptome sequencing of non-model organisms: coping mechanisms
  • 2. We practice open science! Everything discussed here:  Code: github.com/ged-lab/ ; BSD license  Blog: http://ivory.idyll.org/blog (‘titus brown blog’)  Twitter: @ctitusbrown  Grants on LabWeb site: http://ged.msu.edu/research.html  Preprints available. Everything is > 80% reproducible.
  • 3. We practice open science! Everything discussed here:  Code: github.com/ged-lab/ ; BSD license  Blog: http://ivory.idyll.org/blog (‘titus brown blog’)  Twitter: @ctitusbrown  Grants on LabWeb site: http://ged.msu.edu/research.html  Preprints available. Everything is > 80% reproducible by you.
  • 4. The challenges of non-model transcriptomics  Missing or low quality genome reference.  Evolutionarily distant.  Most extant computational tools focus on model organisms –  Assume low polymorphism (internal variation)  Assume reference genome  Assume somewhat reliable functional annotation  More significant compute infrastructure …and cannot easily or directly be used on critters of interest.
  • 5. Outline 1. Challenges of non-model transcriptomics. 2. Lamprey: too much data, not enough genome 3. Digital normalization as a coping mechanism 4. …applied to Molgulid ascidians… 5. …and back to lamprey. 6. More transcriptome challenges 7. What’s next? (Implications of free data + free data analysis.)
  • 6. Sea lamprey in the Great Lakes  Non-native  Parasite of medium to large fishes  Caused populations of host fishes to crash Li Lab /Y-W C-D
  • 7. The problem of lamprey:  Diverged at base of vertebrates; evolutionarily distant from model organisms.  Large, complicated genome (~2 GB)  Relatively little existing sequence.  We sequenced the liver genome…
  • 8. Lamprey has incomplete genomic sequence J. Smith et al., PNAS 2009 Evidence of somatic recombination; 100s of mb of sequence eliminated from genome during development. More recent evidence (unpub, J. Smith et al.) suggests that this loss is developmentally regulated, results in changes in gene expression (due to loss of genes!), and is tissue specific. Liver genome is not the entire genome.
  • 9. Lamprey tissues for which we have mRNAseq embryo stages (late blastula, gastrula, neurula, 22b, neural- crest migration, 24c1,24c2) metamorphosis 3 (intestine, kidney) ovulatory female head skin adult intestine metamorphosis 4 (intestine, kidney) preovulatory female eye adult kidney metamorphosis 5 (liver, intestine, kidney) preovulatory female tail skin brain paired metamorphosis 6 (intestine, kidney) prespermiating male gill freshwater (gill, intestine, kidney) metamorphosis 7 (intestine, kidney) mature adult male rope tissue larval (gill, kidney, liver, intestine) monocytes spermiating male gill juvenile (intestine, liver, kidney) brain (0,3,21 dpi) spermiating male head skin lips spinal cord (0.3.21 dpi) supraneural tissue metamorphosis 1 (intestine, kidney) spermiating male muscle small parasite distal intestine, kidney, proximal intestine metamorphosis 2 (liver, intestine, salt water (gill, intestine)
  • 10. Assembly It was the best of times, it was the wor , it was the worst of times, it was the isdom, it was the age of foolishness mes, it was the age of wisdom, it was th It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness …but for lots and lots of fragments!
  • 11. Shared low-level transcripts may not reach the threshold for assembly.
  • 12. Main problem (4 years ago): We have a massive amount of data that challenges existing computers when we try to assemble it all together.
  • 13. Solution: Digital normalization (a computational version of library normalization). Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!! This 100x will consume disk space and, because of errors, memory. We can discard it for you…
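The arithmetic behind the 10:1 example can be written out as a small sketch. The function name and the dictionary layout are hypothetical; the only inputs taken from the slide are the 10:1 abundance ratio and the 10x target for the rarer molecule.

```python
def coverage_at_target(relative_abundance, target_coverage):
    """Coverage each transcript reaches once the rarest one hits the target.

    relative_abundance: dict of name -> relative abundance (any positive unit).
    target_coverage: coverage wanted for the least abundant transcript.
    """
    rarest = min(relative_abundance.values())
    return {name: target_coverage * abundance / rarest
            for name, abundance in relative_abundance.items()}


coverage = coverage_at_target({"A": 10, "B": 1}, target_coverage=10)
# Sequencing beyond the target depth is what normalization aims to discard.
excess = {name: cov - 10 for name, cov in coverage.items()}
```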
  • 20. Digital normalization approach A digital analog to cDNA library normalization, diginorm:  Is single pass: looks at each read only once;  Does not “collect” the majority of errors;  Keeps all low-coverage reads;  Smooths out coverage of sequencing. => Enables analyses that are otherwise completely impossible.
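A minimal sketch of the single-pass idea, assuming the median k-mer count of a read is used as its coverage estimate and an exact in-memory counter stands in for whatever memory-efficient counting structure a real implementation would use. The values of `k` and `cutoff` are toy values chosen for the example, not recommended settings.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=4, cutoff=3):
    """Keep a read only if its estimated coverage so far is below cutoff.

    Each read is examined exactly once (single pass). Reads from regions
    that are already well covered are dropped and their k-mers are not
    counted, so most of their errors are never stored.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


abundant = "ACGTTGCAAG"
rare = "TTTTGGGGCC"
kept = normalize([abundant] * 10 + [rare], k=4, cutoff=3)
```

In this toy run the abundant read is kept until its k-mers reach the cutoff and then discarded, while the single low-coverage read is always kept.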
  • 21. Evaluating diginorm – how?  Can’t assemble lamprey w/o diginorm; are results any good & how would we know?  Need comparative data set  …ascidians!
  • 22. Looking at the Molgula… Putnam et al., 2008, Nature. Modified from Swalla 2001
  • 23. Sea squirts! Molgula oculata Molgula occulta Molgula oculata Ciona intestinalis Elijah Lowe; collaboration w/Billie Swalla
  • 24. Tail loss and notochord genes a) M. oculata b) hybrid (occulta egg x oculata sperm) c) M. occulta Notochord cells in orange. Swalla, B. et al. Science, Vol 274, Issue 5290, 1205-1208, 15 November 1996
  • 25. Diginorm applied to Molgula embryonic mRNAseq
  • 26. Substantial time savings (3-5x) << RAM Elijah Lowe
  • 27. Question: does it matter what assembly pipeline you use? (No) [Venn diagram of four assemblies: Diginorm V/O, Raw V/O, Diginorm Trinity, Raw Trinity; region counts as extracted: 3, 70, 25, 1, 36, 13563, 35, 13, 7, 4, 23, 8, 1, 6, 5] Numbers are putative orthologs (reciprocal best hits) w/Ciona intestinalis, calculated for each assembly. Elijah Lowe
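For readers unfamiliar with reciprocal best hits, here is a minimal sketch. It assumes the best hit in each direction has already been extracted from similarity-search output into two dictionaries; the identifiers `t1`, `c1`, etc. are made up for illustration.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    return sorted(
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    )


# Hypothetical best hits: assembled transcripts -> reference genes, and back.
transcript_to_gene = {"t1": "c1", "t2": "c2", "t3": "c2"}
gene_to_transcript = {"c1": "t1", "c2": "t3"}
orthologs = reciprocal_best_hits(transcript_to_gene, gene_to_transcript)
```

Here `t2` also hits `c2`, but `c2` prefers `t3`, so only `t1`/`c1` and `t3`/`c2` count as putative orthologs.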
  • 28. Why Trinity vs Oases? Trinity is slightly better at picking out isoforms. Elijah Lowe
  • 29. How complete are these transcriptomes? Elijah Lowe
  • 30. Transcriptome assembly thoughts  We can (now) assemble really big data sets, and get pretty good results.  We have lots of evidence (some presented here :) that some assemblies are not strongly affected by digital normalization. (Note: the normalization algorithm is now a standard part of the Trinity mRNAseq pipeline.)
  • 31. Transcriptome results - lamprey  Started with 5.1 billion reads from 50 different tissues. (4 years of computational research, and about 1 month of compute time, GO HERE) Ended with:
  • 32. Lamprey transcriptome basic stats  616,000 transcripts (!)  263,000 transcript families (!) (This seems like a lot.)
  • 33. Lamprey transcriptome basic stats  616,000 transcripts  263,000 transcript families  Only 20,436 transcript families have transcripts > 1 kb (compare with mouse: 17,331 of 29,769 genes are > 1 kb). So the rule-of-thumb estimate is not that far off for long transcripts.
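The "families with transcripts > 1 kb" count can be sketched as follows. The family names and lengths are invented toy data, and the mapping from family to transcript lengths is an assumed input format, not the output of any particular tool.

```python
def count_families_with_long_transcripts(family_lengths, min_len=1000):
    """Number of families with at least one transcript longer than min_len."""
    return sum(
        1
        for lengths in family_lengths.values()
        if any(length > min_len for length in lengths)
    )


# Toy data: family id -> lengths (bp) of its assembled transcripts.
toy_families = {
    "fam1": [350, 420],
    "fam2": [1800, 900, 2500],
    "fam3": [1000],         # exactly 1 kb does not count as > 1 kb
    "fam4": [1001],
}
n_long = count_families_with_long_transcripts(toy_families)
```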
  • 34. Common vs rare genes #transcripts # samples Camille Scott
  • 35. Can look at transcripts by tissue -- Camille Scott
  • 36. Too… many… samples… Camille Scott Presence/absence clustering
  • 37. Expression-based clustering Some known biology recapitulated; and… ??? Camille Scott
  • 38. Next challenges OK, we can deal with volume of data, make pretty pictures, and ... Now what?
  • 39. Contamination! Both experimental and “real” contaminants are big problems. Camille Scott
  • 40. Pathway predictions vary dramatically depending on data set, annotation Likit Preeyanon KEGG pathway comparison across several different gene annotation sets for chicken
  • 41. The problem of lopsided gene characterization is pervasive: e.g., the brain "ignorome" "...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum—a genomic bandwagon effect." Ref.: Pandey et al. (2014), PLoS ONE 9(2), e88889. Slide courtesy Erich Schwarz
  • 42. Practical implications of diginorm  Data is (essentially) free;  For some problems, analysis is now cheaper than data gathering (i.e. essentially free);  …plus, we can run most of our approaches in the cloud (per-hour rental compute resources).
  • 43. 1. khmer-protocols  Effort to provide standard “cheap” assembly protocols for the cloud.  Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis.  Open, versioned, forkable, citable. (“Don’t bother me unless it doesn’t work.”) Pipeline: Read cleaning → Diginorm → Assembly → Annotation → RSEM differential expression
  • 44. CC0; BSD; on github; in reStructuredText.
  • 45. A few thoughts on our approach…  Explicitly a “protocol” – explicit steps, copy-paste, customizable.  No requirement for computational expertise or significant computational hardware.  ~1-5 days to teach a bench biologist to use.  $100-150 of rental compute (“cloud computing”)…  …for $1000 data set.  Adding in quality control and internal validation steps.
  • 46. Can we crowdsource bioinformatics? We already are! Bioinformatics is already a tremendously open and collaborative endeavor. (Let’s take advantage of it!) “It’s as if somewhere, out there, is a collection of totally free software that can do a far better job than ours can, with open, published methods, great support networks and fantastic tutorials. But that’s madness – who on Earth would create such an amazing resource?” - http://thescienceweb.wordpress.com/2014/02/21/bioinformatics-software-companies-have-no-clue-why-no-one-buys-their-products/
  • 47. 2. Data availability is important for annotating distant sequences [Chart legend: Anything else; Mollusc; Cephalopod; no similarity]
  • 48. Can we incentivize data sharing?  ~$100-$150/transcriptome in the cloud  Offer to analyze people’s existing data for free, IFF they open it up within a year. See: • CephSeq white paper; • “Dead Sea Scrolls & Open Marine Transcriptome Project” blog post.
  • 49. First results: Loligo genomic/transcriptome resources. Putting other people’s sequences where my mouth is: w/Josh Rosenthal and Brenton Graveley
  • 50. “Research singularity” The data a researcher generates in their lab constitutes an increasingly small component of the data used to reach a conclusion. Corollary: The true value of the data an individual investigator generates should be considered in the context of aggregate data. Even if we overcome the social barriers and incentivize sharing, we are, needless to say, not remotely prepared for sharing all the data.
  • 52. Acknowledgements. Lab members involved: Adina Howe (w/Tiedje), Jason Pell, Arend Hintze, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald, Camille Scott, Jordan Fish, Michael Crusoe, Leigh Sheneman. Collaborators: Billie Swalla (UW), Josh Rosenthal (UPR), Weiming Li (MSU), Ona Bloom (Feinstein), Jen Morgan (MBL), Joe Buxbaum (MSSM). Funding: USDA NIFA; NSF IOS; NIH; BEACON.
  • 53. Current research (khmer software): Efficient online counting of k-mers; Trimming reads on abundance; Efficient De Bruijn graph representations; Read abundance normalization; Streaming algorithms for assembly, variant calling, and error correction; Cloud assembly protocols; Efficient graph labeling & exploration; Data set partitioning approaches; Assembly-free comparison of data sets; HMM-guided assembly; Efficient search for target genes