Next Generation Sequencing Informatics - Challenges and Opportunities

Genome Insight . Inside Genome
Next Generation Sequencing Informatics
- Challenges and Opportunities
Chung-Tsai Su, Ph.D
Atgenomix, CTO
2017/03/16 @TMU

Questions Before My Talk (1/3)
Confidential - Anome internal use only. © 2016 Anome, Inc.
Q: How many of you have your own “genetic” data?
http://tools.thermofisher.com/content/sfs/prodImages/high/GeneChip_generic_microarray_300dpi_white.jpg
https://img.buzzfeed.com/buzzfeed-static/static/2016-10/26/13/campaign_images/buzzfeed-prod-fastlane01/23andme-anne-wojcicki-next-generation-sequencing-2-24817-1477502838-3_dblbig.jpg
http://www.kenkon.com.tw/data/editor/images/edf08288b46f7acb4a26ffca9a8c1d82.jpg

Right to Know and Freedom to Choose
http://www.fashiongonerogue.com/angelina-jolie-movies-style-photos/2/

Q: How many of you have heard about Next-Generation Sequencing (NGS)?
http://www.anthonybaldor.com/thoughts-and-notes/bioblog/next-generation-sequencing/

Q: How many of you have heard about Spark?
http://vignette2.wikia.nocookie.net/vsbattles/images/4/49/Spocks.png/revision/latest?cb=20160501151337
http://spark.apache.org/images/spark-logo-trademark.png

My Logical Thinking
Precision Medicine
Human Genome
Next Generation Sequencing
Big Data Technology

About Me
Education
1994-1998 NTNU ICE Bachelor
1998-2000 NTU CSIE Master
2000-2007 NTU CSIE Ph.D
Experience
2000-2005 Avamax Engineer
2007-2008 NTU Post Doc.
2008-2015 Trend Micro Big Data Architect
2015-now Atgenomix CTO & Cofounder

Agenda
• Precision Medicine
• Technology
  - Next Generation Sequencing
  - Data Science
  - Big Data Technology
• Challenges
• Opportunities
• Lessons Learned
http://i2.kym-cdn.com/photos/images/newsfeed/000/653/558/88e.jpg

Medicine Today in the US
https://www.washingtonpost.com/news/to-your-health/wp/2016/05/03/researchers-medical-errors-now-third-leading-cause-of-death-in-united-states/?utm_term=.066000857138

Precision Medicine Initiative
Most medical treatments are designed for the "average patient", a "one-size-fits-all" approach that is successful for some patients but not for others.

Imprecision Medicine
http://www.nature.com/news/personalized-medicine-time-for-one-person-trials-1.17411
(High cholesterol) (Arthritis)
(Schizophrenia) (Heartburn)
(Depression) (Asthma) (Psoriasis)
(Crohn's disease)
(Multiple sclerosis) (Neutropenia)
http://www.fda.gov/downloads/ScienceResearch/SpecialTopics/PersonalizedMedicine/UCM372421.pdf

The 1000 Genomes Project
http://www.nature.com/nature/journal/v526/n7571/full/nature15393.html

Next Generation Sequencing
(NGS)

Cost per Genome
https://www.genome.gov/images/content/costpergenome2015_4.jpg
• Next Generation Sequencing (NGS) debuted
• Illumina HiSeq X10 debuted
• Human Genome Project (HGP) completed
• Precision Medicine Initiative announced

Illumina Product
http://www.illumina.com/content/dam/illumina-marketing/documents/products/brochures/brochure_sequencing_systems_portfolio.pdf

The First $1,000 Genome
http://systems.illumina.com/systems/hiseq-x-sequencing-system.html

Expectation of Data Processing Power for Illumina HiSeq X Ten
• A cluster of 10 HiSeq X instruments
• Capable of sequencing up to 18,000 whole human genomes per year
• Has a run cycle of ~3 days and produces ~150 genomes per run cycle
• Running the industry-standard BWA+GATK analysis pipeline on a reasonably high-end compute server (Dual Intel Xeon E5-2697v2 CPU, 12 cores, 2.7 GHz, 96 GB DRAM) takes ~24 hours per genome.
• To achieve the required throughput of 150 genomes every three days, at least 50 of these servers are required.
• Equivalently, the pipeline should meet a target of ~28 minutes per genome for the completion of mapping, aligning, sorting, de-duplication and variant calling.
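The server count and the per-genome time budget on this slide follow from simple throughput arithmetic. The sketch below reproduces it, using only the figures quoted above; the function names are illustrative, and a 365-day year with back-to-back run cycles is an assumption.

```python
import math

# Figures taken from the slide.
GENOMES_PER_RUN = 150    # ~150 genomes per run cycle
RUN_CYCLE_DAYS = 3       # ~3 days per run cycle
HOURS_PER_GENOME = 24    # BWA+GATK on one high-end server


def servers_needed(genomes_per_run, run_cycle_days, hours_per_genome):
    """Servers required so that analysis finishes within one run cycle."""
    cycle_hours = run_cycle_days * 24
    return math.ceil(genomes_per_run * hours_per_genome / cycle_hours)


def minutes_per_genome_budget(genomes_per_run, run_cycle_days):
    """Time available per genome if genomes are processed one after another."""
    cycle_minutes = run_cycle_days * 24 * 60
    return cycle_minutes / genomes_per_run


def genomes_per_year(genomes_per_run, run_cycle_days, days_per_year=365):
    """Yearly output, assuming run cycles follow each other without gaps."""
    return genomes_per_run * days_per_year / run_cycle_days


print(servers_needed(GENOMES_PER_RUN, RUN_CYCLE_DAYS, HOURS_PER_GENOME))
print(minutes_per_genome_budget(GENOMES_PER_RUN, RUN_CYCLE_DAYS))
print(genomes_per_year(GENOMES_PER_RUN, RUN_CYCLE_DAYS))
```

The three results correspond to the "at least 50 servers", "~28 minutes" and "up to 18,000 genomes per year" figures in the bullets.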

NGS 101
https://www.broadinstitute.org/gatk/img/cartoon-blackbox-workflow-web-blackblue.png
Wet Lab | Dry Lab

GATK Best Practice
http://cdn.vanillaforums.com/gatk.vanillaforums.com/FileUpload/eb/44f317f8850ba74b64ba47b02d1bae.png
How do we analyze 4 to 5 million variants?

Read Mapping
http://www.nature.com/nrg/journal/v13/n1/fig_tab/nrg3117_F1.html

Variant Calling
http://www.clcsupport.com/clcgenomicsworkbench/754/SNP-example.png

Data Science
https://media.licdn.com/mpr/mpr/p/5/005/06d/041/02978e8.jpg

Data Scientist
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
http://www.marketingdistillery.com/wp-content/uploads/2014/11/mds_f.png
http://buzzorange.com/techorange/2012/10/05/data-scientists-the-definition-of-sexy/

The Three Facets of Data Science
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

The Three Facets of Precision Medicine
Clinic
Data
Science
Precision Medicine

4V
Volume: MB, GB, TB, PB
Velocity: batch, periodic, near real-time, real-time
Variety
Veracity

Scale-Up vs. Scale-Out
Vertical Scaling (Bigger Nodes): more expensive server (big memory, many CPU cores)
Horizontal Scaling (More Nodes): many commodity nodes

Hadoop – HDFS, Spark, YARN
https://www.tutorialspoint.com/hadoop/hadoop_introduction.htm

Map/Reduce
http://railscarma.com/wp-content/uploads/2015/02/graphics1.gif

An Example of Word Count
http://7xjbdi.com1.z0.glb.clouddn.com/word-count-as-mapreduce.png
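Word count is the usual first Map/Reduce example. The sketch below runs the map, shuffle and reduce phases in a single Python process to show the data flow only; it is not Hadoop or Spark code, and the function names are illustrative.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce over a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


print(word_count(["Deer Bear River", "Car Car River", "Deer Car Bear"]))
```

In a real cluster, each map call and each reduce call can run on a different node, because neither depends on any other line or any other word.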

Performance Comparison
Method                                Time (Hours)   Note
Single-thread GATK Process            16.60          Single Node
20-thread GATK Process                5.49           Single Node
40-thread GATK Process                5.48           Single Node
SeqsLab Piper with 40 Cores (GATK)    1.20           9 Nodes
SeqsLab Piper with 80 Cores (GATK)    0.99           9 Nodes
* Measured on sample NA12878
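The table is easier to compare as speedup over the single-thread baseline and as efficiency per worker. The sketch below uses only the times in the table; treating each thread or core as one "worker" is an assumption made for illustration.

```python
BASELINE_HOURS = 16.60  # single-thread GATK process, from the table

# (label, workers, hours), all taken from the table above.
RUNS = [
    ("20-thread GATK, single node", 20, 5.49),
    ("40-thread GATK, single node", 40, 5.48),
    ("SeqsLab Piper, 40 cores, 9 nodes", 40, 1.20),
    ("SeqsLab Piper, 80 cores, 9 nodes", 80, 0.99),
]


def speedup(baseline_hours, hours):
    """How many times faster a run is than the baseline."""
    return baseline_hours / hours


def efficiency(baseline_hours, hours, workers):
    """Speedup divided by worker count; 1.0 would be perfect scaling."""
    return speedup(baseline_hours, hours) / workers


for label, workers, hours in RUNS:
    s = speedup(BASELINE_HOURS, hours)
    e = efficiency(BASELINE_HOURS, hours, workers)
    print(f"{label}: speedup {s:.2f}x, efficiency {e:.2f}")
```

Going from 20 to 40 threads on a single node barely changes the run time, while spreading the same 40 cores over 9 nodes does, which is the scale-out argument of the earlier slides.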

NGS 102
Sequencing: ~3 days for 150 genomes per run
FASTQ: 100 GB / sample (30X)
Read Mapping: ~12 hours / sample#
BAM: 100 GB / sample (30X)
Variant Calling: ~70 hours / sample*
VCF: 5 GB / sample
Annotation: ~3 hours / sample$
VCF: 10 GB / sample
How do we analyze 5 million variants? ∞ hours / sample
# using BWA-MEM (20 threads)
* using GATK HaplotypeCaller (single thread)
$ using Annovar
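Adding up the per-step times on this slide shows where the dry-lab time goes. The sketch below uses the three step times quoted on the slide (read mapping, variant calling, annotation) and assumes, for illustration only, that one sample occupies one server for the whole pipeline.

```python
import math

# Hours per sample for each dry-lab step, from the slide.
STEP_HOURS = {
    "Read Mapping": 12,      # BWA-MEM, 20 threads
    "Variant Calling": 70,   # GATK HaplotypeCaller, single thread
    "Annotation": 3,         # Annovar
}
GENOMES_PER_RUN = 150        # from the slide
RUN_CYCLE_HOURS = 3 * 24     # ~3 days per sequencing run


def total_hours(step_hours):
    """Total dry-lab hours for one sample."""
    return sum(step_hours.values())


def bottleneck(step_hours):
    """Name of the slowest step."""
    return max(step_hours, key=step_hours.get)


def servers_to_keep_pace(step_hours, genomes, cycle_hours):
    """Servers needed to finish one run's samples before the next run ends.

    Assumes one sample per server at a time (a simplification).
    """
    return math.ceil(genomes * total_hours(step_hours) / cycle_hours)


print(total_hours(STEP_HOURS), "hours per sample")
print("slowest step:", bottleneck(STEP_HOURS))
print(servers_to_keep_pace(STEP_HOURS, GENOMES_PER_RUN, RUN_CYCLE_HOURS), "servers")
```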

Challenges
Wet Lab | Dry Lab: Read Mapping, BAM, Variant Calling, Annotation
• Hard to screen variants efficiently
• Hard to identify causal variants effectively
• Sample purification
• Capture capability
• Hard to distinguish variants from sequencing errors
• Hard to detect structural variants
• Hard to provide sufficient evidence
• Hard to deal with database errors
• Sequencing error
• Poor performance in repeat and low-complexity regions
• Pseudogenes
• Short read length
• Long turn-around time

Sequencing Error
Dr. Watson
Discoverer of the structure of DNA in 1953
< 0.1%
~ 1 %
Chimp
Closest species to humans
Sequencing Error = ~1%
Dr. Su
Cofounder of Atgenomix in 2015
~ 0.1%
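With a per-base sequencing error of ~1%, a single discordant read at a site is weak evidence for a variant, which is why depth matters. The sketch below computes how likely it is that several reads at one site are wrong by chance. It assumes errors are independent between reads and uses the 30X depth quoted on the NGS 102 slide; real callers such as GATK use more elaborate models, so this is an illustration rather than their method.

```python
from math import comb

ERROR_RATE = 0.01   # from the slide: sequencing error of ~1% per base
DEPTH = 30          # from the NGS 102 slide: 30X coverage


def prob_at_least(k, depth=DEPTH, p=ERROR_RATE):
    """P(X >= k), where X ~ Binomial(depth, p) counts erroneous reads at one site."""
    return sum(
        comb(depth, i) * p**i * (1 - p) ** (depth - i)
        for i in range(k, depth + 1)
    )


for k in (1, 3, 5):
    print(f"P(at least {k} erroneous reads out of {DEPTH}) = {prob_at_least(k):.2e}")
```

The probability falls quickly as the required number of discordant reads grows, so requiring several supporting reads filters out most random errors under this simplified model.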

ACMG Standards and Guidelines
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4544753/

ACMG Evidence Framework
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4544753/
Rule set:

Annotation Database
Population: 1000 Genomes (phase III), ESP6500, dbSNP, ExAC, DGV, YanHuang
Disease: CLINVAR, COSMIC, DVD, OMIM
LOVD: ARVC, Chrominum, COL4A, Coloncancer, EahadcoagulationFacator, Eurowabb, Eye, Globin, Mendelian, Mismatchrepairegenes, Monogenicdiabetes, Musculardystrophy, OI, Parkinson, RB1, RetinalHearing, Shared1, TSC, VUMC, Xchromsome, Zjucggm
ENCODE: DHS, H3K27AC, H3K4ME1, H3K4ME3, H3K9AC, TFBS
Functional: dbNSFP, dbSCSNV, Macarthuretal
Genome Context: CGD, dbNSFP (gene), GENCODE, GeneOntology, GWAS, Haploreg, HapmapGF, HapmapLD

GATK Best Practice (1/2)
https://software.broadinstitute.org/gatk/best-practices/

GATK Best Practice (2/2)

But …
https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php

SNPs and Indels

Types of Structural Variation
http://www.nature.com/nmeth/journal/v9/n2/full/nmeth.1858.html

Detection Methods of Structural Variation

Structural Variation from 1000Genomes (1/2)
Variant Type                             Caller
Deletion (DEL)                           GenomeStrip, Breakdancer, CNVnator, Delly, Variation Hunter, UWash RD, Pindel (Short Deletions)
Multiple Copy Number Variation (mCNV)    UWash SSL, GenomeStrip
Duplications (DUP)                       Delly, UWash RD, GenomeStrip
Inversions (INV)                         Delly
Mobile Element Insertions (MEI)          MELT
Mitochondrial Insertions (NUMT)          Dinumt

Structural Variation from 1000Genomes (2/2)

$1,000 Whole Genome Sequencing (WGS)
https://www.technologyreview.com/s/600950/for-999-veritas-genetics-will-put-your-genome-on-a-smartphone-app/

Why WGS?

$100 Genome
http://www.bio-itworld.com/2017/1/9/illumina-launches-novoseq-sequencers-aiming-replace-1900-sequencers.aspx

Old Drugs, New Uses
http://www.bbc.com/news/health-39253537

Paradigm Shift
https://www.slideshare.net/TWilckens/inn-ventis-precision-medicine2014

Vision of Precision Medicine

Next-Generation Biology
http://journals.plos.org/plosbiology/article?id=10.1371%2Fjournal.pbio.2002050
http://wp.sanger.ac.uk/barrettgroup/files/2012/08/expLab.jpg

Graph Genome
https://www.sevenbridges.com/graph/
https://vimeo.com/184983995

Back to The Beginning
http://www.currencyfundgroup.com/2015/03/07/scientists-are-developing-ways-to-edit-the-dna-of-tomorrows-children/
