Genetic data is the foundation of precision medicine. Next Generation Sequencing(NGS) enable us to get our whole genome data in affordable cost. How to process huge amount of NGS data effectively ?
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
Next Generation Sequencing Informatics - Challenges and Opportunities
1. Name, Title, Department
Date
Genome Insight . Inside Genome
Next Generation Sequencing Informatics
- Challenges and Opportunities
Chung-Tsai Su, Ph.D
Atgenomix, CTO
2017/03/16 @TMU
11. Title TextPrecision Medicine Initiative
most medical treatments are designed for the
"average patient" as "one-size-fits-all-approach" that
is successful for some patients but not for others.
15. Title TextCost per Genome
https://www.genome.gov/images/content/costpergenome2015_4.jpg
Next Generation Sequencing (NGS)
debuted
Illumina HiSeq X10
debuted
Human Genome Project (HGP)
Completed
Precision Medicine Initiative
announced
17. Title TextThe First $1,000 Genome
http://systems.illumina.com/systems/hiseq-x-sequencing-system.html
18. Title TextExpectation of Data Processing Power
for illumina HiSeq X Ten
• A cluster of 10 HiSeq X instruments
• Capable of sequencing up to 18,000 whole human genomes each year
• Has a run cycle of ~3 days and produces ~150 genomes each run cycle
• Running the industry standard BWA+GATK analysis pipeline to perform this
analysis on a reasonably high-end (Dual Intel Xeon E5-2697v2 CPU – 12 core,
2.7 GHz with 96 GB DRAM) compute server takes ~24 hours per genome.
• To achieve the required throughput of 150 genomes every three days, at least
50 of these servers are required.
• Should meet a target of ~28 minutes for the completion of the mapping, aligning,
sorting, de-duplication and variant calling of each genome.
30. Title TextScale-Up vs. Scale-Out
Horizontal Scaling
(More Nodes)
VerticalScaling
(BiggerNodes)
More expensive server
(Big Memory, Many CPU cores)
Many commodity nodes
31. Title TextHadoop – HDFS, Spark, YARN
https://www.tutorialspoint.com/hadoop/hadoop_introduction.htm
33. Title TextAn Example of Word Count
http://7xjbdi.com1.z0.glb.clouddn.com/word-count-as-mapreduce.png
34. Title TextPerformance Comparison
Method
Time
(Hours)
Note
Single-thread GATK Process 16.60 Single Node
20-threads GATK Process 5.49 Single Node
40-threads GATK Process 5.48 Single Node
SeqsLab Piper with 40 Cores (GATK) 1.20 9 Nodes
SeqsLab Piper with 80 Cores (GATK) 0.99 9 Nodes
*By NA12878
36. Title TextNGS 102
Read
Mapping
Variant
CallingBAM
5百萬變異
怎麼分析?
Annotation
~ 3 days for 150
genomes per run
100 GB / sample
(30X)
~ 12 hours / sample#
100 GB / sample
(30X)
~ 70 hours / sample*
# using BWA-MEM (20 threats)
* using GATK Haplotype Caller (single threat)
$ using Annovar
5 GB / sample 10 GB / sample
~ 3 hours / sample$
∞ hours / sample
VCF VCF
FASTQ
37. Title TextChallenges
Read
Mapping
Variant
CallingBAM Annotation
Dry LabWet Lab
• Hard to screen variant efficiently
• Hard to identify causal variant
effectively
• Sample purification
• Capture capability
• Hard to distinguish variants and
sequencing error
• Hard to detect structural variants
• Hard to provide sufficient evidence
• Hard to deal with database error
• Sequencing error
• Poor in repeat and low complexity
regions
• Pseudo gene
• Short read length
• Long turn-around time
38. Title TextSequencing Error
Dr. Watson
Discoverer of the structure of DNA in 1953
< 0.1%
~ 1 %
Chimp
Most closest species to human
Sequencing Error = ~1%
Dr. Su
Cofounder of Atgenomix in 2015
~ 0.1%