1. Nanopore sequencing of a human genome
AGBT
February 2017
Jared Simpson
Ontario Institute for Cancer Research
&
Department of Computer Science
University of Toronto
2. Overview
• Brief intro to nanopore sequencing and signal-level analysis
• Initial results from sequencing a human genome at high coverage
• Direct detection of cytosine methylation
Disclosure: ONT provides research funding to my lab
13. Signal-level Analysis
• First-generation basecallers used hidden Markov models (HMMs)
   - Calculate the best sequence of 6-mers using Gaussian emission distributions
   - Examples: R7 metrichor, nanocall
• The latest basecallers use recurrent neural networks (RNNs) to capture longer-range dependencies
   - Predict a 6-mer label for each event using the RNN, then assemble the 6-mers into basecalled reads
   - Examples: R9 metrichor, nanonet, deepnano
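A minimal sketch of the HMM idea, under loud assumptions: 2-mers instead of 6-mers, a made-up pore model (`MEANS`, `SIGMA`), exactly one event per base (no stays or skips), and uniform transitions. None of these values come from a real basecaller; the sketch only shows Viterbi decoding over k-mer states with Gaussian emissions.

```python
import itertools
import math

K = 2  # toy k-mer size (assumption); the slides describe 6-mers
KMERS = ["".join(p) for p in itertools.product("ACGT", repeat=K)]
# Hypothetical pore model: one mean current level per k-mer, evenly spaced.
MEANS = {kmer: 50.0 + 3.0 * i for i, kmer in enumerate(KMERS)}
SIGMA = 1.0  # hypothetical shared standard deviation


def log_gauss(x, mu, sigma):
    """Log density of a Gaussian with mean mu and standard deviation sigma."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def viterbi_basecall(events):
    """Return the most likely base sequence for a list of event means."""
    if not events:
        return ""
    # score[k]: best log-probability of any k-mer path ending in k
    score = {k: log_gauss(events[0], MEANS[k], SIGMA) - math.log(len(KMERS))
             for k in KMERS}
    back = []  # one dict of back-pointers per event after the first
    for x in events[1:]:
        new_score, pointers = {}, {}
        for k in KMERS:
            # Allowed predecessors overlap k by K-1 bases (a one-base step).
            preds = [p for p in KMERS if p[1:] == k[:-1]]
            best = max(preds, key=lambda p: score[p])
            new_score[k] = score[best] - math.log(4) + log_gauss(x, MEANS[k], SIGMA)
            pointers[k] = best
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(KMERS, key=lambda k: score[k])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    # Assemble overlapping k-mers into a sequence.
    return path[0] + "".join(k[-1] for k in path[1:])
```

Feeding in the model levels for the k-mers of a known sequence, plus small noise, should recover that sequence. Real signal data also needs states for stays and skips, which this sketch leaves out.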
14. Nanopolish
• Toolkit for working with signal-level data
• Originally designed for improving a consensus sequence using the
signals from multiple reads
• Extended to call SNPs for the mobile Ebola sequencing project
• A few new features in development that I’ll talk about later
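Figure: candidate sequences S are scored by P(D|S), computed from the signal data D1, D2, …, Dn of the reads.
Candidate S            Score
…ACTACGATCGACTTA…      -176
…ACTACCATCGACTTA…      -191
…ACTACGATC-ACTTA…      -168
…ACTACCATC-ACTTA…      -185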
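The scoring shown on this slide can be sketched as follows, assuming the reads are treated as independent so that log P(D|S) is the sum of per-read terms log P(Di|S). The per-read values below are hypothetical, chosen only so that the totals reproduce the four scores on the slide. Pairing the scores with the candidates in listed order is also an assumption, and `best_candidate` is an illustrative helper, not a nanopolish function.

```python
def best_candidate(per_read_loglik):
    """Pick the candidate with the highest summed log-likelihood.

    per_read_loglik maps each candidate sequence to a list of
    log P(D_i | S) values, one per read.
    """
    totals = {seq: sum(values) for seq, values in per_read_loglik.items()}
    best = max(totals, key=totals.get)
    return best, totals


# Hypothetical per-read log-likelihoods for three reads (D1, D2, D3).
per_read = {
    "ACTACGATCGACTTA": [-60.0, -58.5, -57.5],
    "ACTACCATCGACTTA": [-64.0, -63.5, -63.5],
    "ACTACGATC-ACTTA": [-56.0, -56.5, -55.5],
    "ACTACCATC-ACTTA": [-62.0, -61.5, -61.5],
}
best, totals = best_candidate(per_read)
print(best, totals[best])
```

In a real polishing tool the per-read terms would come from a signal-level model and candidates would be proposed and re-scored iteratively. Here they are fixed inputs.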
15. Human Sequencing Consortium
A group of MinION users pooled their flowcells to sequence a human genome (NA12878).
The data are publicly available on GitHub/AWS:
https://github.com/nanopore-wgs-consortium/NA12878
16. Flowcell Yield
Figure: yield per flowcell, grouped by sequencing site (Birmingham, East Anglia, Nottingham, British Columbia, Santa Cruz). Groups of runs are annotated "Fresh Cell DNA" or "Rapid Library Kit". Annotated yield: 2.3 Gb.
Credit: John Tyson
17. Average Read Length
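Figure: average read length per flowcell, grouped by sequencing site (Birmingham, East Anglia, Nottingham, British Columbia, Santa Cruz). Groups of runs are annotated "Fresh Cell DNA" or "Rapid Library Kit". Annotated average read length: 6.6 kb.
Credit: John Tyson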
20. Canu Assembly + HiC Scaffolding
NG50: 45.8 Mbp
Topological domains in mammalian genomes identified by analysis of chromatin interactions. Dixon et al. Nature (2012)
Scaffolding of long read assemblies using long range contact information. Ghurye et al. bioRxiv (2016)
Credit: Sergey Koren
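For reference, NG50 is the length of the contig (or scaffold) at which the cumulative length, taken from longest to shortest, first reaches half of the estimated genome size. A minimal sketch follows. The function name and the toy lengths are illustrative and are not taken from the assembly.

```python
def ng50(contig_lengths, genome_size):
    """Return NG50, or None if the contigs cover less than half the genome."""
    half = genome_size / 2
    running = 0
    # Walk contigs from longest to shortest, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return None


# Toy example: lengths in arbitrary units, genome size 200.
print(ng50([50, 40, 30, 20, 10], 200))
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses the genome size, so it stays comparable between assemblies of different total size.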
21. Human Assembly Consensus
- This is a work in progress
- Polishing a single 6 Mbp chr20 contig
- 30X data set is NA12878 consortium data only
- 60X data set includes 30X PCR-amplified NA12878 provided by ONT
- Stats calculated from bwa mem alignments to GRCh38
- Differences that matched an NA12878 variant were not counted as errors
Assembly                                Percent Identity
canu                                    94.8%
canu+racon                              96.5%
canu+racon+nanopolish (30X)             99.1%
canu+racon+nanopolish (60X)             99.4%
canu+racon+nanopolish (60X) + pilon     99.6%
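The slides do not say exactly how percent identity was computed from the alignments. One common definition, assumed here, is 100 × (alignment columns − edit distance) / alignment columns, using the CIGAR string and the NM tag of a SAM record. The helper name `percent_identity` is hypothetical. Excluding known NA12878 variants, as described above, is not shown.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def percent_identity(cigar, nm):
    """Percent identity from a CIGAR string and the NM (edit distance) tag.

    Alignment columns are matches/mismatches (M, =, X) plus inserted (I)
    and deleted (D) bases; clipping (S, H) is ignored.
    """
    columns = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MID=X")
    if columns == 0:
        raise ValueError("no aligned columns in CIGAR")
    return 100.0 * (columns - nm) / columns


# 90 aligned bases, a 2-base insertion and an 8-base deletion: 100 columns.
print(percent_identity("5S90M2I8D", 12))
```

Other definitions, such as counting each gap once or excluding indels, give different numbers, so identities are only comparable when computed the same way.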
22. Human Assembly Consensus
Remaining errors: some homopolymers, microsatellites, and large differences that are difficult to polish.
23. Data Improvements
• Homopolymers have been the main source of residual errors in ONT assemblies
• Earlier basecallers would collapse long homopolymers to a single 6-mer
• The newest ONT basecaller (“scrappie”) estimates the homopolymer length
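To illustrate the collapse: within a homopolymer longer than the k-mer, consecutive k-mers are identical (TTTTTT follows TTTTTT), so a caller that labels events with k-mers cannot count how many there were. The sketch below simulates the effect by truncating every run to at most k bases. It is an illustration of the resulting error, not the logic of any actual basecaller.

```python
from itertools import groupby


def collapse_homopolymers(seq, k=6):
    """Truncate every run of identical bases to at most k bases."""
    return "".join(base * min(len(list(run)), k) for base, run in groupby(seq))


true_seq = "ACG" + "T" * 10 + "GA"
called = collapse_homopolymers(true_seq)
print(true_seq)
print(called)
```

Runs of k bases or fewer pass through unchanged, which matches the observation that the residual errors concentrate in long homopolymers.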
24. Scrappie basecalls
Figure: alignments and coverage tracks for Metrichor, Nanonet, and Scrappie basecalls against the reference sequence, over a 135 bp window of chr20 (about 23,389,180 bp to 23,389,300 bp). Coverage track ranges: [0 - 28], [0 - 23], [0 - 28].
Credit: Sergey Koren
25. Data Improvements
• “2D” reads use a hairpin adaptor to read both strands of DNA
• 2D reads have higher accuracy, but with high variance due to the effect of base pairing after the pore
• New method of reading both strands: 1D^2
Figure provided by ONT
26. 1D^2 Accuracy
- All runs are E. coli
- R7.3-2D from Nick Loman
- R9.2-2D from OICR
- R9.4-1D^2 provided by ONT
Figure: density plot of read accuracy (x-axis: accuracy, ticks 75 to 100; y-axis: density, ticks 0.0 to 0.2), one curve per version: R7.3-2D, R9.2-2D, R9.4-1D^2.
27. Next step: Human Assembly v2
• Improvements to basecalling and read accuracy will help our assembly
• Using scrappie reads from chromosome 20 improves the canu assembly identity from ~95% to 97.5%
• Planning to polish this assembly
35. Summary
• Improvements to ONT throughput and accuracy have allowed sequencing of large genomes
• Initial human assembly is highly contiguous
• Further improvements to accuracy are needed; the new “scrappie” basecaller is promising
• 5-mC can be detected directly from signal-level data, concordant with bisulfite sequencing
36. Acknowledgements
OICR: Matei David, Phil Zuzarte, Jonathan Dursi, Lars Jorgensen
Birmingham: Nick Loman, Josh Quick
Johns Hopkins University: Winston Timp, Rachael Workman
NHGRI: Sergey Koren, Adam Phillippy
NA12878 Sequencing: Matt Loose, John Tyson, Miten Jain, Mark Akeson, Justin O’Grady and many others contributing analysis
Oxford Nanopore Technologies: Chris Wright, Clive Brown, Tim Massingham