40 Years of Genome Assembly: Are We Done Yet?

Adam M. Phillippy
Head, Genome Informatics Section
@aphillippy

[Timeline: 1980, 1995, 2001, 2010, 2012, 2014, 2020]
• Genome assembly’s 40th anniversary
• Rodger Staden (1979)
• “With modern fast sequencing techniques^1,2 and suitable computer programs it is now possible to sequence whole genomes without the need of restriction maps.”
A strategy of DNA sequencing employing computer programs. Staden. Nucleic Acids Research (1979)

• Shotgun assembly
• 1995: Haemophilus influenzae
• 1995: Overlap graphs
• 1995: de Bruijn graphs

• The first human genome
• 2000: Celera Assembler
• 2001: The human genome

• Shotgun sequencing era
Input → Extraction → Sequencing → Assembly → Output

• Long-read shotgun sequencing

• First complete de novo assemblies
• 2012: Bacteria (10^6 bp)
[Figure: Class I and Class II example genomes: Yersinia pestis CO92, Esche O26:H, Bacillus anthracis Ames]

• 2014: Yeast (10^7 bp)

• 2014: Drosophila (10^8 bp)
[Figure: Drosophila chromosome arms 2L, 2R, 3L, 3R, X]

• ????: Human (10^9 bp)

Assembly is solved:
Sequence all the things!

• HQ Reference assemblies
• >1 Mb contig N50
• Scaffolds == chromosomes
• 99.99% average base quality
• Sequencing Technology
• Long reads: PacBio
• Linked reads: 10x Genomics
• Optical maps: BioNano
• Cross linking: Arima Hi-C
Vertebrate Genomes Project
Erich Jarvis, chairperson – worldwide consortium of universities, museums, zoos, etc.
Orders: ~250
Families: ~1,000
Genera: ~10,000 (G10K)
Species: ~60,000 (B10K, Bat1K)
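
The ">1 Mb contig N50" target above can be checked with a few lines of code. This is a minimal sketch of the usual N50/NG50 definition (the length of the contig at which the sorted cumulative length first reaches half of the assembly, or half of an assumed genome size for NG50). The function name `n50` and the toy contig lengths are illustrative only and are not taken from the VGP pipeline.

```python
def n50(lengths, genome_size=None):
    """Return N50 of a list of lengths, or NG50 if genome_size is given."""
    # N50 uses the assembly total; NG50 uses an externally supplied genome size.
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # the contigs cover less than half of genome_size


# Toy contig lengths in bp (hypothetical values, for illustration only).
contigs = [5_000_000, 3_000_000, 1_500_000, 400_000, 100_000]
print("N50 :", n50(contigs))
print("NG50:", n50(contigs, genome_size=12_000_000))
```

NG50 is the fairer number when comparing assemblies of different total size, which is presumably why the later slides report NG50 rather than N50.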

VGP Assembly Pipeline
PacBio → Contigging + Purging
10XG → Scaffolding
BioNano → Scaffolding
Hi-C → Scaffolding
Gap-filling & Curation → Final assembly
[Figure panels: Polishing; Scaffolding; Primary and Alternate assemblies across exon 1, exon 2, exon 3]

• vgp.github.io
• 86 species currently posted
• 24 with all four data types
The GenomeArk
Jennifer Vashon of the Maine Department of Inland Fisheries and Wildlife, left, and UMass lynx team coordinator Tanya Lama with an adult male lynx from northern Maine whose DNA was used to create the first-ever whole genome for the species. The lynx has since been released to the wild. (MassWildlife photo / Bill Byrne)

VGP Phase 1: What did we learn?

• Iterative assembly process is not ideal
• Errors carry over and are hard to correct
• Data integration is hard
• Most tools built for a single technology
• Little reward for building complex, integrated systems
• Need to decentralize
• Open data, standard formats, modular frameworks
• Nobody* likes building infrastructure
Assembly is hard

• P(Asm|Data) ∝ P(Data|Asm)
• Read coverage
• Hi-C heatmaps
• k-mer recovery
• Comparative annotation
Assembly validation is critical
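
The proportionality on this slide follows from Bayes' rule if every candidate assembly is treated as equally likely a priori. The uniform prior is an assumption made here for illustration; the slide does not state it. Under that assumption, ranking assemblies by how well they explain the data (read coverage, Hi-C heatmaps, k-mer recovery) is the same as ranking them by posterior probability:

```latex
P(\mathrm{Asm} \mid \mathrm{Data})
  = \frac{P(\mathrm{Data} \mid \mathrm{Asm})\, P(\mathrm{Asm})}{P(\mathrm{Data})}
  \propto P(\mathrm{Data} \mid \mathrm{Asm})
  \quad \text{when } P(\mathrm{Asm}) \text{ is constant.}
```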

• Cannot map short reads to repeats
• Therefore, cannot effectively polish/assemble with short reads
• Long read assemblies more accurate in repeats (e.g. HLA, rRNA)
• PacBio can exceed 99.999% accuracy (QV50)
Long read polishing is essential
In some regions, short-read polishing can actually harm the assembly
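
The accuracy and QV figures quoted in this talk are related by the standard Phred scale, QV = -10 · log10(error rate). A minimal sketch of the conversion follows; the helper names are illustrative only.

```python
import math


def accuracy_to_qv(accuracy):
    """Phred-scaled quality for a per-base accuracy between 0 and 1."""
    return -10 * math.log10(1 - accuracy)


def qv_to_accuracy(qv):
    """Per-base accuracy implied by a Phred-scaled quality value."""
    return 1 - 10 ** (-qv / 10)


# Accuracies mentioned on the slides: 99.99% and 99.999%.
for acc in (0.9999, 0.99999):
    print(f"accuracy {acc} -> QV {accuracy_to_qv(acc):.1f}")

# QVs mentioned on the slides: 30, 40, 50.
for qv in (30, 40, 50):
    print(f"QV{qv} -> accuracy {qv_to_accuracy(qv):.5f}")
```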

Oddballs
• Marmoset chimeras
• Zebra finch GRCs
• Platypus sex chrs (10!)
• Lamprey genome deletions
• Fish with spikes and stripes
Not all vertebrates are created equal
[Figure axes: Contig N50 (Mb), Repeats (%)]

Mixed haplotypes can introduce indels
Without indel (consensus, then reads):
CGTTAAAGC
CGTTAAAGC
CGTTAAAGC
CGTTTAAGC
CGTTTAAGC
With indel (consensus, then reads):
CGTTTAAAGC
CGTT-AAAGC
CGTT-AAAGC
CGTTTAA-GC
CGTTTAA-GC
P(sub) = 0.01, P(ins) = 0.12, P(del) = 0.02, P(mat) = 0.85
P(mat)^34 * P(sub)^2 = 3.983304e-07
P(mat)^36 * P(ins)^4 = 5.967691e-07
3.983304e-07 < 5.967691e-07
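
The two likelihoods on this slide can be reproduced directly from the per-event probabilities. The sketch below multiplies independent per-base event probabilities, as the slide's expressions do. The function name `alignment_likelihood` is illustrative, and the event counts (34 matches + 2 substitutions versus 36 matches + 4 insertions) are taken from the slide.

```python
import math

# Per-base event probabilities as given on the slide.
P_SUB, P_INS, P_DEL, P_MAT = 0.01, 0.12, 0.02, 0.85


def alignment_likelihood(n_mat, n_sub=0, n_ins=0, n_del=0):
    """Likelihood of the reads given a consensus, assuming independent events."""
    return P_MAT**n_mat * P_SUB**n_sub * P_INS**n_ins * P_DEL**n_del


substitution_model = alignment_likelihood(n_mat=34, n_sub=2)
indel_model = alignment_likelihood(n_mat=36, n_ins=4)

print(f"substitution consensus: {substitution_model:.6e}")
print(f"indel consensus:        {indel_model:.6e}")
print("indel consensus preferred:", indel_model > substitution_model)
```

Because insertions are far more probable than substitutions under this error model, a consensus caller that maximizes likelihood prefers the consensus with a spurious extra base when reads from two haplotypes are mixed.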

Heterozygosity can lead to false duplications
FALCON-Unzip (P: primary, A: alternate)
                 Finch            Fish
                 P       A        P       A
Size (Gbp)       1.09    0.94     1.95    0.73
NG50 (Mbp)       3.0     0.6      2.6     0.02
BUSCO (c)        93.9    82.1     94.2    40.6
BUSCO (d)        5.0     3.3      20.8    3.4
                 1.2%             1.6%

Assemble the genomes
De novo assembly of haplotype-resolved genomes with trio binning.
Koren, Rhie, et al. Nature Biotechnology (2018)
[Figure: trio binning. Dam × Sire → F1 cross; parental k-mers assign F1 reads to the sire haplotype, the dam haplotype, or unassigned; sire and dam assemblies are then built separately]
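
The binning step can be sketched in a few lines. This is a simplified illustration of the idea, not the TrioCanu implementation: it derives haplotype-specific k-mers from toy parental sequences and assigns a read to whichever parent contributes more of them. All function names and sequences are hypothetical, and reverse complements, sequencing errors, and count normalization are ignored.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def haplotype_specific_kmers(dam_seqs, sire_seqs, k):
    """K-mers seen only in the dam data and only in the sire data."""
    dam = set().union(*(kmers(s, k) for s in dam_seqs))
    sire = set().union(*(kmers(s, k) for s in sire_seqs))
    return dam - sire, sire - dam


def bin_read(read, dam_only, sire_only, k):
    """Assign a read to 'dam', 'sire', or 'unassigned' by marker k-mer hits."""
    read_kmers = kmers(read, k)
    dam_hits = len(read_kmers & dam_only)
    sire_hits = len(read_kmers & sire_only)
    if dam_hits > sire_hits:
        return "dam"
    if sire_hits > dam_hits:
        return "sire"
    return "unassigned"


# Toy parental sequences that differ at a single base (hypothetical data).
K = 5
dam_only, sire_only = haplotype_specific_kmers(
    ["ACGTACGGTTCA"], ["ACGTACCGTTCA"], K
)
for read in ("GTACGGTT", "TACCGTTC", "ACGTAC"):
    print(read, "->", bin_read(read, dam_only, sire_only, K))
```

Long reads help here because a single read usually contains many haplotype-specific k-mers, so most reads can be assigned before assembly begins.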

Correctly resolved alleles with TrioBinning
                 Finch                               Fish
                 FALCON-Unzip     TrioCanu           FALCON-Unzip     TrioCanu
Size (Gbp)       1.09    0.94     1.05    1.06       1.95    0.73     1.37    1.36
NG50 (Mbp)       3.0     0.6      3.6     4.0        2.6     0.02     2.6     2.1
BUSCO (c)        93.9    82.1     94.4    93.3       94.2    40.6     91.6    92.7
BUSCO (d)        5.0     3.3      1.4     1.3        20.8    3.4      3.5     3.4
                 1.2%                                1.6%

Esperanza: A nearly perfect diploid
125x PacBio coverage (~60x per haplotype), no Illumina polishing needed, TrioCanu haplotig NG50 70 Mbp, BUSCOs 94%
[Figure: Esperanza haplotype assemblies, chromosomes 1 to 29 and X, shown for the Dam (yak) and Sire (Highland) haplotypes]

Can we finally finish the human genome?

• The human reference genome is incomplete
• 368 unresolved issues, 102 gaps
• Segmental duplications, satellites, rDNAs
• Centromeres, telomeres, heterochromatin
• These gaps contain important information
• Missing reference sequence leads to analysis artifacts
• Variation in these gaps is unexplored (e.g. rDNAs)
• We don’t know what we don’t know…
We need to finish the genome

Our target: CHM13hTERT
Cell line from Urvashi Surti, Pitt; SKY karyotype from Jennifer Gerton and Tamara Potapova, Stowers
N=46; XX

• Repeats are long, reads are short
• “If the overlap is of sufficient length to distinguish it from being a repeat in the sequence the two sequences must be contiguous.”
— Rodger Staden, 1979
What’s the problem?

• How long are the repeats?
• 7 kbp LINEs
• 1 Mbp+ rDNA arrays
• 1 Mbp+ centromere arrays
• 10 Mbp+ heterochromatin blocks
• Coverage and accuracy matter too
• 1,000X of 100 bp reads at 100% accuracy? NO
• 10X of 10,000,000 bp reads at 100% accuracy, YES
• 100X of 100,000 bp reads at 90% accuracy, MAYBE?
How long do reads need to be, for human?
>50% of the genome
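
A back-of-the-envelope model makes the NO / YES / MAYBE answers above concrete. The sketch assumes fixed-length reads sampled uniformly along the genome, so a read of length L spans a repeat of length R only if it starts within a window of L - R bases (less any required unique anchor on each side). Read accuracy is ignored, and the function name and `anchor` parameter are illustrative assumptions.

```python
def expected_spanning_reads(coverage, read_len, repeat_len, anchor=0):
    """Expected number of reads that fully span a repeat plus flanking anchors.

    Assumes fixed-length reads with uniformly distributed start positions.
    """
    window = read_len - repeat_len - 2 * anchor
    if window <= 0:
        return 0.0  # reads are too short to span the repeat at any depth
    return coverage * window / read_len


# 1,000X of 100 bp reads against a 7 kbp LINE.
print(expected_spanning_reads(1000, 100, 7_000))
# 10X of 10 Mbp reads against a 1 Mbp array.
print(expected_spanning_reads(10, 10_000_000, 1_000_000))
# 100X of 100 kbp reads against a 7 kbp LINE and against a 1 Mbp array.
print(expected_spanning_reads(100, 100_000, 7_000))
print(expected_spanning_reads(100, 100_000, 1_000_000))
```

Under this model extra coverage never compensates for reads shorter than the repeat. For the 100 kbp case, the 90% accuracy adds a second question the model does not address: whether near-identical repeat copies can still be told apart.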

• Length at the expense of throughput
• Read lengths >1 Mbp possible
Ultra-long nanopore sequencing
Nanopore sequencing and assembly of a human genome with ultra-long reads.
Jain et al. Nature Biotechnology (2018)

• Prediction: 30x raw UL coverage == GRCh38
How much do we need?
Nanopore sequencing and assembly of a human genome with ultra-long reads.
Jain et al. Nature Biotechnology (2018)

• 30x Nanopore ultra-long
• Contig building
• 60x PacBio
• Polishing
• 50x 10x Genomics
• Polishing
• BioNano
• Structural validation
We need long reads. Lots of long reads

• Nanopore UL read length distribution is long tailed
It pays to go deep

• From May 1 – October 29, 2018
• 62 MinION/GridION flow cells
• 8.9M reads, 98 Gb, 1.6 Gb / cell
• N50 read length 76 kb
• 44 Gb in reads >100 kb
• Max read length 1.03 Mb
• Assembled with Canu
CHM13 sequencing
Now 90+ flow cells and counting…
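
The run statistics above can be cross-checked with simple arithmetic. The genome size used for the coverage estimate (3.1 Gbp) is an assumption added here and does not appear on the slide.

```python
# Figures quoted on the slide.
flow_cells = 62
reads = 8.9e6
total_bases = 98e9
ultralong_bases = 44e9  # bases in reads >100 kb

# Assumption (not from the slide): approximate human genome size in bp.
GENOME_SIZE = 3.1e9

yield_per_cell_gb = total_bases / flow_cells / 1e9
mean_read_length_kb = total_bases / reads / 1e3
coverage = total_bases / GENOME_SIZE
ultralong_coverage = ultralong_bases / GENOME_SIZE

print(f"yield per flow cell: {yield_per_cell_gb:.1f} Gb")
print(f"mean read length:    {mean_read_length_kb:.1f} kb")
print(f"total coverage:      {coverage:.1f}x")
print(f"coverage >100 kb:    {ultralong_coverage:.1f}x")
```

The mean read length comes out far below the 76 kb N50, which is what a long-tailed read length distribution would produce.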

The human genome, 2001
ref28 NG50 contig 0.5 Mbp

The human genome, 2019
CHM13 NG50 contig 75 Mbp (70x PacBio + 35x UL ONT)
[Figure: Canu assembly contigs across chromosomes 1 to 22 and X]

The first complete assembly of a human chromosome

• Unique structural variants from PacBio
• Unique k-mers confirmed by Duplex-Seq
Stitching across the X centromere

• Per read error rates between 5–15%
• Latest Nanopore > PacBio
• Consensus accuracy >99.9%
• After Nanopore polishing QV30
• After PacBio polishing QV40
• BAC validation
• >85% of BACs at >99.8% identity
• vs. 54% for prior PacBio assembly
What about the error rate?
BAC analysis courtesy of Eichler lab @ UW
88.0 / 90.6 / 92.4

• ChrX GAGE gene locus
• 19 tandemly arrayed ~9.4 kb repeats
• Corrupted by mapping/polishing pipeline
Repeat collapse analysis
Mitchell Vollger @ UW

• Mappers prefer the “best” alignment
• Consensus can be of variable quality (patches)
• Best mapping not always the correct mapping
• Marker-based anchoring
• Increase number of secondary alignments returned
• Redefine mapping quality to measure single-copy k-mer agreement between read and assembly
Unique k-mer mapping
[Figure: read alignments before and after unique k-mer mapping]

Centromere array validation
Jennifer Gerton @ Stowers

Centromere array validation
Beth Sullivan @ Duke
1.8 Mb
0.7 Mb
0.3 Mb

It’s time to finish the human genome

• Almost!
• Have proven it’s possible for the X chromosome
• T2T assembly of all chrs within the next 2 years
• Challenges
• REPEATS, REPEATS, REPEATS
• Heterozygosity: diploids, polyploids, metagenomes
• Nanopore-only consensus quality
• Targeted long-read sequencing
Are we there yet?

• github.com/nanopore-wgs-consortium/chm13
• Draft whole-genome assemblies
• Nanopore ultra-long reads
• 10x Genomics reads
• BioNano DLS (WashU)
• PacBio (SRA)
• Coming soon:
• Arima Genomics Hi-C
• PacBio CCS
• Strand-Seq
All CHM13 data is openly released

NHGRI
• Sergey Koren
• Arang Rhie
• Jim Mullikin
• Alice Young
• Shelise Brooks
• Valerie Maduro
• Gerard Bouffard
• Sofia Barreira
• Andy Baxevanis
• Nancy Hansen
• Karen Miga, UCSC
• Jennifer Gerton, Stowers
• Tamara Potapova, Stowers
• Beth Sullivan, Duke
• Tina Graves Lindsay, WashU
• Ira Hall, WashU
• Valerie Schneider, NCBI
• Kerstin Howe, Sanger
• Jo Wood, Sanger
• Matt Loose, Nottingham
• Nick Loman, Birmingham
• Urvashi Surti, Pitt (ret.)
Acknowledgements
Evan Eichler, Mitchell Vollger, Glennis Logsdon, David Porubsky, Melanie Sorensen

It’s time to finish the human genome
Google “t2t consortium” – I’ll be hiring in the fall
The Telomere-to-Telomere (T2T) consortium is an open, community-based effort to generate the first complete assembly of a human genome.
