40 Years of Genome Assembly: Are We Done Yet?
1. Adam M. Phillippy
Head, Genome Informatics Section
@aphillippy
2. [Timeline figure; years shown: 1980, 1995, 2001, 2010, 2012, 2014, 2020]
• Genome assembly’s 40th anniversary
• Rodger Staden (1979)
• “With modern fast sequencing techniques and
suitable computer programs it is now possible to
sequence whole genomes without the need of
restriction maps.”
A strategy of DNA sequencing employing computer programs. Staden. Nucleic Acids Research (1979)
15. VGP Assembly Pipeline
• PacBio: Contigging + Purging
• 10XG: Scaffolding
• BioNano: Scaffolding
• Hi-C: Scaffolding
• Gap-filling & Polishing
• Curation
• Final assembly
[Figure: Primary and Alternate haplotypes shown across exon 1, exon 2, exon 3]
16. • vgp.github.io
• 86 species currently posted
• 24 with all four data types
The GenomeArk
Jennifer Vashon of Maine Department of Inland Fisheries and Wildlife, left, and
UMass lynx team coordinator, Tanya Lama, with an adult male lynx from northern
Maine whose DNA was used to create first-ever whole genome for the species.
The lynx has since been released to the wild. (MassWildlife photo / Bill Byrne)
18. • Iterative assembly process is not ideal
• Errors carry over and are hard to correct
• Data integration is hard
• Most tools built for a single technology
• Little reward for building complex, integrated systems
• Need to decentralize
• Open data, standard formats, modular frameworks
• Nobody* likes building infrastructure
Assembly is hard
20. • Cannot map short reads to repeats
• Therefore, cannot effectively polish/assemble with short reads
• Long read assemblies more accurate in repeats (e.g. HLA, rRNA)
• PacBio can exceed 99.999% accuracy (QV50)
Long read polishing is essential
In some regions, short-read polishing can actually harm the assembly
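The QV50 figure on this slide uses the Phred scale, where QV = -10 · log10(error rate), so 99.999% accuracy corresponds to one error per 100,000 bases. A minimal sketch of the conversion follows; the function names are illustrative and not taken from any tool named in the talk.

```python
import math


def accuracy_to_qv(accuracy: float) -> float:
    """Convert per-base accuracy (0..1) to a Phred-scaled quality value."""
    error = 1.0 - accuracy
    if error <= 0.0:
        raise ValueError("accuracy must be below 1.0 for a finite QV")
    return -10.0 * math.log10(error)


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value back to per-base accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    print(round(accuracy_to_qv(0.99999), 3))
    print(qv_to_accuracy(50))
```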
21. Oddballs
• Marmoset chimeras
• Zebra finch GRCs
• Platypus sex chrs (10!)
• Lamprey genome deletions
• Fish with spikes and stripes
Not all vertebrates are created equal
[Figure axes: Contig N50 (Mb); Repeats (%)]
23. Heterozygosity can lead to false duplications
FALCON-Unzip assemblies (P = primary, A = alternate)
             Finch P   Finch A   Fish P   Fish A
Size (Gbp)   1.09      0.94      1.95     0.73
NG50 (Mbp)   3.0       0.6       2.6      0.02
BUSCO (c)    93.9      82.1      94.2     40.6
BUSCO (d)    5.0       3.3       20.8     3.4
1.2% 1.6%
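For readers unfamiliar with the NG50 row above: NG50 is the contig length at which the longest contigs, summed in descending order, first cover half of the estimated genome size (rather than half of the assembly, as in N50). This is a small sketch of that standard definition, not the script used to produce the slide's numbers.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly, or 0 if the contigs cover
    less than half of the estimated genome size."""
    target = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


if __name__ == "__main__":
    # Toy contig lengths and genome size, chosen only for illustration.
    print(ng50([5, 4, 3, 2, 1], genome_size=20))
```

Because the denominator is the genome size, a falsely duplicated assembly that is larger than the genome does not lower its own NG50 the way it would lower N50.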
24. Assemble the genomes
De novo assembly of haplotype-resolved genomes with trio binning.
Koren, Rhie, et al. Nature Biotechnology (2018)
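[Figure: trio binning. Dam × Sire → F1 cross; parental k-mers assign F1 reads to the Sire haplotype, the Dam haplotype, or Unassigned; the Sire and Dam bins are assembled separately into a Sire assembly and a Dam assembly]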
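The binning idea on this slide can be sketched in a few lines. This is a simplified illustration, not the published implementation: it assumes error-free parental reads held in memory, uses a plain majority vote without the normalization a real tool would need, and all function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Set of canonical k-mers (the lexicographically smaller of each
    k-mer and its reverse complement), so strand does not matter."""
    out = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        rc = kmer.translate(_COMPLEMENT)[::-1]
        out.add(min(kmer, rc))
    return out


def parent_specific_kmers(sire_reads, dam_reads, k):
    """k-mers seen in only one parent: (sire_only, dam_only)."""
    sire = set().union(*(canonical_kmers(r, k) for r in sire_reads))
    dam = set().union(*(canonical_kmers(r, k) for r in dam_reads))
    return sire - dam, dam - sire


def bin_read(read, sire_only, dam_only, k):
    """Assign an F1 read to 'sire', 'dam', or 'unassigned' by counting
    the parent-specific k-mers it contains (simple majority vote)."""
    kmers = canonical_kmers(read, k)
    sire_hits = len(kmers & sire_only)
    dam_hits = len(kmers & dam_only)
    if sire_hits > dam_hits:
        return "sire"
    if dam_hits > sire_hits:
        return "dam"
    return "unassigned"


if __name__ == "__main__":
    sire_only, dam_only = parent_specific_kmers(["AAAAAAAC"], ["AAAAAAAG"], k=4)
    for read in ["AAAAC", "CTTTT", "AAAAA"]:
        print(read, bin_read(read, sire_only, dam_only, k=4))
```

Each bin would then be assembled independently, with unassigned reads (those carrying no haplotype-specific k-mers) handled separately.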
28. • The human reference genome is incomplete
• 368 unresolved issues, 102 gaps
• Segmental duplications, satellites, rDNAs
• Centromeres, telomeres, heterochromatin
• These gaps contain important information
• Missing reference sequence leads to analysis artifacts
• Variation in these gaps is unexplored (e.g. rDNAs)
• We don’t know what we don’t know…
We need to finish the genome
29. Our target: CHM13hTERT
Cell line from Urvashi Surti, Pitt; SKY karyotype from Jennifer Gerton and Tamara Potapova, Stowers
N=46; XX
30. • Repeats are long, reads are short
• “If the overlap is of sufficient length to distinguish
it from being a repeat in the sequence the two
sequences must be contiguous.”
— Rodger Staden, 1979
What’s the problem?
31. • How long are the repeats?
• 7 kbp LINEs
• 1 Mbp+ rDNA arrays
• 1 Mbp+ centromere arrays
• 10 Mbp+ heterochromatin blocks
• Coverage and accuracy matter too
• 1,000X of 100 bp reads at 100% accuracy? NO
• 10X of 10,000,000 bp reads at 100% accuracy? YES
• 100X of 100,000 bp reads at 90% accuracy? MAYBE
How long do reads need to be, for human?
>50% of the genome
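A rough way to compare the three scenarios above is to ask how many reads each one implies and which repeat classes a single read could span. The sketch below makes two assumptions that are not on the slide: a genome size of about 3.1 Gbp, and the slide's "+" repeat sizes treated as exact lower bounds. It ignores read accuracy entirely, which is why it cannot settle the "MAYBE" case.

```python
import math

# Repeat sizes from the slide; the "+" entries are lower bounds.
REPEATS_BP = {
    "LINE": 7_000,
    "rDNA array": 1_000_000,
    "centromere array": 1_000_000,
    "heterochromatin block": 10_000_000,
}

GENOME_SIZE_BP = 3_100_000_000  # assumed approximate human genome size


def reads_needed(coverage, read_len, genome_size=GENOME_SIZE_BP):
    """Number of reads of a fixed length needed to reach a coverage depth."""
    return math.ceil(coverage * genome_size / read_len)


def spannable(read_len, repeats=REPEATS_BP):
    """Repeat classes strictly shorter than the read, i.e. ones a single
    read could cross with at least some unique sequence on either side."""
    return [name for name, size in repeats.items() if read_len > size]


if __name__ == "__main__":
    for coverage, read_len in [(1_000, 100), (10, 10_000_000), (100, 100_000)]:
        print(coverage, read_len, reads_needed(coverage, read_len), spannable(read_len))
```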
32. • Length at the expense of throughput
• Read lengths >1 Mbp possible
Ultra-long nanopore sequencing
Nanopore sequencing and assembly of a human genome with ultra-long reads.
Jain et al. Nature Biotechnology (2018)
33. • Prediction: 30x raw UL coverage == GRCh38
How much do we need?
Nanopore sequencing and assembly of a human genome with ultra-long reads.
Jain et al. Nature Biotechnology (2018)
34. • 30x Nanopore ultra-long
• Contig building
• 60x PacBio
• Polishing
• 50x 10x Genomics
• Polishing
• BioNano
• Structural validation
We need long reads. Lots of long reads
35. • Nanopore UL read length distribution is long tailed
It pays to go deep
36. • From May 1 – October 29, 2018
• 62 MinION/GridION flow cells
• 8.9M reads, 98 Gb, 1.6 Gb / cell
• N50 read length 76 kb
• 44 Gb in reads >100 kb
• Max read length 1.03 Mb
• Assembled with Canu
CHM13 sequencing
Now upwards of 90+ flow cells and counting…
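The run statistics above (yield, N50 read length, bases in reads over 100 kb) can be computed from a list of read lengths. The following is a minimal sketch with a hypothetical function name; the quoted per-cell yield is consistent with simple division, since 98 Gb over 62 flow cells is about 1.6 Gb per cell.

```python
def read_stats(lengths, long_threshold=100_000):
    """Summarize a sequencing run from its read lengths (in bases).

    N50 is the length L such that reads of length >= L contain at
    least half of all sequenced bases.
    """
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "reads": len(lengths),
        "total_bases": total,
        "n50": n50,
        "max": max(lengths, default=0),
        "bases_over_threshold": sum(l for l in lengths if l > long_threshold),
    }


if __name__ == "__main__":
    # Toy read lengths, not the CHM13 data.
    print(read_stats([10, 20, 30, 40], long_threshold=25))
```

With a long-tailed length distribution, "bases_over_threshold" grows with depth even when N50 stays flat, which is the sense in which it pays to sequence deeply.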
45. • Mappers prefer the “best” alignment
• Consensus can be of variable quality (patches)
• Best mapping not always the correct mapping
• Marker-based anchoring
• Increase number of secondary alignments returned
• Redefine mapping quality to measure single-copy k-mer
agreement between read and assembly
Unique k-mer mapping
Before:
After:
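One way to picture the marker-based idea on this slide: collect k-mers that occur exactly once in the assembly, then score each candidate placement of a read by how many of the single-copy k-mers in that window the read also contains. This is a toy sketch under strong assumptions (forward strand only, exact k-mer matches, candidate positions supplied by some upstream mapper); it is not the consortium's pipeline, and all names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """All k-mers of seq, in order, forward strand only."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def single_copy_kmers(assembly, k):
    """k-mers that occur exactly once in the assembly (the 'markers')."""
    counts = Counter(kmers(assembly, k))
    return {kmer for kmer, count in counts.items() if count == 1}


def marker_agreement(read, assembly, start, markers, k):
    """Fraction of marker k-mers in the candidate window that the read shares."""
    window = assembly[start:start + len(read)]
    window_markers = set(kmers(window, k)) & markers
    if not window_markers:
        return 0.0
    return len(window_markers & set(kmers(read, k))) / len(window_markers)


def best_placement(read, assembly, candidates, k):
    """Pick the candidate start position with the highest marker agreement."""
    markers = single_copy_kmers(assembly, k)
    return max(candidates, key=lambda s: marker_agreement(read, assembly, s, markers, k))


if __name__ == "__main__":
    repeat = "ACGTACGT"
    assembly = "GGA" + repeat + "TTC" + repeat + "CAG"
    read = repeat + "CAG"  # drawn from the second repeat copy
    print(best_placement(read, assembly, candidates=[3, 14], k=4))
```

The repeat-internal k-mers contribute nothing to the score, so the placement is decided only by sequence that is unique in the assembly.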
49. • Almost!
• Have proven it’s possible for the X chromosome
• T2T assembly of all chrs within the next 2 years
• Challenges
• REPEATS, REPEATS, REPEATS
• Heterozygosity: diploids, polyploids, metagenomes
• Nanopore-only consensus quality
• Targeted long-read sequencing
Are we there yet?
50. • github.com/nanopore-wgs-consortium/chm13
• Draft whole-genome assemblies
• Nanopore ultra-long reads
• 10x Genomics reads
• BioNano DLS (WashU)
• PacBio (SRA)
• Coming soon:
• Arima Genomics Hi-C
• PacBio CCS
• Strand-Seq
All CHM13 data is openly released
51. NHGRI
• Sergey Koren
• Arang Rhie
• Jim Mullikin
• Alice Young
• Shelise Brooks
• Valerie Maduro
• Gerard Bouffard
• Sofia Barreira
• Andy Baxevanis
• Nancy Hansen
• Karen Miga, UCSC
• Jennifer Gerton, Stowers
• Tamara Potapova, Stowers
• Beth Sullivan, Duke
• Tina Graves Lindsay, WashU
• Ira Hall, WashU
• Valerie Schneider, NCBI
• Kerstin Howe, Sanger
• Jo Wood, Sanger
• Matt Loose, Nottingham
• Nick Loman, Birmingham
• Urvashi Surti, Pitt (ret.)
Acknowledgements
Evan Eichler, Mitchel Vollger, Glennis Logsdon, David Porubsky, Melanie Sorensen
52. It’s time to finish the human genome
Google “t2t consortium” – I’ll be hiring in the fall
The Telomere-to-Telomere (T2T) consortium is an
open, community-based effort to generate the
first complete assembly of a human genome.