2. The Human Reference is a Work in Progress!
• The current reference, GRCh38, is not optimal for some regions of the genome and/or some individuals/ancestries.
• GRCh38 is composed of DNA from several individual humans.
• Allelic diversity and structural variation present major challenges when assembling a representative diploid genome.
• New technologies, methods, and resources since 2003 have allowed for substantial improvements in the reference genome.
• Additional high-quality reference sequences are needed to represent the full range of genetic diversity in humans.
6. Definitions of Genome Level
• Platinum Genome
  • Haploid genome source
  • Contiguous, haplotype-resolved representation of the entire genome
  • BAC library available
• Gold Genome
  • Diploid genome source
  • Part of a trio
    • Parents will be sequenced to help haplotype-resolve some regions
  • BAC libraries available
    • Targeted regions sequenced using these BAC libraries
  • Will contain some haplotype-resolved regions
7. CHM1: A Key Resource for Improving the Reference
• CHM1 cell line established from a haploid hydatidiform mole (complete, paternal; 46XX) (U. Surti)
• CHORI-17 BAC library (P. de Jong)
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1,560 FPC contigs)
• CHORI-17 BACs
  • >750 have been sequenced
  • 664 of them in GenBank as phase 3 sequence
• CHM1 WGS assembly
  • Initial assembly produced from >100X coverage of Illumina data
  • Initial PacBio assembly produced using ~54X of P5/C3 PacBio data
  • Latest PacBio assembly produced using ~60X of P6/C4 PacBio data
8. Assembly Assessment Methods
• Assemblies run through NCBI QA pipeline
  • Assessed for contiguity, annotation, and concordance with the finished BACs
  • Assembly-to-assembly alignments can be generated between each PacBio assembly and GRCh38
• BioNano Genome Map
  • SV calls generated by comparing the BioNano data to each of the assemblies
  • Hybrid scaffolding conflicts will also point out potential assembly errors
• Alignment of the Illumina reads back to each of the assemblies
  • Heterozygous calls are likely indicative of a collapse in the assembly (for the haploid genomes)
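The heterozygous-call check above can be sketched in a few lines. This is only an illustration, not the pipeline used for these assemblies: the input format (contig, 1-based position, genotype string), the 10 kb window, and the threshold of three calls are all assumptions chosen for the example.

```python
from collections import Counter

def flag_collapse_windows(calls, window=10_000, min_hets=3):
    """Flag windows of a haploid assembly that carry clustered heterozygous calls.

    calls: iterable of (contig, pos, genotype), with 1-based pos and a
    VCF-style genotype string such as "0/1" (hypothetical input format).
    Returns sorted (contig, window_start, window_end, het_count) tuples.
    """
    counts = Counter()
    for contig, pos, genotype in calls:
        # Split phased or unphased genotypes and ignore missing alleles.
        alleles = [a for a in genotype.replace("|", "/").split("/") if a != "."]
        if len(set(alleles)) > 1:  # more than one distinct allele: heterozygous
            counts[(contig, (pos - 1) // window)] += 1
    return sorted(
        (contig, idx * window + 1, (idx + 1) * window, n)
        for (contig, idx), n in counts.items()
        if n >= min_hets
    )

# Toy example: three het calls cluster in the first 10 kb of ctg1.
example_calls = [
    ("ctg1", 100, "0/1"),
    ("ctg1", 5000, "0/1"),
    ("ctg1", 9999, "1/2"),
    ("ctg1", 15000, "0/1"),
    ("ctg2", 10, "1/1"),
]
flagged = flag_collapse_windows(example_calls)
```

In a haploid source such as CHM1, a window flagged this way is a candidate for a collapsed duplication rather than true heterozygosity, so it would still need manual review.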
11. 1q21 Region - GRCh38 vs GCA_001297185
[Figure: alignment of the 1q21 region between GRCh38 and GCA_001297185, shown with the Seg Dup track; scale bar: 1 Megabase]
12. 1q21 Region - GRCh38 vs GCA_001297185
[Figure: alignment of the 1q21 region between GRCh38 and GCA_001297185, shown with the Seg Dup track; labeled identities: 99.9+% and 99.1%; scale bar: 1 Megabase]
13. CHM1 - Next Steps
• Currently running Pilon on GCA_001297185 for improved base-pair accuracy
• Based on alignment of BioNano data as well as comparisons to GRCh38, we will make additional breaks where needed
• Incorporate all finished BACs
• Final alignment to GRCh38 in order to produce chromosome AGPs and submit
15. Genome Status
Data Source   Origin            Level of Coverage   Status
CHM1          NA                Platinum            Assembly Improvement
CHM13         NA                Platinum            In Assembly Queue
NA19240       Yoruban           Gold                Assembly Submission
HG00733       Puerto Rican      Gold                Assessing New Assembly
HG00514       Han Chinese       Gold                Assessing New Assembly**
NA12878       European          Gold                Assessing New Assembly
HG01352       Colombian         Gold                Assessing New Assembly
HG02818       Gambian           Gold                Assembly Underway
HG02059       Kinh-Vietnamese   Gold                In Assembly Queue
NA19434       Luhya             Gold                In Assembly Queue
HG04217       Telugu            Gold                Data Production Underway
**100x coverage was generated for the Han Chinese sample
17. First Gold Genome - NA19240
• NA19240: Yoruban sample
• Generated >70X raw P6/C4 RSII PacBio data
                      Initial Assembly Stats   Latest Assembly Stats
# Seq Contigs         3569                     2889
Max Contig Length     20,393,869 bp            75,769,079 bp
Total Assembly Size   2,745,634,789 bp         2,874,720,146 bp
N50                   6,003,115 bp             26,385,265 bp
N90                   848,151 bp               2,559,914 bp
N95                   345,457 bp               710,070 bp
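For readers unfamiliar with the N50/N90/N95 figures above, here is a minimal sketch of how such contiguity statistics are commonly computed from a list of contig lengths. The function name and the toy lengths are illustrative only; the real NA19240 contig lengths are not reproduced here.

```python
def nx(lengths, x=50):
    """Return the Nx statistic for a list of contig lengths.

    Nx is the length L such that contigs of length >= L together
    contain at least x percent of all assembled bases.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length

# Toy assembly of seven contigs, 300 bp in total.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50 = nx(toy_lengths, 50)
n90 = nx(toy_lengths, 90)
```

With the toy lengths, the two largest contigs already hold half of the 300 bases, so N50 is 70; reaching 90% requires the five largest contigs, so N90 is 30.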
18. Assembly QC and Submission Steps
• Multiple Falcon assemblies
• Using stats and alignment to BioNano, pick the best assembly
• Quiver and Pilon on best assembly
• Use BioNano to identify misassemblies and scaffold assembly
• Submit scaffold-level AGPs to GenBank
• Run through NCBI assembly QA pipeline
• Evaluate and curate output of QA pipeline
• Generate final chromosome-level AGPs and submit
• Annotation of chromosome-level assembly
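As a rough illustration of what a scaffold-level AGP contains, the sketch below writes AGP 2.x style lines for one scaffold built from ordered contigs and gaps. The scaffold and contig identifiers are hypothetical, and the choice of gap type "scaffold" with linkage evidence "map" (for an optical map) is an assumption for the example, not a statement of what was actually submitted.

```python
def scaffold_to_agp(scaffold_id, parts):
    """Build tab-separated AGP lines for a single scaffold.

    parts is an ordered list of either
      ("contig", contig_id, length, orientation)  with orientation "+" or "-"
      ("gap", length)
    Coordinates are 1-based and inclusive, as in the AGP format.
    """
    lines = []
    start = 1
    for number, part in enumerate(parts, start=1):
        if part[0] == "contig":
            _, contig_id, length, orientation = part
            end = start + length - 1
            # "W" marks a WGS contig; the whole contig (1..length) is used.
            fields = [scaffold_id, start, end, number, "W",
                      contig_id, 1, length, orientation]
        else:
            _, length = part
            end = start + length - 1
            # "N" marks a gap of specified size; evidence "map" is assumed here.
            fields = [scaffold_id, start, end, number, "N",
                      length, "scaffold", "yes", "map"]
        lines.append("\t".join(str(f) for f in fields))
        start = end + 1
    return lines

# Hypothetical scaffold: two contigs joined across a 100 bp gap.
agp_lines = scaffold_to_agp(
    "scaf1",
    [("contig", "ctg1", 1000, "+"), ("gap", 100), ("contig", "ctg2", 500, "-")],
)
```

Chromosome-level AGPs follow the same column layout, with scaffolds rather than contigs as the components.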
28. Spanning Reference Gaps
• HG00514 80X assembly
  • Initial assessment found 75 potential gap-spanning contigs
  • On closer inspection, only 32 are real gap-spanning contigs; together they span 40 gaps
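One simple way to shortlist candidate gap-spanning contigs is to require that a contig's alignment to the reference covers a gap plus some anchoring sequence on both sides. The sketch below assumes that simplified model; the tuple layouts, the 1 kb flank, and all coordinates are hypothetical, and real candidates still need the closer inspection described above.

```python
def spanning_contigs(gaps, alignments, min_flank=1000):
    """Map each contig to the reference gaps its alignment spans.

    gaps:       list of (chrom, gap_start, gap_end)
    alignments: list of (contig, chrom, aln_start, aln_end) on the reference
    A gap counts as spanned when the alignment extends at least min_flank
    bases beyond both gap edges (an assumed, simplified criterion).
    """
    spans = {}
    for contig, chrom, aln_start, aln_end in alignments:
        for gap_chrom, gap_start, gap_end in gaps:
            if (chrom == gap_chrom
                    and aln_start <= gap_start - min_flank
                    and aln_end >= gap_end + min_flank):
                spans.setdefault(contig, []).append((gap_chrom, gap_start, gap_end))
    return spans

# Toy data: ctgA spans two gaps, ctgB lacks a left flank, ctgC spans one gap.
toy_gaps = [("chr1", 10000, 20000), ("chr1", 50000, 60000), ("chr2", 5000, 6000)]
toy_alignments = [
    ("ctgA", "chr1", 1000, 70000),
    ("ctgB", "chr1", 9500, 30000),
    ("ctgC", "chr2", 1000, 9000),
]
spans = spanning_contigs(toy_gaps, toy_alignments)
total_gaps_spanned = sum(len(g) for g in spans.values())
```

Because one contig can cover more than one gap, the number of spanned gaps can exceed the number of spanning contigs, as with the 32 contigs and 40 gaps reported for HG00514.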
31. Short Term Future Plans
• Lots of assemblies to analyze!
• Generate the latest Falcon assemblies for all samples
• Improve those assemblies
  • Identifying misassemblies
  • Making the breaks where needed
  • Scaffolding the assemblies
  • Incorporating BACs as they are finished
• Create chromosomal AGPs
• Submit to GenBank
32. Longer Term Future Work
• Better Utilization of the Reference
  • Mapping Strategies
    • Graph-based alignments
    • Other alt-aware read mapping strategies
  • Alternative reference data display challenges: when and how to present data
    • Alt alleles?
    • Full reference sequences
    • Haplo-resolved (10X)?
• Wet Lab Improvements
  • Haplo-resolved strategies (10X)
  • Clone-based work replacements? - Hyb 10X or PacBio?
  • New long read technologies
    • PacBio Sequel
    • Oxford Nanopore
33. Acknowledgements
The McDonnell Genome Institute at
Washington University in St. Louis
Susan Dutcher
Bob Fulton
Wes Warren
Karyn Meltz Steinberg
Derek Albracht
Milinn Kremitzki
Susan Rock
Chad Tomlinson
Patrick Minx
Chris Markovic
Eddie Belter
Lee Trani
Sara Kohlberg
University of Washington
Evan Eichler
NCBI
Valerie Schneider
University of Pittsburgh
School of Medicine
(CHM1 and CHM13 cell line)
Urvashi Surti
BioNano Genomics
Alex Hastie
Pacific Biosciences
Jason Chin
Nick Sisneros
UCSF
Pui-Yan Kwok
Yvonne Lai
Chin Lin
Catherine Chu
NHGRI
Adam Phillippy
Sergey Koren
10X Genomics
Deanna Church
Nationwide Children's Hospital
Richard Wilson
Vince Magrini
Sean McGrath