1. ASHG - GRC Workshop
Tina Lindsay
ASHG Oct 6, 2015
2. The Human Reference is Not Complete
• The reference has been found to be suboptimal in some
regions
• Structural variation makes it difficult to assemble a truly
representative genome when using a diploid sample
• Some regions were recalcitrant to closure with technology
and resources available at the time
• Additional sequences are needed to capture the full range
of diversity in humans
4. Allelic Diversity vs. Segmental Duplication
[Figure: Diploid Genome panel showing Allelic Copies and Repeat Copies (noted by color difference); Haploid Genome panel showing Repeat Copies only (noted by color difference)]
Diploid Genome: With a diploid genome, there is significant ambiguity sorting allelic copies from repeat copies
Haploid Genome: With a haploid genome, allelic differences are eliminated, and base differences are likely indicative of repeat copies
5. Initial Use Of CHM1 Source
• CHORI-17 BAC Library
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1560 fpc contigs)
• CHORI-17 BACs
• > 750 have been sequenced
• 664 of them in GenBank as phase 3
6. SRGAP2 Homology between genes
Shows nearly identical segments between SRGAP2A and the SRGAP2 paralogs
Shows homology between SRGAP2B and SRGAP2C
Dennis et al. 2012
[Figure labels: SRGAP2A, SRGAP2B, SRGAP2C]
9. Current status of CHM1 resources
• CHORI-17 BAC Library (created from CHM1 cell line)
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1560 fpc contigs)
• CHORI-17 BACs (>750 have been sequenced, with 664 of them in
GenBank as phase 3)
• Active cell line
• >100X coverage of Illumina 100 bp reads
• 300 bp, 500 bp, and 3 kb inserts
• Reference assisted assembly CHM1_1.1
• BioNano genome map
• >60X coverage of PacBio long read data (both P5 and P6 chemistries)
• Multiple whole genome assemblies
13. Some of the Targeted Regions
CFHR1
SRGAP2/FAM72
BOLA2/CORO1A/SLX1
ARHGAP11
CHRNA7
GTF2IRD2/GTF2I/NCF1
FRMPD2/PTPN20
GPRIN2/PPYR1
DUSP22
HYDIN
IgH
IgK
IgL
TCRA/B
NBPF
DEFB
MUC5a/b/c
LILR
CCL
FCGR1/HIST2H2B
NOTCH2
14. Genomes Planned
Data Source   Origin of Sample   Coverage Level   Status
CHM1          NA                 Platinum         Assembly QC
CHM13         NA                 Platinum         Assembly QC
NA19240       Yoruban            Gold             In Assembly
HG00733       Puerto Rican       Gold             Data Generation
NA12878       European           Gold             Not Started
HG00514       Han Chinese        Gold             Not Started
NA19434       Luhya              Gold             Not Started
15. CHM13 – 2nd Platinum Genome
• CHM13 – another hydatidiform mole sample
• PacBio data generated
• 60X data was generated using P5 and P6 Chemistry
• Avg read length ~11 kb, longer than original CHM1 data
• Assembly Contig N50 ~13 Mb
• Illumina coverage has been generated to use for assembly QC, SV
detection, and consensus base error correction
• Plan to use BACs to improve the assembly where needed
• Alignment of Assembly to BioNano Genome map
• Currently ~91% of CHM13 assembly aligns to BioNano map
contigs
17. Assembly Assessment Methods
• Assemblies will run through NCBI QA pipeline
• Assessed for contiguity, annotation, and concordance with the
finished BAC paths
• Assembly-to-assembly alignments will be generated between the
PacBio assemblies, as well as against GRCh38
• BioNano Genome Map
• SV calls generated from comparing the BioNano data to each of the
assemblies
• Generating hybrid scaffolds using BioNano data and assembly data
• Alignment of the Illumina reads back to each of the
assemblies
• Heterozygous calls are likely indicative of a collapse in the
assembly (for the single haplotype genomes)
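The heterozygous-call check above can be sketched in a few lines. This is a minimal illustration, not the pipeline used for these assemblies: the function name `flag_collapse_windows`, the 10 kb window, and the threshold of 5 calls are all hypothetical choices, and the input is assumed to be a list of (contig, position) pairs already extracted from variant calls on the aligned Illumina reads.

```python
from collections import Counter

def flag_collapse_windows(het_calls, window=10_000, min_calls=5):
    """Return sorted (contig, window_start) pairs with at least min_calls het calls.

    het_calls: iterable of (contig, position) for heterozygous calls made from
    reads aligned back to a single-haplotype assembly. The window size and
    threshold are illustrative assumptions, not values from the slides.
    """
    # Bin every call into a fixed-size window on its contig and count per bin.
    counts = Counter(
        (contig, (pos // window) * window) for contig, pos in het_calls
    )
    return sorted(key for key, n in counts.items() if n >= min_calls)

# Toy data: six calls clustered at the start of ctg1, two isolated calls elsewhere.
calls = [("ctg1", p) for p in (1200, 1500, 2300, 4100, 8800, 9900)]
calls += [("ctg1", 52_000), ("ctg2", 700)]
flagged = flag_collapse_windows(calls)
print(flagged)
```

In a haploid sample, isolated calls are more plausibly sequencing or alignment noise, so counting calls per window is one simple way to separate noise from a candidate collapsed duplication.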
18. BioNano SV Calls Identified Assembly Problems
[Figure: BioNano Map aligned against the Assembly, with examples labeled Collapse in Assembly, Expansion in Assembly, and Gap in Sequence]
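One way to think about the collapse and expansion labels is as a length comparison over the same interval in the genome map and in the sequence assembly. The sketch below is an assumption-laden illustration rather than BioNano's own SV caller: `classify_interval`, the 10% tolerance, and the 2 kb minimum difference are hypothetical.

```python
def classify_interval(map_len, asm_len, tolerance=0.1, min_diff=2_000):
    """Compare the length of one interval in the genome map and in the assembly.

    map_len: distance (bp) between two matched labels on the genome map.
    asm_len: distance (bp) between the same two labels in the assembly.
    tolerance and min_diff are illustrative thresholds, not published values.
    """
    diff = asm_len - map_len
    # Small differences are treated as measurement noise.
    if abs(diff) < max(min_diff, tolerance * map_len):
        return "concordant"
    # Assembly longer than the map: extra sequence (expansion).
    # Assembly shorter than the map: missing sequence (collapse).
    return "expansion" if diff > 0 else "collapse"

print(classify_interval(50_000, 30_000))
print(classify_interval(50_000, 51_000))
print(classify_interval(20_000, 45_000))
```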
19. CHM13 Hybrid Scaffolds
                      BioNano Map   PacBio Assembly   Hybrid Scaffold
# of Contigs          3593          1590*             254
Min Contig Length     0.08 Mb       0                 0.27 Mb
Median Contig Length  0.61 Mb       0.06 Mb           4.35 Mb
Mean Contig Length    0.78 Mb       1.78 Mb           9.68 Mb
Contig N50            1.02 Mb       13.46 Mb          20.79 Mb
Max Contig Length     5.27 Mb       63.15 Mb          82.83 Mb
Total Contig Length   2812 Mb       2824 Mb           2457.75 Mb
*Number of contigs used in hybrid scaffolding
57 PacBio contigs and 67 BioNano contigs were identified as conflicts during this process
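As a quick consistency check on the table, the mean contig length in each column should equal the total contig length divided by the number of contigs. The short sketch below uses only the values from the table; the dictionary layout and variable names are illustrative.

```python
# (number of contigs, total contig length in Mb), copied from the table above.
rows = {
    "BioNano map": (3593, 2812.0),
    "PacBio assembly": (1590, 2824.0),
    "Hybrid scaffold": (254, 2457.75),
}

# Mean contig length in Mb, rounded to two decimals as in the table.
means = {name: round(total / n, 2) for name, (n, total) in rows.items()}

for name, mean in means.items():
    print(f"{name}: {mean} Mb")
```

The computed means agree with the reported 0.78 Mb, 1.78 Mb, and 9.68 Mb.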
21. NA19240 Initial Assembly Stats
Initial Assembly Stats
# Seq Contigs 3569
Max Contig Length 20,393,869 bp
Total Assembly Size 2,745,634,789 bp
N50 6,003,115 bp
N90 848,151 bp
N95 345,457 bp
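For readers unfamiliar with the N50, N90, and N95 figures, they can be computed from a list of contig lengths as follows. This is a generic sketch of the standard Nx definition using toy lengths, not the NA19240 contigs; the function name `nx` is illustrative.

```python
def nx(lengths, x):
    """Return the Nx statistic for a list of contig lengths.

    Nx is the length L of the contig at which the running sum of contigs,
    taken from longest to shortest, first reaches x percent of the total.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point edge cases at the threshold.
        if running * 100 >= total * x:
            return length
    raise ValueError("lengths must contain at least one contig")

# Toy contig lengths (total = 300).
toy = [80, 70, 50, 40, 30, 20, 10]
print(nx(toy, 50), nx(toy, 90), nx(toy, 95))
```

Because N90 and N95 require covering more of the assembly, they are always less than or equal to N50, which is the pattern seen in the stats above.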
22. Future Directions
• Identification of the best assembly for CHM1 and CHM13
• Integration of targeted BACs into the whole genome assembly
• Improvement of the assemblies through scaffolding and making
breaks in the assemblies where needed
• Continue to add diversity to the reference by sequencing
new samples that provide additional diversity to GRCh38
• Additional collaborations with the community to develop
tools to more fully utilize the full reference assembly
(alternate haplotypes)
23. Acknowledgements
The Genome Institute at Washington
University in St. Louis
Rick Wilson
Bob Fulton
Wes Warren
Karyn Meltz Steinberg
Vince Magrini
Derek Albracht
Milinn Kremitzki
Susan Rock
Debbie Scheer
Aye Wollam
The Finishing and Bioinformatics Teams
at The Genome Institute
University of Washington
Evan Eichler
Megan Dennis
Xander Nuttle
NCBI
Richa Agarwala
Valerie Schneider
University of Pittsburgh
School of Medicine (CHM1 cell line)
Urvashi Surti
Personalis
Deanna Church
BioNano Genomics
Pacific Biosciences
UCSF
Pui-Yan Kwok
Yvonne Lai
Chin Lin
Catherine Chu
CHORI
Pieter de Jong