Ashg grc workshop2015_tg

ASHG - GRC Workshop
Tina Lindsay
ASHG Oct 6, 2015

The Human Reference is Not Complete
• The reference is not optimal in some regions
• Structural variation makes it difficult to assemble a truly
representative genome from a diploid sample
• Some regions were recalcitrant to closure with the technology
and resources available at the time
• Additional sequences are needed to capture the full range
of human diversity

UGT2B17 – Conflicting Alleles
Xue Y et al., 2008
NCBI36 NC_000004.10 (chr4) tiling path:
AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7
Gene labels: TMPRSS11E, TMPRSS11E2
GRCh37 NC_000004.11 (chr4) tiling path:
AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7
Gene label: TMPRSS11E
GRCh37 NT_167250.1 (UGT2B17 alternate locus):
AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7
Gene label: TMPRSS11E2
Figure label: GAP

Allelic Diversity vs. Segmental Duplication
Diploid Genome (figure: allelic copies and repeat copies; repeat copies noted by color difference)
With a diploid genome, there is significant ambiguity in sorting allelic copies from repeat copies.
Haploid Genome (figure: repeat copies only, noted by color difference)
With a haploid genome, allelic differences are eliminated, and base differences are likely
indicative of repeat copies.
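The point of the two panels can be illustrated with a toy example. The sketch below is not from the talk: the sequences and the helper name `distinct_sequences` are made up for illustration, and it assumes a locus with exactly two repeat copies per haplotype.

```python
def distinct_sequences(haplotypes):
    """Collect the distinct sequences seen across every copy on every haplotype."""
    return {copy for haplotype in haplotypes for copy in haplotype}


# Hypothetical toy locus with two repeat copies per haplotype (sequences are invented).
HAPLOTYPE_A = ("AACT", "ACCT")
HAPLOTYPE_B = ("AACT", "ACGT")  # the second copy carries an allelic difference

# A haploid source (such as a hydatidiform mole) contributes one haplotype only.
haploid_view = distinct_sequences([HAPLOTYPE_A])
# A diploid source contributes both haplotypes, mixed together in the reads.
diploid_view = distinct_sequences([HAPLOTYPE_A, HAPLOTYPE_B])

print(len(haploid_view))  # distinct sequences in the haploid sample
print(len(diploid_view))  # distinct sequences in the diploid sample
```

In the haploid case every distinct sequence corresponds to a repeat copy. In the diploid case the three distinct sequences could be read as three repeat copies or as two copies with one heterozygous site, which is the ambiguity the slide describes.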

Initial Use Of CHM1 Source
• CHORI-17 BAC Library
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1560 fpc contigs)
• CHORI-17 BACs
  • >750 have been sequenced
  • 664 of them in GenBank as phase 3

SRGAP2 Homology between genes
Shows nearly identical segments between SRGAP2A and SRGAP2 paralogs
Shows homology between SRGAP2B and SRGAP2C
Dennis et al., 2012
SRGAP2A
SRGAP2B
SRGAP2C

1q21
1q21 patch alignment to chromosome 1
1q32 1q21 1p21

Williams-Beuren Syndrome region
Slide courtesy of Megan Dennis

Current status of CHM1 resources
• CHORI-17 BAC Library (created from CHM1 cell line)
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1560 fpc contigs)
• CHORI-17 BACs (>750 have been sequenced, with 664 of them in
GenBank as phase 3)
• Active cell line
• >100X coverage Illumina 100 bp reads
  • 300 bp, 500 bp, and 3 kb inserts
• Reference assisted assembly CHM1_1.1
• BioNano genome map
• >60X coverage of PacBio long read data (Both P5 and P6)
• Multiple whole genome assemblies

PacBio CHM1 Assembly Spans GRCh38 Gaps
GRCh38
PacBio CHM1

PacBio CHM1 Assembly Shows Data Not in GRCh38
GRCh38
PacBio CHM1
Second Pass Alignment

Some of the Targeted Regions
CFHR1
SRGAP2/FAM72
BOLA2/CORO1A/SLX1
ARHGAP11
CHRNA7
GTF2IRD2/GTF2I/NCF1
FRMPD2/PTPN20
GPRIN2/PPYR1
DUSP22
HYDIN
IgH
IgK
IgL
TCRA/B
NBPF
DEFB
MUC5a/b/c
LILR
CCL
FCGR1/HIST2H2B
NOTCH2

Genomes Planned
Data Source   Origin of Sample   Coverage Level   Status
CHM1          NA                 Platinum         Assembly QC
CHM13         NA                 Platinum         Assembly QC
NA19240       Yoruban            Gold             In Assembly
HG00733       Puerto Rican       Gold             Data Generation
NA12878       European           Gold             Not Started
HG00514       Han Chinese        Gold             Not Started
NA19434       Luhya              Gold             Not Started

CHM13 – 2nd Platinum Genome
• CHM13 – another hydatidiform mole sample
• PacBio data generated
• 60X data was generated using P5 and P6 Chemistry
• Avg read length ~11kb, longer than original CHM1 data
• Assembly Contig N50 ~13Mb
• Illumina coverage has been generated to use for assembly QC, SV
detection, and consensus base error correction
• Plan to use BACs to improve the assembly where needed
• Alignment of Assembly to BioNano Genome map
• Currently ~91% of CHM13 assembly aligns to BioNano map
contigs

CHM13 Mini-Assemblethon
Assembly                  # of Contigs   Max Contig Length   Contig N50    Total Assembly Size
Falcon                    2873           63,148,543          12,981,785    2,851,367,788
MHAP Default 5% Error     15,538         81,522,549          13,331,528    3,061,261,250
MHAP Conservative 2.5%    10,430         34,039,925          5,550,336     2,996,416,935
MHAP Sensitive 5%         11,138         80,601,297          19,357,701    3,028,933,694
MHAP Sensitive 2.5%       13,500         58,311,553          11,964,038    3,086,573,229

Assembly Assessment Methods
• Assemblies will be run through the NCBI QA pipeline
  • Assessed for contiguity, annotation, and concordance with the
  finished BAC paths
• Assembly-to-assembly alignments will be generated between each PacBio
assembly, as well as against GRCh38
• BioNano Genome Map
  • SV calls generated by comparing the BioNano data to each of the
  assemblies
  • Hybrid scaffolds generated using BioNano data and assembly data
• Alignment of the Illumina reads back to each of the
assemblies
  • Heterozygous calls are likely indicative of a collapse in the
  assembly (for the single-haplotype genomes)
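The last point, using heterozygous calls as a collapse signal in a single-haplotype genome, can be sketched roughly as follows. This is an illustrative assumption rather than the pipeline used in the talk: the function name `flag_collapse_windows`, the 10 kb window, and the threshold of three calls are hypothetical, and the input is a simplified list of (contig, position, genotype) tuples rather than a real VCF.

```python
from collections import Counter


def flag_collapse_windows(calls, window=10_000, min_het=3):
    """Return windows holding at least `min_het` heterozygous calls.

    `calls` is an iterable of (contig, position, genotype) tuples, with the
    genotype written VCF-style, e.g. "0/1" or "0|1". The window size and
    threshold are arbitrary illustrative defaults.
    """
    het_counts = Counter()
    for contig, position, genotype in calls:
        alleles = genotype.replace("|", "/").split("/")
        if len(set(alleles)) > 1:  # more than one allele: heterozygous call
            het_counts[(contig, position // window)] += 1
    return sorted(
        (contig, index * window, (index + 1) * window, count)
        for (contig, index), count in het_counts.items()
        if count >= min_het
    )


# Made-up calls from aligning short reads back to a haploid assembly.
example_calls = [
    ("ctg1", 1_200, "0/1"),
    ("ctg1", 3_400, "0/1"),
    ("ctg1", 9_100, "0|1"),
    ("ctg1", 25_000, "1/1"),  # homozygous, so it is ignored
    ("ctg2", 500, "0/1"),     # a single call stays below the threshold
]
flagged = flag_collapse_windows(example_calls)
print(flagged)
```

In a haploid sample no true heterozygosity is expected, so a cluster of heterozygous calls suggests that reads from two or more repeat copies were aligned to a single collapsed copy in the assembly. A fuller check would likely also look at read depth in the flagged windows.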

BioNano SV Calls Identified Assembly Problems
Figure panels: Collapse in Assembly; Expansion in Assembly; Gap in Sequence
Figure tracks: Assembly; BioNano Map

CHM13 Hybrid Scaffolds
Metric                 BioNano Map   PacBio Assembly   Hybrid Scaffold
# of Contigs           3593          1590 *            254
Min Contig Length      0.08 Mb       0                 0.27 Mb
Median Contig Length   0.61 Mb       0.06 Mb           4.35 Mb
Mean Contig Length     0.78 Mb       1.78 Mb           9.68 Mb
Contig N50             1.02 Mb       13.46 Mb          20.79 Mb
Max Contig Length      5.27 Mb       63.15 Mb          82.83 Mb
Total Contig Length    2812 Mb       2824 Mb           2457.75 Mb
* Number of contigs used in hybrid scaffolding
57 PacBio contigs and 67 BioNano contigs were identified as conflicts during this process
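As a quick consistency check on the table above, the mean contig length in each column should equal the total contig length divided by the number of contigs. The short sketch below only reuses the figures from the table; the variable and function names are illustrative.

```python
# (number of contigs, total contig length in Mb, reported mean length in Mb)
HYBRID_TABLE = {
    "BioNano Map": (3593, 2812.0, 0.78),
    "PacBio Assembly": (1590, 2824.0, 1.78),
    "Hybrid Scaffold": (254, 2457.75, 9.68),
}


def mean_length_mb(contig_count, total_length_mb):
    """Mean contig length in Mb, rounded to two decimals like the table."""
    return round(total_length_mb / contig_count, 2)


computed_means = {
    name: mean_length_mb(count, total)
    for name, (count, total, _reported) in HYBRID_TABLE.items()
}
for name, (_count, _total, reported) in HYBRID_TABLE.items():
    print(name, computed_means[name], reported)
```

The computed means agree with the reported ones, which also confirms that the PacBio mean is based on the 1590 contigs used in hybrid scaffolding.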

CHM13 Hybrid Scaffold
Hybrid Scaffold
PacBio Contigs
BioNano Contigs

NA19240 Initial Assembly Stats
Metric                Value
# Seq Contigs         3569
Max Contig Length     20,393,869 bp
Total Assembly Size   2,745,634,789 bp
N50                   6,003,115 bp
N90                   848,151 bp
N95                   345,457 bp
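For readers less familiar with the N50, N90, and N95 statistics quoted above: Nx is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches x% of the total assembly size. A minimal sketch, using made-up contig lengths rather than the NA19240 data and a hypothetical helper name `nx`:

```python
def nx(contig_lengths, x=50):
    """Return the Nx statistic for a list of contig lengths.

    Contigs are sorted longest first; the result is the length of the contig
    at which the cumulative length first reaches x% of the total.
    """
    ordered = sorted(contig_lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length
    raise ValueError("contig_lengths must not be empty")


# Toy contig lengths (arbitrary units), not taken from the slides.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(nx(toy_lengths, 50), nx(toy_lengths, 90), nx(toy_lengths, 95))
```

Because N90 and N95 require covering more of the assembly, they reach further down into the short contigs, which is why the reported values fall from N50 to N90 to N95.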

Future Directions
• Identification of the best assembly for CHM1 and CHM13
• Integration of targeted BACs into the whole genome assembly
• Improvement of the assemblies through scaffolding and making
breaks in the assemblies where needed
• Continue to add diversity to the reference by sequencing
new samples that provide additional diversity to GRCh38
• Additional collaborations with the community to develop
tools to more fully utilize the full reference assembly
(alternate haplotypes)

Acknowledgements
The Genome Institute at Washington
University in St. Louis
Rick Wilson
Bob Fulton
Wes Warren
Karyn Meltz Steinberg
Vince Magrini
Derek Albracht
Milinn Kremitzki
Susan Rock
Debbie Scheer
Aye Wollam
The Finishing and Bioinformatics Teams
at The Genome Institute
University of Washington
Evan Eichler
Megan Dennis
Xander Nuttle
NCBI
Richa Agarwala
Valerie Schneider
University of Pittsburgh
School of Medicine (CHM1 cell line)
Urvashi Surti
Personalis
Deanna Church
BioNano Genomics
Pacific Biosciences
UCSF
Pui-Yan Kwok
Yvonne Lai
Chin Lin
Catherine Chu
CHORI
Pieter de Jong
