AGBT2017 Reference Workshop: Lindsay

Creating Reference-Grade Human Genome Assemblies
Tina Graves Lindsay
Reference Genome Workshop at AGBT
Feb 13, 2017

The Human Reference is a Work in Progress!
• The current reference, GRCh38, is not optimal for some
regions of the genome and/or some individuals/ancestries.
• GRCh38 is composed of DNA from several individual humans.
• Allelic diversity and structural variation present major
challenges when assembling a representative diploid genome.
• New technologies, methods, and resources since 2003 have
allowed for substantial improvements in the reference genome.
• Additional high-quality reference sequences are needed to
represent the full range of genetic diversity in humans.

UGT2B17 – Conflicting Alleles
[Figure: chr4 tiling paths across the UGT2B17 region in three assembly representations (Xue Y et al, 2008); one position in the figure is labeled GAP]
• NCBI36 NC_000004.10 (chr4) tiling path: AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7 (genes shown: TMPRSS11E, TMPRSS11E2)
• GRCh37 NC_000004.11 (chr4) tiling path: AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7 (gene shown: TMPRSS11E)
• GRCh37 NT_167250.1 (UGT2B17 alternate locus): AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7 (gene shown: TMPRSS11E2)

Definitions of Genome Level
• Platinum Genome
  • Haploid genome source
  • Contiguous, haplotype-resolved representation of entire genome
  • BAC library available
• Gold Genome
  • Diploid genome source
  • Part of a trio
    • Parents will be sequenced to help haplotype-resolve some
    regions
  • BAC libraries available
    • Targeted regions sequenced using these BAC libraries
  • Will contain some haplotype-resolved regions

CHM1: A Key Resource for Improving the Reference
• CHM1 cell line established from a haploid hydatidiform
mole (complete, paternal; 46,XX) (U. Surti)
• CHORI-17 BAC library (P. de Jong)
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1,560 fpc contigs)
• CHORI-17 BACs
  • >750 have been sequenced
  • 664 of them in GenBank as phase 3 sequence
• CHM1 WGS assembly
  • Initial assembly produced from >100X coverage of Illumina data
  • Initial PacBio assembly produced using ~54X of P5/C3 PacBio data
  • Latest PacBio assembly produced using ~60X of P6/C4 PacBio data

Assembly Assessment Methods
• Assemblies run through NCBI QA pipeline
• Assessed for contiguity, annotation, and concordance with the
finished BACs
• Assembly-to-assembly alignments can be generated between each PB
assembly and GRCh38
• BioNano Genome Map
• SV calls generated from comparing the BioNano data to each of the
assemblies
• Hybrid scaffolding conflicts will also point out potential assembly
errors
• Alignment of the Illumina reads back to each of the
assemblies
• Heterozygous calls are likely indicative of a collapse in the
assembly (for the haploid genomes)
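The heterozygous-call check above can be sketched in a few lines. This is a minimal illustration, not the pipeline used for these assemblies: it assumes variant calls have already been reduced to (contig, position) pairs for heterozygous sites, and the function name, the 10 kb window, and the threshold of 5 calls are all hypothetical choices.

```python
from collections import Counter

def flag_collapsed_windows(het_calls, window=10_000, min_hets=5):
    """Return sorted (contig, window_start) pairs with at least min_hets
    heterozygous calls. het_calls is an iterable of (contig, position)."""
    counts = Counter(
        (contig, (pos // window) * window) for contig, pos in het_calls
    )
    return sorted(key for key, n in counts.items() if n >= min_hets)

# Hypothetical het calls from Illumina reads aligned to a haploid assembly.
calls = [("ctg1", p) for p in (1_200, 1_900, 3_400, 5_100, 8_800)]
calls.append(("ctg2", 50_000))

flagged = flag_collapsed_windows(calls)
print(flagged)  # candidate collapsed regions to inspect by hand
```

In a haploid source such as CHM1, a cluster of heterozygous calls suggests that reads from two or more paralogous copies were piled onto one assembled copy, so flagged windows are candidates for closer inspection rather than confirmed errors.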

Hybrid Scaffolds – PacBio and BioNano
Assembly | Seq Assem: # of Contigs | Seq Assem: Contig N50 (Mb) | Seq Assem: Total Size (Gb) | BN Hybrid: # of Scaffolds | BN Hybrid: Scaff N50 (Mb) | BN Hybrid: Total Size (Gb)
CHM1 (P6) GCA_001297185, MGI CHM1 map (Jason’s version) | 3641 | 26.9 | 2.99 | 161 | 47.6 | 2.84
CHM1 (P6) GCA_001307025, MGI CHM1 Map (Adam’s version) | 4850 | 20.6 | 2.94 | 221 | 40.04 | 2.82

[Figure: two hybrid scaffolds, each shown with its component PacBio contigs and BioNano contigs]

1q21 Region – GRCh38 vs GCA_001297185
[Figure: GRCh38 compared with GCA_001297185 across 1q21, with a Seg Dup track; scale bar 1 Megabase]

1q21 Region - GRCh38 vs GCA_001297185
[Figure: GRCh38 compared with GCA_001297185 across 1q21, with a Seg Dup track; aligned blocks labeled 99.9+% identity and 99.1% identity; scale bar 1 Megabase]

CHM1 – Next Steps
• Currently running Pilon on GCA_001297185, for improved
base pair accuracy
• Based on alignment of BioNano data as well as
comparisons to GRCh38, we will make additional breaks
where needed
• Incorporate all finished BACs
• Final alignment to GRCh38 in order to produce
chromosome AGPs and submit

Genome Status
Data Source | Origin | Level of Coverage | Status
CHM1 | NA | Platinum | Assembly Improvement
CHM13 | NA | Platinum | In Assembly Queue
NA19240 | Yoruban | Gold | Assembly Submission
HG00733 | Puerto Rican | Gold | Assessing New Assembly
HG00514 | Han Chinese | Gold | Assessing New Assembly**
NA12878 | European | Gold | Assessing New Assembly
HG01352 | Colombian | Gold | Assessing New Assembly
HG02818 | Gambian | Gold | Assembly Underway
HG02059 | Kinh-Vietnamese | Gold | In Assembly Queue
NA19434 | Luhya | Gold | In Assembly Queue
HG04217 | Telugu | Gold | Data Production Underway
**100x coverage was generated for the Han Chinese sample

Assembly Stats
Genome | Total Size (older version Falcon) | # Contigs (older version Falcon) | Contig N50 (older version Falcon) | Contig N50 (newer version Falcon)
NA19240 | 2.75 Gb | 3569 | 6.0 Mb | 26.4 Mb
HG00733 | 2.84 Gb | 3715 | 7.6 Mb | 22-23 Mb
NA12878 | 2.80 Gb | 4412 | 4.49 Mb | 14-15 Mb
HG01352 | 2.85 Gb | 4080 | 8.22 Mb | 20-24 Mb
HG00514 | 2.85 Gb | 2808 | 10.0 Mb | 22-24 Mb
HG02818 | 2.82 Gb | 3300 | 7.24 Mb | Assembly underway

First Gold Genome - NA19240
• NA19240 – Yoruban sample
• Generated >70X raw P6/C4 RSII PacBio data
Statistic | Initial Assembly Stats | Latest Assembly Stats
# Seq Contigs | 3569 | 2889
Max Contig Length | 20,393,869 bp | 75,769,079 bp
Total Assembly Size | 2,745,634,789 bp | 2,874,720,146 bp
N50 | 6,003,115 bp | 26,385,265 bp
N90 | 848,151 bp | 2,559,914 bp
N95 | 345,457 bp | 710,070 bp
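For readers less familiar with the N50, N90, and N95 statistics reported here, the following sketch shows the usual definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches x% of the assembly size. The function name and the toy contig lengths are illustrative only and are not taken from the NA19240 assembly.

```python
def nx(lengths, x=50):
    """Return the Nx contig length: the length of the contig at which the
    cumulative size, summed from the longest contig down, first reaches
    x percent of the total assembly size."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length

# Toy contig lengths (total 300), not real assembly data.
toy = [80, 70, 50, 40, 30, 20, 10]
n50, n90, n95 = nx(toy, 50), nx(toy, 90), nx(toy, 95)
print(n50, n90, n95)
```

Because N90 and N95 depend on the short tail of the contig distribution, they tend to respond more than N50 to how many small contigs an assembly retains.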

Assembly QC and Submission Steps
[Flowchart, steps in order]
• Multiple Falcon assemblies
• Using stats and alignment to Bionano, pick the best assembly
• Quiver and Pilon on best assembly
• Use Bionano to identify misassemblies and scaffold assembly
• Submit scaffold-level AGPs to GenBank
• Run through NCBI assembly QA pipeline
• Evaluate and curate output of QA pipeline
• Generate final chromosome-level AGPs and submit
• Annotation of chromosome-level assembly
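As an illustration of what a scaffold-level AGP describes, the sketch below lays out contigs and gaps along one scaffold and emits tab-separated rows in the nine-column layout of the AGP 2.0 format as I understand it (component rows of type W, gap rows of type N with gap type "scaffold" and linkage evidence "map"). The scaffold name, contig names, lengths, and gap size are hypothetical, header comment lines are omitted, and a real submission should be checked against the current AGP specification and validator.

```python
def scaffold_agp(scaffold, parts):
    """Build AGP-style rows for one scaffold.

    parts is an ordered list of ("contig", name, length, orientation)
    or ("gap", length) tuples. Coordinates are 1-based and inclusive.
    """
    rows, pos = [], 1
    for number, part in enumerate(parts, start=1):
        if part[0] == "contig":
            _, name, length, orient = part
            fields = [scaffold, pos, pos + length - 1, number,
                      "W", name, 1, length, orient]
        else:
            _, length = part
            # Gap sized from a non-sequence map (assumed: an optical map).
            fields = [scaffold, pos, pos + length - 1, number,
                      "N", length, "scaffold", "yes", "map"]
        rows.append("\t".join(str(f) for f in fields))
        pos += length
    return rows

# Hypothetical layout: two contigs joined across a 250 bp gap.
rows = scaffold_agp("scaf1", [
    ("contig", "ctgA", 1000, "+"),
    ("gap", 250),
    ("contig", "ctgB", 500, "-"),
])
print("\n".join(rows))
```

The same row structure applies at chromosome level, where the objects are chromosomes and the components are scaffolds or contigs.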

Hybrid Stats
Genome | Seq Assem: # of Contigs | Seq Assem: Contig N50 (Mb) | Seq Assem: Total Size (Gb) | BN Hybrid: # of Scaffolds | BN Hybrid: Scaffold N50 (Mb) | BN Hybrid: Total Size (Gb)
NA19240 | 2889 | 26.3 | 2.87 | 218 | 39.9 | 2.82
NA12878 | 3551 | 15.1 | 2.86 | 270 | 28.7 | 2.83
HG00514 | 3190 | 24.2 | 2.88 | 208 | 37.0 | 2.83

NA19240 Assembly Assessment
Call type | Initial Calls | Breaks made
Conflicts | 51 | 35
Translocation SV | 321 | 16
Complex | 123 | 9
Nucmer Alignments | | 9
Total breaks made | | 69

Stage | Contig # | Contig N50 | Total Assembly Size
Before Breaks | 2889 | 26.4 Mb | 2.87 Gb
After Breaks | 2951 | 25.7 Mb | 2.87 Gb
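Making a break means splitting a contig at a coordinate where the evidence says two unrelated sequences were joined. A minimal sketch of that operation follows; the function name, the piece-naming scheme, and the toy sequence are hypothetical, and choosing the breakpoints from the Bionano and Nucmer evidence is the hard part that this sketch does not cover.

```python
def break_contig(name, sequence, breakpoints):
    """Split a contig at 0-based breakpoints and return (piece_name, piece)
    tuples. Breakpoints at or beyond the contig ends are ignored."""
    cuts = sorted({b for b in breakpoints if 0 < b < len(sequence)})
    edges = [0] + cuts + [len(sequence)]
    return [
        (f"{name}_{i}", sequence[start:end])
        for i, (start, end) in enumerate(zip(edges, edges[1:]), start=1)
    ]

# Toy contig with two breakpoints; real contigs are megabases long.
pieces = break_contig("ctg7", "ACGTACGTAC", [4, 8])
print(pieces)
```

Splitting long contigs replaces them with shorter pieces, which is consistent with the contig count rising and the contig N50 falling slightly after the breaks while the total assembly size stays the same.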

Chimeric PacBio Contig
[Figure: a single NA19240 contig aligned to both GRCh38 chr 1 and GRCh38 chr 4, each panel with a Segmental Duplications track]

NA19240 Bionano Map Compared to GRCh38
SV Type | Number of Calls
Insertion | 1795
Deletion | 756
End | 71
Inversions | 8
Complex | 62
Translocations | 6

NA19240 Inversion Compared to GRCh38
[Figure: NA19240 Bionano contigs aligned to GRCh38, showing an inversion]

NA19240 MHC Region
[Figure: Bionano contigs aligned to GRCh38 across the MHC region]

NA19240 MHC Region
[Figure: NA19240 compared with the reference and its alts across the MHC region; a ~65 kb insertion is marked]

Finished BACs Resolve This Region
[Figure: GRCh38, the PB assembly, BAC alignments, and a Seg Dup track for this region]

Spanning Reference Gaps
• HG00514 80X assembly
• Initial assessment had 75 potential gap spanning contigs
• On closer inspection, only 32 are true gap-spanning contigs; together
they span 40 gaps
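The slides do not state the criteria used to separate true from false gap spanners. As one plausible illustration only, the sketch below accepts a contig as spanning a reference gap when it has enough aligned sequence on both flanks of the gap after ignoring alignments that fall in segmental duplications. The function name, the tuple layout, the 5 kb anchor threshold, and the coordinates are all assumptions.

```python
def spans_gap(alignments, gap_start, gap_end, min_anchor=5_000):
    """Check whether a contig is anchored on both flanks of a reference gap.

    alignments is a list of (ref_start, ref_end, in_segdup) tuples for one
    contig, with half-open reference coordinates. Alignments inside
    segmental duplications are not counted as anchors.
    """
    left = sum(
        min(end, gap_start) - start
        for start, end, dup in alignments
        if not dup and start < gap_start
    )
    right = sum(
        end - max(start, gap_end)
        for start, end, dup in alignments
        if not dup and end > gap_end
    )
    return left >= min_anchor and right >= min_anchor

# Hypothetical gap at 100,000-105,000 on a reference chromosome.
true_case = [(90_000, 100_000, False), (105_000, 115_000, False)]
false_case = [(92_000, 100_000, False), (105_000, 109_000, True)]
print(spans_gap(true_case, 100_000, 105_000))
print(spans_gap(false_case, 100_000, 105_000))
```

In the second case the only alignment past the gap lies in a segmental duplication, so it may be a paralogous placement rather than evidence that the contig crosses the gap.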

True Gap Spanner
[Figure: an HG00514 contig aligned to GRCh38 across a reference gap]

False Gap Spanner
[Figure: a contig with a false alignment and a true alignment, shown with a Seg Dup track; figure labels: 7 kb, 3 kb, 10 kb]

Short Term Future Plans
• Lots of assemblies to analyze!
• Generate the latest Falcon assemblies for all samples
• Improve those assemblies
• Identifying misassemblies
• Making the breaks where needed
• Scaffolding the assemblies
• Incorporating BACs as they are finished
• Create Chromosomal AGPs
• Submit to GenBank

Longer Term Future Work
• Better Utilization of the Reference
• Mapping Strategies
• Graph-based alignments
• Other alt-aware read mapping strategies
• Alternative reference data display challenges – When and how to
present data
• Alt alleles?
• Full reference sequences
• Haplo-resolved (10X)?
• Wet Lab Improvements
• Haplo-resolved strategies (10X)
• Clone-based work replacements? Hyb 10X or PacBio?
• New long read technologies
• PacBio Sequel
• Oxford Nanopore

Acknowledgements
The McDonnell Genome Institute at Washington University in St. Louis: Susan Dutcher, Bob Fulton, Wes Warren, Karyn Meltz Steinberg, Derek Albracht, Milinn Kremitzki, Susan Rock, Chad Tomlinson, Patrick Minx, Chris Markovic, Eddie Belter, Lee Trani, Sara Kohlberg
University of Washington: Evan Eichler
NCBI: Valerie Schneider
University of Pittsburgh School of Medicine (CHM1 and CHM13 cell line): Urvashi Surti
BioNano Genomics: Alex Hastie
Pacific Biosciences: Jason Chin, Nick Sisneros
UCSF: Pui-Yan Kwok, Yvonne Lai, Chin Lin, Catherine Chu
NHGRI: Adam Phillippy, Sergey Koren
10X Genomics: Deanna Church
Nationwide Children’s Hospital: Richard Wilson, Vince Magrini, Sean McGrath
