This document discusses the reference genome assembly and how it is changing. It provides an overview of why the reference assembly matters, how the assembly is constructed and updated, and tools for finding assembly and variation data. Key points include: the assembly is a model that may have gaps; the human reference assembly has been updated several times; alternate loci are used to represent structural variants and haplotypes; and ongoing work involves adding novel sequence and fixing rare incorrect bases or assembly problems.
3. Variation Resources Team at NCBI
Ming Ward
Lon Phan
Brad Holmes
Anna Glodek
Michael Kholodov
Rama Maiti
Juliana Sampson
David Shao
Eugene Shekhtman
Qiang Wang
Hua Zhang
Donna Maglott
Melissa Landrum
Jennifer Lee
George Riley
Ray Tully
Craig Wallin
Shanmuga Chitipiralla
Douglas Hoffman
Wonhee Jang
Ken Katz
Michael Ovetsky
Ricardo Villamarin
Tim Hefferon
John Lopez
John Garner
Chao Chen
4. Learning Objectives
Why the reference assembly matters for your analysis
How the reference assembly is changing
Tools and Resources to find data
24. Build sequence contigs based on contigs defined in TPF (Tiling Path File).
Check for orientation consistencies
Select switch points
Instantiate sequence for further analysis
Switch point
Consensus sequence
27. NCBI35 (hg17) Tiling Path
GRCh37 (hg19) Tiling Path
Gap Inserted
Moved approximately 2 Mb distal on chr15
NC_000015.8 (chr15)
NC_000015.9 (chr15)
Removed from assembly
Added to assembly
HG-24
28. Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
New Assembly model: represent both haplotypes
41. GenBank vs RefSeq
GenBank | RefSeq
Submitter Owned | RefSeq Owned
Redundancy | Non-Redundant
Updated rarely | Curated
INSDC | Not INSDC
BRCA1 records in GenBank: 83 genomic, 31 mRNA, 27 protein
BRCA1 records in RefSeq: 3 genomic, 5 mRNA, 1 RNA, 5 protein
49. Hydin: chr16 (16q22.2)
Hydin2: chr1 (1q21.1)
Missing in NCBI35/NCBI36; Unlocalized in GRCh37; Finished in GRCh38
Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID (Paralogous)
Alignment to Hydin1 CHM1_1.0, >99.9% ID (Allelic)
Doggett et al., 2006
65. dbSNP Build 138 based on annotation run 104
Model based paralogous sequence differences, NCBI annotation run #
Paralogous/pseudo gene alignments, NCBI annotation run #
Singly Unique Nucleotide (SUN) map, Sudmant 2010
ClinVar Long Variations
GRC Curation Issues
ClinVar Short Variations
Editor's Notes
Signpost for biological knowledge: ideogram + list of tracks.
To address assembly issues, the GRC centralizes the production of the reference assembly. This gives the community a single point of contact for reporting problems and finding information about the assembly. Additionally, we serve as an aggregator of information: as individual labs find or fix problems, we can integrate this information into the reference assembly so everyone can have access to this data.
Insert dot matrix alignment: pull from assembly-assembly alignments
Alignments refer to pairs of sequences. Once you know how a pair of sequences go together, you can look at stringing the pairs along into a contig. The contig is essentially the consensus sequence that is produced from the components. To create a contig, we use the steps shown on this slide. What are switch points? As you create the consensus sequence of the contig, the switch points tell you where to stop using the sequence from one component and begin using the sequence from the next.
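The role of switch points can be sketched in a few lines of Python. This is a toy illustration only, not the GRC's assembly software: the `Component` class, the `build_contig` function, and the example sequences are all hypothetical. It assumes the components are already in a consistent orientation and that each switch point is given as a pair of offsets, one in the current component and one in the next.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str       # hypothetical component label, not a real accession
    sequence: str   # sequence already flipped into contig orientation


def build_contig(components, switch_points):
    """Stitch components into one consensus sequence.

    switch_points[i] = (stop, start): stop using components[i] at offset
    `stop` (exclusive) and begin using components[i + 1] at offset `start`.
    """
    if len(switch_points) != len(components) - 1:
        raise ValueError("need exactly one switch point per adjacent pair")
    pieces = []
    begin = 0  # offset at which we start reading the current component
    for i, comp in enumerate(components):
        if i < len(switch_points):
            stop, next_begin = switch_points[i]
        else:
            # Last component: read through to its end.
            stop, next_begin = len(comp.sequence), None
        pieces.append(comp.sequence[begin:stop])
        begin = next_begin
    return "".join(pieces)


# Toy tiling path: each component overlaps the next by four bases.
tiling_path = [
    Component("compA", "ACGTACGGTT"),
    Component("compB", "GGTTCCAATG"),
    Component("compC", "AATGTTTACG"),
]
# Switch at the start of each overlap: leave the current component at
# offset 6 and enter the next one at offset 0.
switches = [(6, 0), (6, 0)]
contig = build_contig(tiling_path, switches)
print(contig)
```

Because the overlapping bases are identical in this toy data, any switch point inside an overlap yields the same contig. With real components the overlaps can differ, which is why the choice of switch point determines which component's bases end up in the consensus.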
If you are not using the entire assembly in your efforts, you may be missing genes in your exome capture reagents.
Show alignment of a feature from first slide to show how far down the chromosome it has moved…
Keeping track of people is way easier than keeping track of assemblies.
RefSeqGene/LRG screen shot: stable coordinate system for gene level reporting. Gene centric genomic sequences.
Distribution of RefSeqGenes on GRCh37
Remap
Look up how much novel sequence was added. Across all patches: 35 Mb of sequence added.
For the intermediate build GRCh37B, we are updating a subset of the high-confidence bases, about 1000, as our proof-of-principle. This panel shows reads from NA12878 aligned to chr. 19 that identify a base with MAF=0 in the LIN37 locus. This creates a non-consensus splice site. To create accessioned sequence for correcting the reference, we are using cortex_con (Iqbal and Caccamo) to generate mini-contigs (>= 50 bp) from collections of 1kG and RP11 WGS reads, the former selected from random 1kG populations.
There are several mechanisms we can use for capturing decoy. Much of the decoy represents centromeric repeat sequence. In collaboration with Karen Hayden in Jim Kent’s lab at UCSC, the GRC is planning to include modeled centromeric sequences in GRCh38.
Adding NOVEL sequence for GRCh38 doesn’t just mean adding sequence that is completely unrepresented in GRCh37. While many of the NOVEL patches, like the one on the previous slide, represent indels, adding novel sequence also means adding sequence variants for regions too complex to be represented by a single path. There is substantial variation at the LRC/KIR region on chr. 19. As shown on this slide, not only has the GRC replaced the GRCh37 path, which was derived from components from different clone libraries, with a single haplotype path from the CHM1 assembly, it also now has 8 different haplotypes represented as alternate loci. The addition of another 10+ haplotypes at this locus is also under consideration.
Update to GRCh37.p13
The GRC has been releasing patches to the human assembly on a quarterly cycle, and we’re now at GRCh37.p12. There are two varieties of patches:
FIX patches correct existing assembly problems: chromosome will update, patches integrated in GRCh38
NOVEL patches add new sequence representations: will become alternate loci
This ideogram shows the current distribution of patches and alternate loci, and you can see that many regions have changed since GRCh37. Note that approximately 3% of the current public human assembly GRCh37 is associated with a region that is represented by a patch or alternate locus.
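A statistic like the "approximately 3%" above could be computed by merging the chromosome regions associated with patches and alternate loci and dividing the merged length by the assembly length. The Python sketch below shows that arithmetic under stated assumptions: the function names are hypothetical, the coordinates and chromosome length are made up for illustration, and intervals are treated as half-open `(start, end)` pairs. It does not reproduce the GRC's actual calculation or data.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or abuts the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(regions, total_length):
    """Fraction of total_length covered by the union of the regions."""
    covered = sum(end - start for start, end in merge_intervals(regions))
    return covered / total_length


# Made-up regions on a made-up 1,000,000 bp chromosome (not real GRC data).
toy_regions = [
    (100_000, 120_000),  # region with a hypothetical FIX patch
    (110_000, 125_000),  # overlapping region with a hypothetical NOVEL patch
    (500_000, 510_000),  # region with a hypothetical alternate locus
]
fraction = covered_fraction(toy_regions, 1_000_000)
print(f"{fraction:.1%}")
```

Merging first matters because a region can carry more than one patch or alternate locus; summing the raw interval lengths would count the overlapping bases twice.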