Ashg2015 grc-pruitt
1. RefSeq curation and annotation of the reference human genome GRCh38
Kim D. Pruitt
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
www.ncbi.nlm.nih.gov/refseq/
2. RefSeq Background
• RefSeq provides:
  • Human genome annotation
  • Known transcripts & proteins (manually curated)
  • Model transcripts & proteins (annotation pipeline)
• Collaborations:
  • Genome Reference Consortium (GRC)
  • HUGO Gene Nomenclature Committee (HGNC)
  • Consensus CDS (CCDS) Collaboration (HAVANA curators)
  • RefSeqGene/Locus Reference Genomic (LRG)/LSDB
RefSeq: www.ncbi.nlm.nih.gov/refseq/
Gene: www.ncbi.nlm.nih.gov/gene/
An NCBI project to provide reference sequence standards that incorporate current knowledge.
Archaea, Bacteria, Eukaryotes, Viruses
3. Curation support of genic regions of the reference human assembly
• RefSeqGene and LRG collaboration
  • Genomic and cDNA standards for clinical reporting
  • Report potential issues to the GRC
• Consensus CDS collaboration
  • Stabilized human CDS annotation
  • Report potential issues to the GRC
• RefSeq
  • Curation of genes, transcript & protein records
  • Report potential issues to the GRC
  • Review GRC patch updates for gene annotation impact
5. Transition from GRCh37 to GRCh38
• Identify gene/sequence differences vs. GRCh38
• Automatic update at synonymous mismatches
• Curation review of remainder
• >5,100 Known RefSeq transcripts updated since October 2013
• 47,031 Known RefSeqs identical to genome
• 2,916 intentionally retain a mismatch or indel
• ~600 pending
• ~132 genes merged
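The triage described above (automatic update at synonymous mismatches, curation review of the remainder) can be sketched in Python. This is an illustrative sketch only, not NCBI's pipeline: the function names `translate` and `triage`, the category labels, and the assumption that both inputs are already-extracted, in-frame CDS strings are all hypothetical. It reads "synonymous" as "same translated protein under the standard genetic code".

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate an in-frame CDS, ignoring any trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def triage(transcript_cds: str, genome_cds: str) -> str:
    """Classify a transcript/genome pair (hypothetical labels).

    'identical'   : nothing to do
    'auto_update' : same length, same protein (synonymous mismatch only)
    'curate'      : indel or protein-changing difference, needs manual review
    """
    if transcript_cds == genome_cds:
        return "identical"
    if len(transcript_cds) != len(genome_cds):
        return "curate"
    if translate(transcript_cds) == translate(genome_cds):
        return "auto_update"
    return "curate"


# Toy sequences for illustration only.
print(triage("ATGGCTTAA", "ATGGCCTAA"))  # GCT and GCC both encode alanine
print(triage("ATGGCTTAA", "ATGGATTAA"))  # GCT (Ala) versus GAT (Asp)
```

A real comparison would also have to handle mismatches in untranslated regions and alignment gaps, which this sketch leaves out.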
[Figure: number of Known RefSeq updates per quarter, 2013 Q1 through 2015 Q3 (axis range 0 to 1200); * marks the GRCh38 release, 12/24/2013]
6. Updating RefSeq to match GRCh38
• Post GRCh38 review:
• NM_173477 updated to match genome (NM_173477.4)
• Model RefSeq XM_005257026.1 promoted to Known RefSeq
[Figure: alignment views against GRCh38 and GRCh37]
8. RefSeq curation & genome maintenance
• POLR2A (GeneID:5430) NM_000937.4 has a 2 nt deletion vs. GRCh38
• This maintains the correct reading frame
[Figure: alignment against GRCh38]
9. RefSeq curation & genome maintenance
• RefSeq reported this sequence issue to the GRC
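The reading-frame argument behind the POLR2A example can be sketched in Python. This is a minimal illustration under stated assumptions: it reads the slide as saying the genome carries two bases that the transcript lacks, the helper names `shifts_frame` and `codons` are hypothetical, and the sequences are toy strings, not POLR2A sequence.

```python
def shifts_frame(indel_length: int) -> bool:
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


def codons(seq: str) -> list[str]:
    """Split a sequence into complete codons, dropping any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


# Toy sequences for illustration only (not POLR2A sequence).
transcript_cds = "ATGGCTGAAAAGTAA"
# Hypothetical genome version carrying two extra bases after the second codon.
genome_cds = transcript_cds[:6] + "CC" + transcript_cds[6:]

print(shifts_frame(len(genome_cds) - len(transcript_cds)))
print(codons(transcript_cds))
print(codons(genome_cds))
```

In the toy case every codon downstream of the two extra bases changes, which is why a transcript that omits them keeps the intended frame.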
10. GRCh38 ALT LOCI and PATCHES
Pre-Patch & ALT review
Polymorphic pseudogenes
Haplotype & CNV variation
ALT-specific RefSeq records
Curator-stored placement data
Evidence-based genome annotation pipeline
Manual Curation
Assembly-ALT alignments
Alignment quality reports
Subsequent genome annotation build corrects the annotation
Interim alignment updates
11. Polymorphic pseudogenes
• RefSeq provides different transcripts to represent the protein-coding gene versus the pseudogene
• Curators store assembly placement information (chromosome versus ALT) in a local database
• This is used by the annotation pipeline to ensure correct annotation
An example, the GSTT cluster on chromosome 22:

Assembly Unit    GSTT1   GSTT2   GSTT2B  GSTTP1  GSTTP2
GRCh38 chr22     null    pseudo  coding  pseudo  null
ALT_REF_LOCI_1   coding  coding  coding  pseudo  pseudo
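The GSTT placement data above can be pictured as a small lookup table. The sketch below is hypothetical: the slides do not describe the schema of the curators' local database, so the dictionary layout and the function names `expected_status` and `genes_that_differ` are assumptions made for illustration. Only the status values come from the table.

```python
# Per-assembly-unit status of each GSTT gene, copied from the table.
PLACEMENT = {
    "GRCh38 chr22": {
        "GSTT1": "null",
        "GSTT2": "pseudo",
        "GSTT2B": "coding",
        "GSTTP1": "pseudo",
        "GSTTP2": "null",
    },
    "ALT_REF_LOCI_1": {
        "GSTT1": "coding",
        "GSTT2": "coding",
        "GSTT2B": "coding",
        "GSTTP1": "pseudo",
        "GSTTP2": "pseudo",
    },
}


def expected_status(assembly_unit: str, gene: str) -> str:
    """Return the stored status for a gene on a given assembly unit."""
    return PLACEMENT[assembly_unit][gene]


def genes_that_differ(unit_a: str, unit_b: str) -> list[str]:
    """List genes whose stored status differs between two assembly units."""
    return [
        gene
        for gene in PLACEMENT[unit_a]
        if PLACEMENT[unit_a][gene] != PLACEMENT[unit_b][gene]
    ]


print(expected_status("GRCh38 chr22", "GSTT2"))
print(genes_that_differ("GRCh38 chr22", "ALT_REF_LOCI_1"))
```

Keying the lookup on the assembly unit first mirrors the point of the slide: the same gene symbol can call for a different record type depending on where it is being annotated.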
12. GSTT* variation, chromosome 22
• Copy number variation of glutathione-S-transferase theta genes is associated with digestive tract cancers and other conditions
• Accurate gene annotation is important to downstream users
[Figure: GSTT region on GRCh38 chr22 and GRCh38 ALT; labels: pseudogene, chr22 = null allele, coding allele; conditions listed: ulcerative colitis, laryngeal cancer, esophageal cancer, colorectal cancer]