The Matched Annotation from NCBI and EMBL-EBI (MANE) Project

A standardized “default” transcript set
Joannella Morales, PhD
European Bioinformatics Institute (EMBL-EBI)
jmorales@ebi.ac.uk
ASHG 2019

Rationale
• Accurate identification and description of the genes in the human genome is
foundational for biology
• The availability of high-quality reference materials is essential for clinical genomics
• Comprehensive transcript annotation is central to this endeavor

Sources of transcript annotation:
RefSeq and Ensembl/GENCODE
NCBI’s RefSeq:
• NM_xxxxxx: manually annotated; XM_xxxxxx: automatically produced
• May not match the primary reference genome:
• Represents a prevalent, 'standard' allele, which is not always the reference allele
• Clinical annotation predominantly done using RefSeq transcripts
EBI’s Ensembl/GENCODE:
• ENSTxxxxxx: More manually-reviewed transcripts
• Must match primary reference genome
• On average more Ensembl transcripts per gene compared to RefSeqs
• Reference set for gnomAD/ExAC, GTEx, Decipher, 100,000 Genomes Project, ICGC, etc.
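The accession prefixes above can be told apart programmatically. A minimal sketch, assuming only the prefix conventions listed on this slide; the exact patterns (digit runs, optional ".version" suffix) and the function name are illustrative assumptions, and the IDs in the usage note are placeholders rather than real transcripts.

```python
import re

# Prefix meanings are taken from the slide; the regular expressions
# themselves are assumptions made for illustration.
_PATTERNS = [
    (re.compile(r"^NM_\d+(\.\d+)?$"), ("RefSeq", "manually annotated")),
    (re.compile(r"^XM_\d+(\.\d+)?$"), ("RefSeq", "automatically produced")),
    (re.compile(r"^ENST\d+(\.\d+)?$"), ("Ensembl/GENCODE", "transcript")),
]


def classify_transcript_id(transcript_id):
    """Return (source, description) for a transcript accession."""
    for pattern, label in _PATTERNS:
        if pattern.match(transcript_id):
            return label
    return ("unknown", "unknown")
```

Usage: classify_transcript_id("XM_000000.1") returns ("RefSeq", "automatically produced"); anything that fits none of the three patterns is reported as unknown.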

Rationale
• Comprehensive annotation is good BUT…
• This can cause some challenges in the clinical context
• There are numerous alternatively spliced transcripts for a given gene
• Transcripts get updated over time – version changes, hard to track a variant over time
• There is no standard
• Variant reporting can be done on any transcript
• Commonly used tools (gnomAD, HGMD, Decipher etc.) often have different “canonical” transcripts
• Which one(s) should be used?
• Often the longest transcript at the locus (or the first one described) is used
• Even though this one may not be relevant (e.g. minor or not expressed in tissue of interest)
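One of the challenges above is that transcript versions change over time. A minimal sketch of how a pipeline might detect that two reports cite the same transcript at different versions, assuming the common "accession.version" form; the helper names are hypothetical and the IDs used in the usage note are placeholders.

```python
def split_version(transcript_id):
    """Split 'ACCESSION.N' into (accession, N); version is None if absent."""
    accession, sep, version = transcript_id.rpartition(".")
    if sep and version.isdigit():
        return accession, int(version)
    return transcript_id, None


def same_transcript_different_version(id_a, id_b):
    """True when both IDs share an accession but carry different versions."""
    acc_a, ver_a = split_version(id_a)
    acc_b, ver_b = split_version(id_b)
    return acc_a == acc_b and ver_a != ver_b
```

Usage: same_transcript_different_version("NM_000001.3", "NM_000001.4") is True, which flags a variant description that may need re-checking against the newer sequence.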

Solution: Define a joint ‘representative’ transcript set
• Standardize transcript set across genomics browsers
• VEP, gnomAD, HGMD, COSMIC, UniProt, others all have their own “canonicals”
• Identify a transcript that captures the most information about each protein-coding
gene
• Standardize clinical reporting
• Useful as starting point for comparative/evolutionary genomics
• All transcripts should always be considered for clinical interpretation
• We are NOT saying that biology can be simplified to a single transcript at
each genomic locus

What is MANE?
(Matched Annotation from the NCBI and EMBL-EBI)
• A transcript set with the following attributes:
• Must match GRCh38 sequence
• 100% identical between the RefSeq and corresponding Ensembl transcript
• Across the 5’ UTR, CDS, and 3’ UTR
• Transcripts should be:
• Well-supported, expressed, conserved
• Representative of biology at each locus
• Phase 1 - MANE Select – One transcript for each protein-coding locus; to be used as “default”
across genomics resources
• Phase 2 - MANE Plus – Additional well-supported transcripts of particular interest
• For example, for clinical reporting
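The matching requirement can be sketched as a structural comparison. This is an illustration, not the project's implementation: it assumes both transcripts are described by GRCh38 coordinates, so that identical exon boundaries and CDS bounds imply identical 5’ UTR, CDS, and 3’ UTR sequence. The Transcript class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Transcript:
    # Hypothetical minimal model of a transcript on GRCh38.
    transcript_id: str
    chrom: str
    strand: str
    exons: Tuple[Tuple[int, int], ...]  # sorted genomic (start, end) pairs
    cds_start: int                      # genomic start of the coding region
    cds_end: int                        # genomic end of the coding region


def is_matched_pair(refseq, ensembl):
    """True when the two transcripts agree on every exon boundary and on
    the CDS bounds, i.e. on 5' UTR, CDS, and 3' UTR."""
    return (
        refseq.chrom == ensembl.chrom
        and refseq.strand == ensembl.strand
        and refseq.exons == ensembl.exons
        and refseq.cds_start == ensembl.cds_start
        and refseq.cds_end == ensembl.cds_end
    )
```

A pair that differs only at a UTR end fails this check, which is the situation the UTR review step is meant to resolve.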

MANE Select Methodology
• Automated with a layer of manual review
• Built independent pipelines to select a transcript from each set
• RefSeq Select Pipeline
• Expression
• Conservation
• Representation in UniProt and Ensembl
• Length
• Prior manual curation (LRG)
• Ensembl Select Pipeline
• Length
• Expression
• Conservation
• Representation in UniProt and RefSeq
• Coverage of pathogenic variants
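The slides list the evidence each pipeline considers but not how it is combined. The sketch below assumes a simple weighted sum over scores normalised to 0..1; the weights, field names, and the weighted-sum approach itself are all assumptions for illustration, not the RefSeq Select or Ensembl Select method.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    transcript_id: str
    expression: float          # assumed normalised to 0..1
    conservation: float        # assumed normalised to 0..1
    in_other_resources: float  # e.g. representation in UniProt and the other set
    length: float              # assumed normalised to 0..1
    extra: float               # prior curation (RefSeq) or pathogenic-variant
                               # coverage (Ensembl)


# Hypothetical weights; the real pipelines' priorities are not given here.
WEIGHTS = {
    "expression": 0.3,
    "conservation": 0.3,
    "in_other_resources": 0.2,
    "length": 0.1,
    "extra": 0.1,
}


def score(candidate, weights=WEIGHTS):
    """Weighted sum of the candidate's evidence scores."""
    return sum(w * getattr(candidate, name) for name, w in weights.items())


def select_transcript(candidates, weights=WEIGHTS):
    """Return the highest-scoring candidate for one locus."""
    return max(candidates, key=lambda c: score(c, weights))
```

Running two such selectors independently, one per annotation set, gives the pair of picks that the later comparison step sorts into bins.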

MANE Select Methodology
[Figure: three-step workflow. Step 1, Select: one transcript is chosen from RefSeq and one from Ensembl/GENCODE. Step 2, Review: review UTRs. Step 3, Match: identical splicing, CDS, and UTRs give the MANE Select transcript.]

Initial pipeline comparison and bins
• Bin 1: Identical
• Bin 2: Same CDS, but different UTR length or splicing pattern
• Bin 3: Different CDS, with or without different UTR length or splicing pattern
[Figure labels: "Majority of cases"; "Complex loci or annotation differences"]
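The three bins can be expressed as a small decision function. A minimal sketch, assuming each pipeline's pick is given as a dict with hypothetical keys "exons" (all exon coordinates) and "cds" (coding segments only):

```python
def assign_bin(refseq_pick, ensembl_pick):
    """Classify a pair of pipeline picks into bin 1, 2, or 3.

    Each pick is a dict with 'exons' and 'cds', both tuples of genomic
    (start, end) pairs. This representation is an assumption.
    """
    if refseq_pick["cds"] != ensembl_pick["cds"]:
        return 3  # different CDS, whatever the UTRs look like
    if refseq_pick["exons"] != ensembl_pick["exons"]:
        return 2  # same CDS, different UTR length or splicing
    return 1      # identical
```

Checking the CDS first mirrors the bin definitions: any CDS difference puts the pair in bin 3 regardless of how the UTRs compare.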

Reducing Bin 2
Bin 2 = Both pipelines pick same CDS. Chosen ENST and NM only differ in UTR
length and/or UTR splicing pattern
• Defined rules to jointly define extent of 5’ and 3’ UTRs
• “Longest strong”
• Trimmed/Extended ends in an automated manner

Selecting UTRs, 5’ end:
CAGE = Cap Analysis of Gene Expression, developed by RIKEN
This is a way of getting the full 5’ end of messenger RNA. The output of CAGE is tags, and these give a
quantification of the RNA abundance.
[Figure: Ensembl Genome Browser view of KNG1 with Ensembl/GENCODE, RefSeq, RNAseq, and CAGE counts tracks; candidate 5’ ends are labelled "Longest", "Strongest", and "Longest strong".]
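The slides name a "longest strong" rule without giving its thresholds. One plausible reading, sketched here as an assumption: among candidate 5’ ends, keep those whose CAGE count is at least some fraction of the strongest peak, then take the most upstream one. The 0.5 cut-off, the function name, and the input format are hypothetical.

```python
def longest_strong_start(cage_counts, strand="+", min_fraction=0.5):
    """Pick the most upstream start site with 'strong' CAGE support.

    cage_counts maps a candidate start position to its CAGE tag count.
    min_fraction is a hypothetical cut-off relative to the strongest peak.
    """
    if not cage_counts:
        raise ValueError("no candidate start sites")
    strongest = max(cage_counts.values())
    strong = [pos for pos, n in cage_counts.items()
              if n >= min_fraction * strongest]
    # Upstream means the lower coordinate on '+', the higher one on '-'.
    return min(strong) if strand == "+" else max(strong)
```

With counts {100: 2, 150: 60, 200: 100} on the plus strand, position 100 is the longest, 200 is the strongest, and 150 is the longest strong under this reading.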

Selecting UTRs, 3’ end:
PolyA seq: This is data from the 3’ end. It is the sequence from the polyadenylated
region of mRNA, defining the end of a transcript.
[Figure: NCBI’s Genome Data Viewer view of REM2 with Ensembl, RefSeq, RNAseq, PolyA counts, and INSDC coverage tracks; candidate 3’ ends are labelled "Longest" and "Longest strong".]

Reducing Bin 3
• Bin 3 = Pipelines picked different CDS
• Improved pipelines, based on review of genes in bin 3
• Manually curating genes unresolved after pipeline improvement (prioritizing clinical
genes)
• This is the hardest bin!
• In some cases, there is no right answer. Either one could be selected. This is
biology!
• In other cases, the corresponding transcript in the other set does not exist, thus
requiring a full annotation update. Very time consuming!

MANE Select Progress Update
• In April, we released:
• MANE Select v0.5 on all browsers, with coverage of 54% across the genome
• In September, we released:
• MANE Select v0.6 on all browsers, increasing MANE Select coverage to 67% across the genome
• Identified an additional 4%, which will increase coverage to 71% across the genome
• We are aiming to increase coverage to 75 – 80% by the end of the year
• Our ultimate goal is to achieve genome-wide coverage by 2020

Accessing MANE: NCBI’s FTP
ftp://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/
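Once a file from the FTP directory has been downloaded, the matched pairs can be loaded into a lookup table. A minimal sketch, assuming a tab-delimited summary file with a header row; the column names used as defaults ("symbol", "RefSeq_nuc", "Ensembl_nuc") are assumptions to check against the actual file, and the rows in the usage note are placeholders, not real data.

```python
import csv


def read_mane_pairs(lines, symbol_col="symbol",
                    refseq_col="RefSeq_nuc", ensembl_col="Ensembl_nuc"):
    """Map gene symbol -> (RefSeq transcript, Ensembl transcript).

    `lines` is any iterable of tab-delimited text lines whose first line
    is a header. Column names are assumptions; adjust them to the file.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return {
        row[symbol_col]: (row[refseq_col], row[ensembl_col])
        for row in reader
    }
```

Usage: pass an open text file handle, or a list of lines such as ["symbol\tRefSeq_nuc\tEnsembl_nuc", "GENE_A\tNM_000000.0\tENST00000000000.0"], and look up pairs by symbol.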

Limitations
• MANE Select does not capture biological complexity (requires a single choice)
• Transcripts excluded may score approximately equal to MANE Select on any or all
supporting attributes
• Tissue-specificity vs general pattern of expression
• Most highly supported transcript might exclude important tissue specific or clinically
relevant isoforms
• Gaps in data
• Insufficient information to determine transcriptional specificity
• Transcript level quantification still difficult

Summary
• NCBI and EMBL-EBI are working together to review annotation and produce a matched set
of “high-value” transcripts
• These transcripts will match GRCh38 and will represent 100% identity between a RefSeq
and its corresponding Ensembl transcript
• We will define one “default” transcript per locus (MANE Select)
• We aim to have widespread adoption of MANE Select as default across genomics resources
• We will define additional well-supported transcripts (MANE Plus)
• We expect all transcripts required for clinical reporting to be in Select and Plus
• Feedback welcome - MANE-help@ebi.ac.uk

Acknowledgments
Ensembl/LRG curators
Jane Loveland
Joannella Morales
Ruth Bennett
Andrew Berry
Claire Davidson
Laurent Gil
Jose Manuel Gonzalez
Matt Hardy
Mike Kay
Aoife McMahon
Marie-Marthe Suner
Glen Threadgold
Fiona Cunningham, Variation Annotation Team Lead
Adam Frankish, Manual Genome Annotation Coordinator
MANE-help@ebi.ac.uk
RefSeq Curators
Shashi Pujar
Eric Cox
Catherine Farrell
Tamara Goldfarb
John Jackson
Vinita Joardar
Kelly McGarvey
Michael Murphy
Nuala O’Leary
Bhanu Rajput
Sanjida Rangwala
Lillian Riddick
David Webb
Terence Murphy, RefSeq Team Lead
RefSeq Developers
Alex Astashyn
Olga Ermolaeva
Vamsi Kodali
Craig Wallin
MANE-help@ncbi.nlm.nih.gov
This research was supported by the Intramural Research
Program of the NIH, National Library of Medicine.
