Better vocabulary = better bioinformatics with standardized databases and ontologies

WHAT'S IN A NAME?
Better vocabulary = better bioinformatics???

From Flickr user giantginkgo
# Author: Keith Bradnam, Genome Center, UC Davis
# This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike
3.0 Unported License.

http://biomickwatson.wordpress.com

Most of the interesting 'stuff' that I discover about bioinformatics and genomics comes from
a) Twitter, b) blogs, and c) papers (in that order). Mick Watson has a fun and engaging blog
about bioinformatics and today he raised an important point: the lack of standardization in
scientific databases leads to frustration (and frustration leads to...suffering).

These are some terms that appear in the same database. You can code solutions for some of
this variation (e.g. British/American English differences or presence/absence of underscore vs
space character), but who wants to waste time doing that? Shouldn't these databases be
using controlled vocabularies?
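
To show the kind of clean-up code you end up writing when there is no controlled vocabulary, here is a minimal sketch. The spelling table and the example terms are made up for illustration; they are not taken from the database on the slide.

```python
import re

# Hypothetical British -> American spelling table. A real controlled
# vocabulary would define the canonical terms for you.
SPELLING_VARIANTS = {
    "tumour": "tumor",
    "haemoglobin": "hemoglobin",
}


def normalize_term(term):
    """Collapse case, underscore/space and spelling variants of a term."""
    # Lower-case, then split on underscores and runs of whitespace.
    words = re.split(r"[\s_]+", term.strip().lower())
    # Swap any known spelling variant for its canonical form.
    words = [SPELLING_VARIANTS.get(word, word) for word in words if word]
    return " ".join(words)


variants = ["Tumour_sample", "tumor sample", "TUMOR  SAMPLE"]
canonical = {normalize_term(v) for v in variants}
```

All three variants collapse to one string here, but only because the spelling table happens to cover them. Every new variant means another patch, which is exactly the wasted time the slide complains about.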

This infamous paper from 2004 reveals how easy it is to introduce errors into biological
databases.

First highlighted column = actual gene name.
Second highlighted column = what Excel will automatically assume you mean.

RIKEN ID: 2310009E13

Happens for other identifiers as well. This RIKEN ID will change if it ever ends up in Excel...

RIKEN ID: 2.31E+13

...now it appears as a number in scientific notation.

The paper shows that these 'dates-as-gene-names' ended up propagating to other
databases.
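
One defensive habit is to scan a list of identifiers for anything a spreadsheet might reinterpret before the list goes anywhere near one. The sketch below is a rough heuristic, not a model of what Excel actually does: the month list and both patterns are my own assumptions and will miss cases.

```python
import re
from typing import Optional

# Assumed, non-exhaustive list of month-like prefixes that a spreadsheet
# may read as the start of a date.
_MONTHS = "JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC"
_DATE_LIKE = re.compile(rf"^(?:{_MONTHS})-?\d{{1,2}}$", re.IGNORECASE)
# Digits, a single 'E', then digits: looks like scientific notation.
_NUMBER_LIKE = re.compile(r"^\d+E\d+$", re.IGNORECASE)


def spreadsheet_risk(identifier: str) -> Optional[str]:
    """Return 'date' or 'number' if the identifier looks convertible, else None."""
    if _DATE_LIKE.match(identifier):
        return "date"
    if _NUMBER_LIKE.match(identifier):
        return "number"
    return None


for name in ["DEC1", "SEPT2", "2310009E13", "abu-11"]:
    print(name, spreadsheet_risk(name))
```

Anything flagged should be imported as text rather than left to automatic type detection.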

I searched today for '2-Sep' at GenBank and this was the only hit. It's possible that this is an
intended gene-name variant, but Septin 2 is usually referred to as sep2/sept2/sep-2 etc. So
this is possibly another Excel-based error.

Sometimes people assume that a gene name is unique to a specific function.
DEC1 (one of the Excel-ified gene names mentioned in the earlier paper) can mean one thing
to people working on many vertebrate species...

...but something else if you work on fruit flies. It is dangerous to make any assumptions when it
comes to gene names.

Consider one worm gene...

Here is one Caenorhabditis elegans gene (abu-11) in WormBase. There is the official gene
name, a sequence name, 'other' names, the WormBase gene ID, plus other identifiers for
external databases which also describe the gene (there's also a protein ID, not shown here).

In C. elegans, gene names have a central naming authority (the CGC) but genes often get
renamed. Just look at these pqn genes which have been renamed or merged with other
genes.

This is the current view of the twk-43 gene in C. elegans (aka F32H5.7[abc]).

WormBase allows you to see the history behind genes. This gene started out as just F32H5.2,
a gene with no splice isoforms.

Then at some point it was split into 3 genes...

...before being converted into the current one gene (with four splice isoforms). Genes are
split and merged and renamed all the time. Relying on the common gene name (e.g. twk-43)
or the sequence identifier (F32H5.7) can get you into trouble.
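
One way to cope with renaming is to key your own data on a stable database ID and treat every name as a mere alias. The sketch below assumes a small hand-made table: the `EXAMPLE:` IDs are invented placeholders (not real WormBase gene IDs), and only the gene and sequence names come from the slides.

```python
# Hypothetical records keyed by a stable ID. The IDs are placeholders made
# up for this example; the names are the ones mentioned in the talk.
GENES = {
    "EXAMPLE:00000001": {"name": "twk-43", "aliases": ["F32H5.7", "F32H5.2"]},
    "EXAMPLE:00000002": {"name": "abu-11", "aliases": []},
}


def build_alias_index(genes):
    """Map every current or former name (lower-cased) to a set of stable IDs."""
    index = {}
    for stable_id, record in genes.items():
        for alias in [record["name"], *record["aliases"]]:
            index.setdefault(alias.lower(), set()).add(stable_id)
    return index


def resolve(name, index):
    """Return the single stable ID for a name, refusing unknown or ambiguous names."""
    hits = index.get(name.lower(), set())
    if len(hits) != 1:
        raise LookupError(f"{name!r} matches {len(hits)} genes")
    return next(iter(hits))


ALIAS_INDEX = build_alias_index(GENES)
print(resolve("F32H5.2", ALIAS_INDEX))  # same stable ID as resolve("twk-43", ...)
```

Raising an error on ambiguous names is deliberate: a name that points at two genes should stop the script rather than silently pick one.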

SOLUTIONS

What can be done to help with these sorts of problems?

Use ontologies and understand what those ontologies do.

Three main parts to a Gene Ontology term (GO term):
1) The name
2) The accession
3) The definition (which can change)

A fourth major part of a GO term is that it has ancestors and children. A single term can be
'part of' other terms and can also be an example of ('is a') other terms. E.g. a nuclear outer
membrane *is* a nuclear membrane and is *part of* the cell.
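
The parts of a term described above can be sketched as a small data structure. This is a toy three-term graph, not the real ontology: the `EX:` accessions and the definitions are placeholders, and the relationships are simplified from the example in the notes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Term:
    accession: str    # the stable key; use this in scripts, not the name
    name: str
    definition: str   # free text that may be revised over time
    is_a: List[str] = field(default_factory=list)
    part_of: List[str] = field(default_factory=list)


# Placeholder accessions and definitions, invented for this sketch.
TERMS: Dict[str, Term] = {
    "EX:0000001": Term("EX:0000001", "cell", "toy definition"),
    "EX:0000002": Term("EX:0000002", "nuclear membrane", "toy definition",
                       part_of=["EX:0000001"]),
    "EX:0000003": Term("EX:0000003", "nuclear outer membrane", "toy definition",
                       is_a=["EX:0000002"]),
}


def ancestors(accession: str, terms: Dict[str, Term]) -> Set[str]:
    """Collect every term reachable by following is_a and part_of links upward."""
    seen: Set[str] = set()
    stack = [accession]
    while stack:
        current = terms[stack.pop()]
        for parent in current.is_a + current.part_of:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("EX:0000003", TERMS)))
```

Walking upward like this is why a gene annotated with a specific term also counts as annotated with all of that term's ancestors.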

Most model organism databases are loaded up with GO terms. E.g. you can search GO terms
from the 'front door' of FlyBase.

In WormBase, the same GO term search takes you directly to a gene page.

Scroll down on that gene page and we see the specified GO term...but what is an 'evidence
code', and what does 'IDA' mean?

Sadly the majority of people who use GO terms (as part of 'DAVID' analyses etc.) have no
knowledge of evidence codes.

All GO terms should be connected to genes (or other database entries) with evidence codes.
An evidence code gives you an idea of how robust the assignment is. Databases like WormBase
have curators who scan papers (by eye, but also with software) to find suitable GO terms that
can be added to genes on the basis of experiments described in the paper.

Most of the GO terms you will ever see have the IEA evidence code (Inferred from Electronic
Annotation). It is among the weakest of all evidence (avoid any evidence which is
'non-traceable author statement'). It could simply mean that a human protein (with some known
information) was BLASTed against a yeast genome and the resulting yeast match acquired the
human meta-information as GO terms. IEA codes should be treated with some suspicion.
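
If you want to see how much of an annotation set rests on IEA, counting evidence codes is straightforward. The sketch below assumes the tab-separated GAF layout, where the third, fifth and seventh columns hold the gene symbol, the term ID and the evidence code. The rows are made up, and the choice of which codes count as 'weak' is mine.

```python
import io
from collections import Counter

# Made-up rows in a GAF-like layout (only the first seven columns).
ROWS = [
    ["EX", "G1", "geneA", "", "EX:0000001", "REF:1", "IDA"],
    ["EX", "G2", "geneB", "", "EX:0000001", "REF:2", "IEA"],
    ["EX", "G3", "geneC", "", "EX:0000002", "REF:3", "IEA"],
    ["EX", "G4", "geneD", "", "EX:0000002", "REF:4", "NAS"],
]
SAMPLE = "!gaf-version: 2.2\n" + "\n".join("\t".join(r) for r in ROWS) + "\n"


def read_annotations(handle):
    """Yield symbol, term and evidence code for each non-comment line."""
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        yield {"symbol": cols[2], "term": cols[4], "evidence": cols[6]}


annotations = list(read_annotations(io.StringIO(SAMPLE)))
counts = Counter(a["evidence"] for a in annotations)
iea_fraction = counts["IEA"] / len(annotations)

# Assumption for this sketch: treat IEA and NAS as the codes to set aside.
WEAK_CODES = {"IEA", "NAS"}
trusted = [a for a in annotations if a["evidence"] not in WEAK_CODES]
```

Swap `io.StringIO(SAMPLE)` for an open file handle to run the same count on a real annotation file.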

48.2% of GO annotations, in one of the best annotated eukaryotic animal genomes, are generated automatically

The Gene Ontology website shows how many GO terms are attached to genes in different
organisms. Even in C. elegans (with >15 years of gene annotation), about half of the GO
terms are in the IEA category.

Gene Ontology is not the only game in town. Sequence Ontology (SO) is widely used and a
subset of SO terms is used in GFF files to describe features (or at least they should be!).
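
A quick sanity check on a GFF file is to compare the feature type in column 3 against the SO terms you expect. This is a minimal sketch: the allowed set below is a small illustrative subset of SO term names, and the GFF lines are invented.

```python
# Small illustrative subset of SO term names; a real check would load the
# full list from the Sequence Ontology itself.
ALLOWED_TYPES = {"gene", "mRNA", "exon", "CDS", "five_prime_UTR", "three_prime_UTR"}

# Invented example lines; the second feature uses a non-SO type on purpose.
GFF_LINES = [
    "##gff-version 3",
    "\t".join(["chrI", "example", "gene", "100", "900", ".", "+", ".", "ID=g1"]),
    "\t".join(["chrI", "example", "5'UTR", "100", "150", ".", "+", ".", "Parent=g1"]),
]


def non_so_types(lines, allowed):
    """Return (line number, type) for each feature whose type is not allowed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.split("\t")
        if len(cols) >= 3 and cols[2] not in allowed:
            problems.append((number, cols[2]))
    return problems


problems = non_so_types(GFF_LINES, ALLOWED_TYPES)
```

Here the `5'UTR` feature is reported because the SO name for it is `five_prime_UTR`.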

GO and SO are part of OBO (Open Biological Ontologies: http://www.obofoundry.org). There
may be a community developing an ontology for your field of interest. This site lists them all.

Use ontologies whenever possible
Don't assume that identifiers in existing databases are the correct (or only) identifiers
Be careful when inflicting new database identifiers on the world!

On the last point, check that your identifiers (even if they end up buried in supplementary
material somewhere) don't conflict with other databases out there. Long and boring
identifiers are usually the most stable and the most easily parsed by scripts (although they
are the least human-friendly). But no spaces or asterisks in identifiers please!

This talk is KORF_labtalk_00000315
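
Generating and checking identifiers in that 'long and boring' style takes only a few lines. The pattern below is my guess at a convention, inferred from the single ID above (underscore-separated alphanumeric prefix plus an eight-digit, zero-padded counter). It is not a documented scheme.

```python
import re

# Assumed convention: alphanumeric words joined by underscores, ending in
# an eight-digit counter. No spaces, asterisks or other punctuation.
ID_PATTERN = re.compile(r"^[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*_\d{8}$")


def make_identifier(prefix, number, width=8):
    """Build a zero-padded identifier such as PREFIX_00000315."""
    return f"{prefix}_{number:0{width}d}"


def is_valid(identifier):
    """True if the identifier follows the assumed convention."""
    return bool(ID_PATTERN.fullmatch(identifier))


def is_new(identifier, existing):
    """True if the identifier is valid and does not collide with known IDs."""
    return is_valid(identifier) and identifier not in existing


talk_id = make_identifier("KORF_labtalk", 315)
```

Fixed-width numbers sort correctly as plain text, and a strict pattern lets a script reject anything that slipped in by hand.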
