Motivation:
Annotations are a key feature of many biological databases, used to convey our knowledge of a sequence to the reader. Ideally, annotations are curated manually; however, manual curation is costly and time-consuming, and it requires expert knowledge and training. Given these issues and the exponential increase of data, many databases implement automated annotation pipelines in an attempt to avoid un-annotated entries. Both manual and automated annotations vary in quality between databases and annotators, making assessment of annotation reliability problematic for users. The community lacks a generic measure for determining annotation quality and correctness, which we aim to address in this work. Specifically, we investigate word reuse within bulk textual annotations and relate this to Zipf's Principle of Least Effort. We use the UniProt Knowledgebase (UniProtKB) as a case study to demonstrate this approach, since it allows us to compare annotation change both over time and between automated and manually curated annotations.
Results:
By applying power-law distributions to word reuse in annotation, we show clear trends in UniProtKB over time, which are consistent with existing studies of quality in free-text English. Further, we show a clear distinction between manual and automated analysis and investigate cohorts of protein records as they mature. These results suggest that this approach holds distinct promise as a mechanism for judging annotation quality.
More information is available at the author's website: www.michaeljbell.co.uk
An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB
1. An approach to describe and analyse bulk annotation quality
Michael J Bell*, Colin Gillespie, Daniel Swan and Phillip Lord
*m.j.bell1@ncl.ac.uk
www.michaeljbell.co.uk
2. Talk Outline
• Annotation Quality? Why UniProtKB?
• Data extraction
• Applying power laws
• Analysing Swiss-Prot and TrEMBL annotation
• Discussion and Conclusion
• Questions
Michael J Bell @mj_bell
Newcastle University 2
m.j.bell1@ncl.ac.uk
3. Annotation Quality in UniProtKB
4. ID PAX6_RAT Reviewed; 422 AA.
AC P63016; A1A5N7; P32117; P70601; Q62222; Q64037; Q6QHS5; Q701Q8;
DT 31-AUG-2004, integrated into UniProtKB/Swiss-Prot.
DT 31-AUG-2004, sequence version 1.
DT 11-JUL-2012, entry version 74.
DE RecName: Full=Paired box protein Pax-6;
DE AltName: Full=Oculorhombin;
GN Name=Pax6; Synonyms=Pax-6, Sey;
OS Rattus norvegicus (Rat).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi;
OC Muroidea; Muridae; Murinae; Rattus.
OX NCBI_TaxID=10116;
RN [1]
RP NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 1).
RA Gimlich R., Arnold G.S., Wawersik S., Maas R., Wong G.;
RT "Pax-6 is required for pancreatic islet development.";
RL Submitted (SEP-1996) to the EMBL/GenBank/DDBJ databases.
RN [2]
RP NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 5A).
RC STRAIN=New England Deaconess Hospital, and Sprague-Dawley;
RA Karkour A., Wolf G.M., Walther R.;
RL Submitted (FEB-2004) to the EMBL/GenBank/DDBJ databases.
RN [3]
RP NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 5A).
RC STRAIN=Sprague-Dawley; TISSUE=Brain;
RA Wei F.;
RT "Cloning the homologic isoform gene pax6 5a in the rat.";
RL Submitted (FEB-2004) to the EMBL/GenBank/DDBJ databases.
RN [4]
RP NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA] (ISOFORM 1).
RC TISSUE=Heart;
RX PubMed=15489334; DOI=10.1101/gr.2596504;
RG The MGC Project Team;
RT "The status, quality, and expansion of the NIH full-length cDNA
RT project: the Mammalian Gene Collection (MGC).";
RL Genome Res. 14:2121-2127(2004).
RN [5]
RP PARTIAL NUCLEOTIDE SEQUENCE [MRNA], AND INVOLVEMENT IN SEY.
RC STRAIN=Sprague-Dawley; TISSUE=Embryo;
RX MEDLINE=95072652; PubMed=7981749; DOI=10.1038/ng0493-299;
RA Matsuo T., Osumi-Yamashita N., Noji S., Ohuchi H., Koyama E.,
RA Myokai F., Matsuo N., Taniguchi S., Doi H., Iseki S., Ninomiya Y.,
RA Fujiwara M., Wantanabe T., Eto K.;
RT "A mutation in the Pax-6 gene in rat small eye is associated with
RT impaired migration of midbrain crest cells.";
RL Nat. Genet. 3:299-304(1993).
RN [6]
RP FUNCTION.
RX MEDLINE=21869997; PubMed=11880342;
RA Takahashi M., Osumi N.;
RT "Pax6 regulates specification of ventral neurone subtypes in the
RT hindbrain by establishing progenitor domains.";
RL Development 129:1327-1338(2002).
CC -!- FUNCTION: Transcription factor with important functions in the
CC development of the eye, nose, central nervous system and pancreas.
CC Required for the differentiation of pancreatic islet alpha cells.
CC Competes with PAX4 in binding to a common element in the glucagon,
CC insulin and somatostatin promoters (By similarity). Regulates
CC specification of the ventral neuron subtypes by establishing the
CC correct progenitor domains.
CC -!- SUBUNIT: Interacts with MAF and MAFB (By similarity). Interacts
CC with TRIM11; this interaction leads to ubiquitination and
CC proteasomal degradation, as well as inhibition of transactivation,
CC possibly in part by preventing PAX6 binding to consensus DNA
CC sequences (By similarity).
CC -!- SUBCELLULAR LOCATION: Nucleus (By similarity).
CC -!- ALTERNATIVE PRODUCTS:
CC Event=Alternative splicing; Named isoforms=2;
CC Name=1;
CC IsoId=P63016-1; Sequence=Displayed;
CC Name=5a; Synonyms=Pax6-5a;
CC IsoId=P63016-2; Sequence=VSP_011531;
CC -!- PTM: Ubiquitinated by TRIM11, leading to ubiquitination and
CC proteasomal degradation (By similarity).
CC -!- DISEASE: Note=Defects in Pax6 are the cause of a condition known
CC as small eye (Sey) which results in the complete lack of eyes and
CC nasal primordia.
CC -!- SIMILARITY: Belongs to the paired homeobox family.
CC -!- SIMILARITY: Contains 1 homeobox DNA-binding domain.
CC -!- SIMILARITY: Contains 1 paired domain.
CC -----------------------------------------------------------------------
CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC Distributed under the Creative Commons Attribution-NoDerivs License
CC -----------------------------------------------------------------------
DR EMBL; U69644; AAB09042.1; -; mRNA.
DR EMBL; AY540905; AAS48919.1; -; mRNA.
DR EMBL; AY540906; AAS48920.1; -; mRNA.
DR EMBL; AJ627631; CAF29075.1; -; mRNA.
DR EMBL; BC128741; AAI28742.1; -; mRNA.
DR EMBL; S74393; AAB32671.1; ALT_TERM; mRNA.
DR IPI; IPI00231698; -.
DR IPI; IPI00464480; -.
DR PIR; S36166; S36166.
DR RefSeq; NP_037133.1; NM_013001.2.
DR UniGene; Rn.89724; -.
DR ProteinModelPortal; P63016; -.
DR SMR; P63016; 4-136, 211-278.
DR STRING; P63016; -.
DR Ensembl; ENSRNOT00000005882; ENSRNOP00000005882; ENSRNOG00000004410.
DR Ensembl; ENSRNOT00000006302; ENSRNOP00000006302; ENSRNOG00000004410.
DR GeneID; 25509; -.
DR KEGG; rno:25509; -.
DR UCSC; RGD:3258; rat.
DR CTD; 5080; -.
DR RGD; 3258; Pax6.
DR eggNOG; NOG326044; -.
DR GeneTree; ENSGT00650000093130; -.
DR HOVERGEN; HBG009115; -.
DR KO; K08031; -.
DR GO; GO:0000790; C:nuclear chromatin; IDA:BHF-UCL.
DR GO; GO:0003680; F:AT DNA binding; IDA:RGD.
DR GO; GO:0003690; F:double-stranded DNA binding; IDA:RGD.
DR GO; GO:0000979; F:RNA polymerase II core promoter sequence-specific DNA binding; IC:BHF-UCL.
DR GO; GO:0000981; F:sequence-specific DNA binding RNA polymerase II transcription factor activity; IC:BHF-U
DR GO; GO:0004842; F:ubiquitin-protein ligase activity; ISS:UniProtKB.
DR GO; GO:0030902; P:hindbrain development; IDA:RGD.
DR GO; GO:0050768; P:negative regulation of neurogenesis; ISS:UniProtKB.
DR GO; GO:0001764; P:neuron migration; IMP:RGD.
DR GO; GO:0003322; P:pancreatic A cell development; IMP:BHF-UCL.
DR GO; GO:0042660; P:positive regulation of cell fate specification; IMP:RGD.
DR GO; GO:0045893; P:positive regulation of transcription, DNA-dependent; IC:BHF-UCL.
DR GO; GO:0050678; P:regulation of epithelial cell proliferation; IMP:RGD.
DR GO; GO:0045664; P:regulation of neuron differentiation; IDA:RGD.
DR Gene3D; G3DSA:1.10.10.60; Homeodomain-rel; 1.
DR Gene3D; G3DSA:1.10.10.10; Wing_hlx_DNA_bd; 2.
DR InterPro; IPR017970; Homeobox_CS.
DR InterPro; IPR001356; Homeodomain.
DR InterPro; IPR009057; Homeodomain-like.
DR InterPro; IPR001523; Paired_box_N.
DR InterPro; IPR011991; WHTH_trsnscrt_rep_DNA-bd.
DR Pfam; PF00046; Homeobox; 1.
DR Pfam; PF00292; PAX; 1.
DR PRINTS; PR00027; PAIREDBOX.
DR SMART; SM00389; HOX; 1.
DR SMART; SM00351; PAX; 1.
DR SUPFAM; SSF46689; Homeodomain_like; 2.
DR PROSITE; PS00027; HOMEOBOX_1; 1.
DR PROSITE; PS50071; HOMEOBOX_2; 1.
DR PROSITE; PS00034; PAIRED_1; 1.
DR PROSITE; PS51057; PAIRED_2; 1.
PE 2: Evidence at transcript level;
KW Alternative splicing; Complete proteome; Developmental protein;
KW Differentiation; DNA-binding; Homeobox; Nucleus; Paired box;
KW Reference proteome; Transcription; Transcription regulation;
KW Ubl conjugation.
FT CHAIN 1 422 Paired box protein Pax-6.
FT /FTId=PRO_0000050187.
FT DOMAIN 4 130 Paired.
FT DNA_BIND 210 269 Homeobox.
FT COMPBIAS 131 209 Gln/Gly-rich.
FT COMPBIAS 279 422 Pro/Ser/Thr-rich.
FT VAR_SEQ 47 47 Q -> QTHADAKVQVLDSEN (in isoform 5a).
FT /FTId=VSP_011531.
FT CONFLICT 159 159 R -> C (in Ref. 3; CAF29075).
FT CONFLICT 183 183 Q -> G (in Ref. 5; AAB32671).
SQ SEQUENCE 422 AA; 46754 MW; B0B2E5C176A518FE CRC64;
MQNSHSGVNQ LGGVFVNGRP LPDSTRQKIV ELAHSGARPC DISRILQVSN GCVSKILGRY
YETGSIRPRA IGGSKPRVAT PEVVSKIAQY KRECPSIFAW EIRDRLLSEG VCTNDNIPSV
SSINRVLRNL ASEKQQMGAD GMYDKLRMLN GQTGSWGTRP GWYPGTSVPG QPTQDGCQQQ
EGQGENTNSI SSNGEDSDEA QMRLQLKRKL QRNRTSFTQE QIEALEKEFE RTHYPDVFAR
ERLAAKIDLP EARIQVWFSN RRAKWRREEK LRNQRRQASN TPSHIPISSS FSTSVYQPIP
QPTTPVSSFT SGSMLGRTDT ALTNTYSALP PMPSFTMANN LPMQPPVPSQ TSSYSCMLPT
SPSVNGRSYD TYTPPHMQTH MNSQPMGTSG TTSTGLISPG VSVPVQVPGS EPDMSQYWPR
LQ
//
7. Functional Annotation
• “Annotation” is an overloaded term:
– Here we mean “high level”
• Knowledge associated with the data
• Aimed at the human reader
9. Swiss-Prot Entry
P26367 – PAX6_HUMAN
(Homo sapiens)
43 Sentences
11. TrEMBL Entry
A4PBK5 – A4PBK5_9METZ
(Ephydatia fluviatilis)
1 Sentence
12. Annotation Quality
• Annotation is highly variable
– E.g. automated vs. manual
• Current approaches rely upon specific database structure/features
– Ontology
– Evidence Codes
• Can we develop a metric based on free text?
13. Why UniProtKB?
• UniProtKB is well known and established
• A number of technical reasons:
– UniProtKB is composed of TrEMBL and Swiss-Prot
– Historical versions
– Cross-species
• Lack of gold standard
16. Investigating Word Occurrences
• Extract word occurrence from all annotation
1. Protein
2. Proteins
3. Chains
4. Chain
5. Sequence
6. Enzyme
7. Complex
17. Word Occurrences in Wikipedia
Taken from: http://en.wikipedia.org/wiki/File:Wikipedia-n-zipf.png
18. Zipf’s Principle of Least Effort
• Take word occurrences and apply to Zipf’s Principle of Least Effort
• Human nature to take path of least effort when achieving a goal
α value | Examples in literature | Least effort for
α < 1.6 | Advanced schizophrenia, young children | -
1.6 < α < 2 | Military combat texts, Wikipedia, web pages listed on the Open Directory Project | Annotator
α = 2 | Single-author texts | Equal
2 < α < 2.4 | Multi-author texts | Audience
α > 2.4 | Fragmented discourse schizophrenia | -
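The bands in the table above can be written as a small lookup. This is only an illustrative sketch: the function name `least_effort_for` and the tolerance parameter are hypothetical, and the table does not say how the boundary values 1.6 and 2.4 should be treated, so they are left unclassified here.

```python
import math


def least_effort_for(alpha: float, tol: float = 1e-9) -> str:
    """Map an exponent alpha onto the 'Least effort for' column of the table."""
    # alpha = 2: effort is shared equally (single-author texts).
    if math.isclose(alpha, 2.0, abs_tol=tol):
        return "Equal"
    # 1.6 < alpha < 2: least effort for the annotator (e.g. Wikipedia).
    if 1.6 < alpha < 2.0:
        return "Annotator"
    # 2 < alpha < 2.4: least effort for the audience (multi-author texts).
    if 2.0 < alpha < 2.4:
        return "Audience"
    # alpha < 1.6 or alpha > 2.4: the table names no party ("-").
    # Assumption of this sketch: the exact boundaries 1.6 and 2.4,
    # which the table does not cover, are also returned as "-".
    return "-"
```

For example, `least_effort_for(1.8)` falls in the band the table labels "Annotator".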
20. The Model & Resulting Graphs
• Power Law Distribution
• Logarithmic scales
• X-axis – Size (number of times a word occurs)
• Y-axis – Probability
• A point represents the probability that a word will occur X or more times
• E.g. upper left-most point:
– Probability a word occurs one or more times = 10^0 = 1
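The points on such a graph can be computed directly from a list of per-word occurrence counts. The sketch below is an assumption about how one might do this, not the authors' code; the function name `empirical_ccdf` and the toy counts are hypothetical.

```python
from collections import Counter


def empirical_ccdf(occurrences):
    """Return sorted (x, P(X >= x)) pairs for a list of per-word counts."""
    n = len(occurrences)
    freq = Counter(occurrences)      # number of words occurring exactly x times
    points = []
    remaining = n                    # number of words occurring x or more times
    for x in sorted(freq):
        points.append((x, remaining / n))
        remaining -= freq[x]
    return points


# Hypothetical toy data: five words occurring 1, 1, 2, 3 and 10 times.
points = empirical_ccdf([1, 1, 2, 3, 10])
# points == [(1, 1.0), (2, 0.6), (3, 0.4), (10, 0.2)]
```

Because every word in the corpus occurs at least once, the first point always has probability 1 (that is, 10^0), which is the upper left-most point on the log-log plot.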
21. Does UniProtKB obey a power-law?
• Broadly, yes. However, distinct structure?
22. The removal of copyright
• Development of two slopes
– As seen in mature resources
23. Quality of Biological Knowledge?
• How does automated annotation compare to manual annotation?
– i.e. TrEMBL vs. Swiss-Prot
• Assume Swiss-Prot acts as a more mature resource than TrEMBL
• Analyse this by comparing annotations at equivalent points in time
25. Viewing over time
• Show just alpha values
• Appears to be becoming optimised (least effort) for the annotator
26. Annotation Maturity
• Does this decrease happen because entries are, on average, getting older?
27. Annotation Maturity
• Want to abstract from size and analyse how individual records are maturing
• Essentially need a set of records which relate to a defined set of proteins
• Therefore extract entries common to both Swiss-Prot V9 and UniProtKB V15
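Selecting the entries common to two releases amounts to a set intersection. The sketch below assumes entries are matched by the first accession number on their first `AC` line and that this accession persists between releases; the slides do not say how entries were matched, and the function names are hypothetical.

```python
def primary_accessions(flat_file_text):
    """Collect the first accession of every entry in a flat-file dump."""
    accessions = set()
    seen_ac = False                  # have we read this entry's first AC line?
    for line in flat_file_text.splitlines():
        if line.startswith("//"):    # end-of-entry marker
            seen_ac = False
        elif line.startswith("AC ") and not seen_ac:
            # "AC P63016; A1A5N7; ..." -> "P63016"
            accessions.add(line[2:].split(";")[0].strip())
            seen_ac = True
    return accessions


def common_entries(old_release_text, new_release_text):
    """Accessions present in both releases (assumed stable over time)."""
    return primary_accessions(old_release_text) & primary_accessions(new_release_text)
```

Replacing the intersection with a set difference would give the complementary cohort, that is, entries present in one release but not the other.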
29. Analysing new annotations
• α values for mature entries are decreasing
• How are new annotations impacted?
• Take annotations from entries that appear for the first time in a given database version
30. The impact of new annotations
31. Explanation for the decrease?
• Annotation curation involves identifying similar entries
• Annotations between these entries are standardised
• Is this standardisation changing the way entries are annotated?
– Subsequently placing the least effort onto the annotator?
32. Conclusions
• Approach acting as a quality measure
– Detection of artefacts
– Distinction between TrEMBL and Swiss-Prot
• Annotations in UniProtKB are becoming optimised for the annotator rather than the reader
– Constant increase of data & pressure on curators
– Also true for existing and new annotations
33. Summary
• The biological community lacks a generic quality metric that allows biological annotation to be quantitatively assessed and compared.
• Here we investigated word reuse within bulk textual annotation and related it to Zipf's Principle of Least Effort.
• Straightforward approach once the data are extracted
• Holds promise of being useful for curators and end users
34. Thank You!
Colin Gillespie, Daniel Swan and Phillip Lord
Many thanks go to:
Allyson Lister (1)
Daniel Barrell (2)
UniProt Helpdesk
(1) Newcastle University, UK
(2) EBI
Michael Bell
m.j.bell1@ncl.ac.uk
www.michaeljbell.co.uk
Editor's notes
For example, this is an analysis of Wikipedia. We find that word occurrence, plotted against word rank, broadly obeys a power law. Taken from: http://en.wikipedia.org/wiki/File:Wikipedia-n-zipf.png
We can relate these power laws to Zipf's Principle of Least Effort. This states that... (point about reader and author). Different texts resolve this in different ways. By taking the exponent of the regression line, alpha, we can see that in Wikipedia the least effort is placed on the curator.
The first step of our approach is to extract the necessary data from UniProtKB. Our extraction process consists of 4 key steps. Firstly we obtain each version of Swiss-Prot and TrEMBL and then extract just those lines that hold comments. We then extract all the words from this data, and remove topic and block headings. We can then count how frequently each word occurs, with the output being a list of all words and their occurrence.
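The extraction steps described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' pipeline: it treats lines beginning with the `CC` line code as comment lines, strips topic headings of the form `-!- FUNCTION:`, lower-cases words, and does not attempt to remove block headings. The regular expressions and function names are hypothetical.

```python
import re
from collections import Counter

# Topic heading at the start of a comment, e.g. "-!- SUBCELLULAR LOCATION:".
TOPIC_RE = re.compile(r"^-!-\s+[A-Z][A-Z /-]*:")
# A word: letters/digits, optionally joined by hyphens or apostrophes.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")


def comment_lines(flat_file_text):
    """Yield the content of comment (CC) lines, without the line code."""
    for line in flat_file_text.splitlines():
        if line.startswith("CC "):
            yield line[2:].strip()


def word_occurrences(flat_file_text):
    """Count how often each word occurs across all comment lines."""
    counts = Counter()
    for text in comment_lines(flat_file_text):
        text = TOPIC_RE.sub("", text)    # drop the topic heading, keep the prose
        # Lower-casing is an assumption of this sketch.
        counts.update(word.lower() for word in WORD_RE.findall(text))
    return counts
```

Running `word_occurrences` over one release gives the list of words and their occurrence counts that the later power-law analysis takes as input.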
We can then apply a power law distribution to this data. The result is a graph, as shown here. The graph is represented as a cumulative distribution function and is shown on logarithmic scales. Along the X axis we have the size of a word, that is, how frequently it occurs, whilst along the Y axis we have the probability of a word occurring X or more times. This graph isn’t straightforward, so as an example, the top left point shows that the probability a word occurs at least once is 1, as only words that occur within the corpus are used. Conversely, the point at the bottom right shows that the probability of a word occurring over 100,000 times is very small. We can now apply this approach, initially to Swiss-Prot.
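One way to obtain an exponent from the occurrence counts is sketched below. The notes do not specify the fitting procedure beyond describing alpha as the exponent of the regression line, so this swaps in a commonly used approximate maximum-likelihood estimator for discrete power laws (the approximation given by Clauset, Shalizi and Newman, 2009); the function name and the `x_min` default are assumptions.

```python
import math


def estimate_alpha(occurrences, x_min=1):
    """Approximate MLE of the power-law exponent for counts >= x_min.

    Uses alpha ~= 1 + n / sum(ln(x_i / (x_min - 0.5))), an approximation
    that is known to be less accurate when x_min is small.
    """
    if x_min < 1:
        raise ValueError("x_min must be at least 1")
    tail = [x for x in occurrences if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum
```

The value returned corresponds to the exponent of the probability distribution; on a cumulative plot like the one described above, the slope of the straight section would be -(alpha - 1).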
The first question to ask is: does Swiss-Prot obey a power law? It does broadly appear to, yes. However, there is a distinct structure, or kink, in the tail of the graph in a number of versions. So the question here is, what is this kink?
A copyright statement was added to every entry in a version, so we see these statements here. (To do: sort out graphs.) The kink turns out to be the copyright statements. This shows that, using this approach, we can detect the introduction of data with no biological significance. It also shows that our approach is acting as a measure of quality, albeit by detecting artifacts.
Although Swiss-Prot obeys a power law, does this relate to the quality of biological knowledge? One way we can address this question is to compare automated and manual annotation. As shown previously, we can make the assumption that Swiss-Prot is a more mature resource than TrEMBL. So, by analysing annotations at similar points in time in the two resources, we would expect Swiss-Prot to behave as the more mature resource.
By overlaying the graphs for TrEMBL and Swiss-Prot we can more clearly see how they mature over time. It is clear from this slide that they appear to diverge over time, with TrEMBL showing higher levels of re-use and Swiss-Prot showing a richer use of vocabulary. So it does indeed suggest our approach is acting as a measure of quality. However, the main analytical value of these graphs comes from the alpha value.
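As a rough illustration of how an alpha value can be obtained from word counts, the sketch below uses the standard approximate maximum-likelihood estimator for a discrete power law, alpha ≈ 1 + n / Σ ln(x_i / (xmin − 0.5)). This is a swapped-in technique for illustration only: the final note states that the fit in this work followed a Bayesian paradigm. The function name estimate_alpha is hypothetical; the default xmin of 50 is the value quoted in the final note.

```python
import math


def estimate_alpha(counts, xmin=50):
    """Approximate MLE of the power-law exponent for counts >= xmin.

    Not the Bayesian fit used in the talk; a common approximation that is
    reasonable when xmin is not too small.
    """
    tail = [x for x in counts if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Toy data: a heavier tail (more very frequent words) gives a smaller alpha.
light_tail = [50, 55, 60, 70, 80]
heavy_tail = [50, 500, 5000, 50000, 500000]
print(estimate_alpha(light_tail), estimate_alpha(heavy_tail))
```

Under this estimator, a corpus in which a few words are reused very heavily yields a lower alpha than one whose word counts are more evenly spread.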
So we can show the alpha value over time for both Swiss-Prot and TrEMBL. This also provides a clearer picture of effort over time. We can see how both databases show a decrease over time (that is, they appear to be moving towards least effort for the annotator), although this progression is much more irregular in TrEMBL. This view shows two major disjuncts in TrEMBL, which appear to coincide with changes to the underlying annotation process in TrEMBL. One possible explanation for this decrease is the age of entries.
So we ask whether the decrease happens because entries are, on average, getting older as the database gets older. This isn't the case, however: the average age of entries is actually decreasing over time. This is mainly due to new records being added exponentially, outnumbering the old records.
So here we want to abstract from the size of the database and ask: how are individual records maturing? This isn't straightforward; essentially we need a set of records which relate to a defined set of proteins. Therefore, we extract those entries that are common to SWP 9 and UPSP 15, providing a span of over 20 years.
Again, highlight the slope of the graph we are looking at, and that we are looking at the subsets. As with the database as a whole, we again see a decrease. However, the value isn't as low as for the remainder of the database.
Reiterate the earlier question, and that this is another way to look at it. All annotations are approximately the same age, as they are new.
As we see a decrease in the mature entries, how do the new entries fare? Similarly, they decrease over time, which is the same pattern as in all the other graphs. Why do we see all of these decreases...?
The protocol was recently published (2011), which again shows the advantage of using UniProtKB: it is well documented. We need to be careful not to say “standardisation == poor quality”; standardisation is something that can explain the decrease, rather than a definitive answer. It is more likely that trying to be consistent has lost some of the more “personal” annotations in entries, which have thus become more generic. (Note to self: too wordy.)
Try to finish on a high here. Give a very quick and brief recap of the main idea and points, and of how the approach is “easy” and could be useful for curators and end users alike.
Word Cloud is from UniProtKB/Swiss-Prot Version 15
A number of competing models were considered. However, the power-law distribution gives a good balance between model parsimony and fit. We only deal with the discrete power-law distribution here, which has the probability mass function described. To fit the power-law distribution we followed a Bayesian paradigm. Xmin, determined using the BIC, was set to 50 throughout.
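For reference, the standard form of the discrete power-law probability mass function is given below. It is assumed here, not confirmed by the notes, that this is the form shown on the slide; the symbols α (the exponent) and x_min (the lower cut-off, 50 in this work) follow the usage above.

```latex
p(x) = \Pr(X = x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})},
\qquad x = x_{\min},\, x_{\min}+1,\, \dots,
\qquad
\zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha}
```

Here ζ(α, x_min) is the Hurwitz zeta function, which normalises the probabilities so that they sum to 1 over all x ≥ x_min.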