An approach to describe and analyse bulk annotation quality
Michael J Bell*, Colin Gillespie, Daniel Swan and Phillip Lord
*m.j.bell1@ncl.ac.uk
www.michaeljbell.co.uk
Talk Outline
•    Annotation Quality? Why UniProtKB?
•    Data extraction
•    Applying power laws
•    Analysing Swiss-Prot and TrEMBL annotation
•    Discussion and Conclusion
•    Questions


Michael J Bell @mj_bell
Newcastle University
m.j.bell1@ncl.ac.uk
Annotation Quality in UniProtKB




ID   PAX6_RAT             Reviewed;       422 AA.
AC   P63016; A1A5N7; P32117; P70601; Q62222; Q64037; Q6QHS5; Q701Q8;
DT   31-AUG-2004, integrated into UniProtKB/Swiss-Prot.
DT   31-AUG-2004, sequence version 1.
DT   11-JUL-2012, entry version 74.
DE   RecName: Full=Paired box protein Pax-6;
DE   AltName: Full=Oculorhombin;
GN   Name=Pax6; Synonyms=Pax-6, Sey;
OS   Rattus norvegicus (Rat).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi;
OC   Muroidea; Muridae; Murinae; Rattus.
OX   NCBI_TaxID=10116;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 1).
RA   Gimlich R., Arnold G.S., Wawersik S., Maas R., Wong G.;
RT   "Pax-6 is required for pancreatic islet development.";
RL   Submitted (SEP-1996) to the EMBL/GenBank/DDBJ databases.
RN   [2]
RP   NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 5A).
RC   STRAIN=New England Deaconess Hospital, and Sprague-Dawley;
RA   Karkour A., Wolf G.M., Walther R.;
RL   Submitted (FEB-2004) to the EMBL/GenBank/DDBJ databases.
RN   [3]
RP   NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 5A).
RC   STRAIN=Sprague-Dawley; TISSUE=Brain;
RA   Wei F.;
RT   "Cloning the homologic isoform gene pax6 5a in the rat.";
RL   Submitted (FEB-2004) to the EMBL/GenBank/DDBJ databases.
RN   [4]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA] (ISOFORM 1).
RC   TISSUE=Heart;
RX   PubMed=15489334; DOI=10.1101/gr.2596504;
RG   The MGC Project Team;
RT   "The status, quality, and expansion of the NIH full-length cDNA
RT   project: the Mammalian Gene Collection (MGC).";
RL   Genome Res. 14:2121-2127(2004).
RN   [5]
RP   PARTIAL NUCLEOTIDE SEQUENCE [MRNA], AND INVOLVEMENT IN SEY.
RC   STRAIN=Sprague-Dawley; TISSUE=Embryo;
RX   MEDLINE=95072652; PubMed=7981749; DOI=10.1038/ng0493-299;
RA   Matsuo T., Osumi-Yamashita N., Noji S., Ohuchi H., Koyama E.,
RA   Myokai F., Matsuo N., Taniguchi S., Doi H., Iseki S., Ninomiya Y.,
RA   Fujiwara M., Wantanabe T., Eto K.;
RT   "A mutation in the Pax-6 gene in rat small eye is associated with
RT   impaired migration of midbrain crest cells.";
RL   Nat. Genet. 3:299-304(1993).
RN   [6]
RP   FUNCTION.
RX   MEDLINE=21869997; PubMed=11880342;
RA   Takahashi M., Osumi N.;
RT   "Pax6 regulates specification of ventral neurone subtypes in the
RT   hindbrain by establishing progenitor domains.";
RL   Development 129:1327-1338(2002).
CC   -!- FUNCTION: Transcription factor with important functions in the
CC      development of the eye, nose, central nervous system and pancreas.
CC      Required for the differentiation of pancreatic islet alpha cells.
CC      Competes with PAX4 in binding to a common element in the glucagon,
CC      insulin and somatostatin promoters (By similarity). Regulates
CC      specification of the ventral neuron subtypes by establishing the
CC      correct progenitor domains.
CC   -!- SUBUNIT: Interacts with MAF and MAFB (By similarity). Interacts
CC      with TRIM11; this interaction leads to ubiquitination and
CC      proteasomal degradation, as well as inhibition of transactivation,
CC      possibly in part by preventing PAX6 binding to consensus DNA
CC      sequences (By similarity).
CC   -!- SUBCELLULAR LOCATION: Nucleus (By similarity).
CC   -!- ALTERNATIVE PRODUCTS:
CC      Event=Alternative splicing; Named isoforms=2;
CC      Name=1;
CC        IsoId=P63016-1; Sequence=Displayed;
CC      Name=5a; Synonyms=Pax6-5a;
CC        IsoId=P63016-2; Sequence=VSP_011531;
CC   -!- PTM: Ubiquitinated by TRIM11, leading to ubiquitination and
CC      proteasomal degradation (By similarity).
CC   -!- DISEASE: Note=Defects in Pax6 are the cause of a condition known
CC      as small eye (Sey) which results in the complete lack of eyes and
CC      nasal primordia.
CC   -!- SIMILARITY: Belongs to the paired homeobox family.
CC   -!- SIMILARITY: Contains 1 homeobox DNA-binding domain.
CC   -!- SIMILARITY: Contains 1 paired domain.
CC   -----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   -----------------------------------------------------------------------
DR   EMBL; U69644; AAB09042.1; -; mRNA.
DR   EMBL; AY540905; AAS48919.1; -; mRNA.
DR   EMBL; AY540906; AAS48920.1; -; mRNA.
DR   EMBL; AJ627631; CAF29075.1; -; mRNA.
DR   EMBL; BC128741; AAI28742.1; -; mRNA.
DR   EMBL; S74393; AAB32671.1; ALT_TERM; mRNA.
DR   IPI; IPI00231698; -.
DR   IPI; IPI00464480; -.
DR   PIR; S36166; S36166.
DR   RefSeq; NP_037133.1; NM_013001.2.
DR   UniGene; Rn.89724; -.
DR   ProteinModelPortal; P63016; -.
DR   SMR; P63016; 4-136, 211-278.
DR   STRING; P63016; -.
DR   Ensembl; ENSRNOT00000005882; ENSRNOP00000005882; ENSRNOG00000004410.
DR   Ensembl; ENSRNOT00000006302; ENSRNOP00000006302; ENSRNOG00000004410.
DR   GeneID; 25509; -.
DR   KEGG; rno:25509; -.
DR   UCSC; RGD:3258; rat.
DR   CTD; 5080; -.
DR   RGD; 3258; Pax6.
DR   eggNOG; NOG326044; -.
DR   GeneTree; ENSGT00650000093130; -.
DR   HOVERGEN; HBG009115; -.
DR   KO; K08031; -.
DR   GO; GO:0000790; C:nuclear chromatin; IDA:BHF-UCL.
DR   GO; GO:0003680; F:AT DNA binding; IDA:RGD.
DR   GO; GO:0003690; F:double-stranded DNA binding; IDA:RGD.
DR   GO; GO:0000979; F:RNA polymerase II core promoter sequence-specific DNA binding; IC:BHF-UCL.
DR   GO; GO:0000981; F:sequence-specific DNA binding RNA polymerase II transcription factor activity; IC:BHF-U
DR   GO; GO:0004842; F:ubiquitin-protein ligase activity; ISS:UniProtKB.
DR   GO; GO:0030902; P:hindbrain development; IDA:RGD.
DR   GO; GO:0050768; P:negative regulation of neurogenesis; ISS:UniProtKB.
DR   GO; GO:0001764; P:neuron migration; IMP:RGD.
DR   GO; GO:0003322; P:pancreatic A cell development; IMP:BHF-UCL.
DR   GO; GO:0042660; P:positive regulation of cell fate specification; IMP:RGD.
DR   GO; GO:0045893; P:positive regulation of transcription, DNA-dependent; IC:BHF-UCL.
DR   GO; GO:0050678; P:regulation of epithelial cell proliferation; IMP:RGD.
DR   GO; GO:0045664; P:regulation of neuron differentiation; IDA:RGD.
DR   Gene3D; G3DSA:1.10.10.60; Homeodomain-rel; 1.
DR   Gene3D; G3DSA:1.10.10.10; Wing_hlx_DNA_bd; 2.
DR   InterPro; IPR017970; Homeobox_CS.
DR   InterPro; IPR001356; Homeodomain.
DR   InterPro; IPR009057; Homeodomain-like.
DR   InterPro; IPR001523; Paired_box_N.
DR   InterPro; IPR011991; WHTH_trsnscrt_rep_DNA-bd.
DR   Pfam; PF00046; Homeobox; 1.
DR   Pfam; PF00292; PAX; 1.
DR   PRINTS; PR00027; PAIREDBOX.
DR   SMART; SM00389; HOX; 1.
DR   SMART; SM00351; PAX; 1.
DR   SUPFAM; SSF46689; Homeodomain_like; 2.
DR   PROSITE; PS00027; HOMEOBOX_1; 1.
DR   PROSITE; PS50071; HOMEOBOX_2; 1.
DR   PROSITE; PS00034; PAIRED_1; 1.
DR   PROSITE; PS51057; PAIRED_2; 1.
PE   2: Evidence at transcript level;
KW   Alternative splicing; Complete proteome; Developmental protein;
KW   Differentiation; DNA-binding; Homeobox; Nucleus; Paired box;
KW   Reference proteome; Transcription; Transcription regulation;
KW   Ubl conjugation.
FT   CHAIN       1 422 Paired box protein Pax-6.
FT                        /FTId=PRO_0000050187.
FT   DOMAIN        4 130 Paired.
FT   DNA_BIND 210 269 Homeobox.
FT   COMPBIAS 131 209 Gln/Gly-rich.
FT   COMPBIAS 279 422 Pro/Ser/Thr-rich.
FT   VAR_SEQ 47 47 Q -> QTHADAKVQVLDSEN (in isoform 5a).
FT                        /FTId=VSP_011531.
FT   CONFLICT 159 159 R -> C (in Ref. 3; CAF29075).
FT   CONFLICT 183 183 Q -> G (in Ref. 5; AAB32671).
SQ   SEQUENCE 422 AA; 46754 MW; B0B2E5C176A518FE CRC64;
     MQNSHSGVNQ LGGVFVNGRP LPDSTRQKIV ELAHSGARPC DISRILQVSN GCVSKILGRY
     YETGSIRPRA IGGSKPRVAT PEVVSKIAQY KRECPSIFAW EIRDRLLSEG VCTNDNIPSV
     SSINRVLRNL ASEKQQMGAD GMYDKLRMLN GQTGSWGTRP GWYPGTSVPG QPTQDGCQQQ
     EGQGENTNSI SSNGEDSDEA QMRLQLKRKL QRNRTSFTQE QIEALEKEFE RTHYPDVFAR
     ERLAAKIDLP EARIQVWFSN RRAKWRREEK LRNQRRQASN TPSHIPISSS FSTSVYQPIP
     QPTTPVSSFT SGSMLGRTDT ALTNTYSALP PMPSFTMANN LPMQPPVPSQ TSSYSCMLPT
     SPSVNGRSYD TYTPPHMQTH MNSQPMGTSG TTSTGLISPG VSVPVQVPGS EPDMSQYWPR
     LQ
//
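
The free-text annotation in an entry like this sits in the CC (comment) lines. As a rough illustration of the kind of data extraction the talk outline refers to, the sketch below pulls the comment blocks out of a flat-file entry and counts the words in them. It is only a minimal sketch and not the authors' pipeline: it assumes the entry is available as an ordinary single-column flat file with one record line per text line, and the names `ENTRY`, `comment_blocks` and `word_counts` are hypothetical, introduced here for the example. `ENTRY` is a short excerpt of the PAX6_RAT record shown above.

```python
import re
from collections import Counter

# Excerpt of the PAX6_RAT entry shown above (assumed single-column layout).
ENTRY = """\
ID   PAX6_RAT             Reviewed;       422 AA.
AC   P63016; A1A5N7; P32117; P70601; Q62222; Q64037; Q6QHS5; Q701Q8;
CC   -!- PTM: Ubiquitinated by TRIM11, leading to ubiquitination and
CC      proteasomal degradation (By similarity).
CC   -!- SUBCELLULAR LOCATION: Nucleus (By similarity).
CC   -!- SIMILARITY: Belongs to the paired homeobox family.
CC   -!- SIMILARITY: Contains 1 homeobox DNA-binding domain.
CC   -!- SIMILARITY: Contains 1 paired domain.
CC   -----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   -----------------------------------------------------------------------
//
"""


def comment_blocks(entry):
    """Return (topic, text) pairs, one per 'CC   -!-' block in the entry."""
    blocks = []
    for line in entry.splitlines():
        if not line.startswith("CC   "):
            continue  # not a comment line
        body = line[5:]
        if body.startswith("-!- "):
            # A new block: 'TOPIC: first line of text'
            topic, _, text = body[4:].partition(":")
            blocks.append([topic.strip(), text.strip()])
        elif body.startswith("---"):
            # The dashed rule opens the copyright banner, which is not annotation.
            break
        elif blocks:
            # Continuation line of the current block.
            blocks[-1][1] += " " + body.strip()
    return [(topic, text) for topic, text in blocks]


def word_counts(blocks):
    """Count lower-cased word occurrences across all comment texts."""
    counts = Counter()
    for _topic, text in blocks:
        counts.update(re.findall(r"[a-z0-9-]+", text.lower()))
    return counts


if __name__ == "__main__":
    blocks = comment_blocks(ENTRY)
    for topic, text in blocks:
        print(f"{topic}: {text}")
    print(word_counts(blocks).most_common(5))
```

Word counts of this kind, gathered over a whole database release, are one plausible input for the power-law analysis listed in the outline. The copyright banner is skipped deliberately, since it is identical in every entry and would otherwise dominate the counts.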
RN    [6]                                                                     DR   GeneID; 25509; -.                                                              MQNSHSGVNQ LGGVFVNGRP LPDSTRQKIV ELAHSGARPC DISRILQVSN GCVSKILGRY
RP   FUNCTION.                                                                DR   KEGG; rno:25509; -.                                                            YETGSIRPRA IGGSKPRVAT PEVVSKIAQY KRECPSIFAW EIRDRLLSEG VCTNDNIPSV
RX   MEDLINE=21869997; PubMed=11880342;                                       DR   UCSC; RGD:3258; rat.                                                           SSINRVLRNL ASEKQQMGAD GMYDKLRMLN GQTGSWGTRP GWYPGTSVPG QPTQDGCQQQ
RA   Takahashi M., Osumi N.;                                                  DR   CTD; 5080; -.                                                                  EGQGENTNSI SSNGEDSDEA QMRLQLKRKL QRNRTSFTQE QIEALEKEFE RTHYPDVFAR
RT   "Pax6 regulates specification of ventral neurone subtypes in the         DR   RGD; 3258; Pax6.                                                               ERLAAKIDLP EARIQVWFSN RRAKWRREEK LRNQRRQASN TPSHIPISSS FSTSVYQPIP
RT   hindbrain by establishing progenitor domains.";                          DR   eggNOG; NOG326044; -.                                                          QPTTPVSSFT SGSMLGRTDT ALTNTYSALP PMPSFTMANN LPMQPPVPSQ TSSYSCMLPT
RL   Development 129:1327-1338(2002).                                         DR   GeneTree; ENSGT00650000093130; -.                                              SPSVNGRSYD TYTPPHMQTH MNSQPMGTSG TTSTGLISPG VSVPVQVPGS EPDMSQYWPR
                                                                              DR   HOVERGEN; HBG009115; -.                                                        LQ
                                                                              DR   KO; K08031; -.                                                            //
Functional Annotation
• The term “annotation” is overloaded:
       – Here we mean “high-level” annotation
              • Knowledge associated with the data
              • Aimed at the human reader




Swiss-Prot Entry
                          P26367 – PAX6_HUMAN
                             (Homo sapiens)
                              43 Sentences



TrEMBL Entry
                          A4PBK5 – A4PBK5_9METZ
                            (Ephydatia fluviatilis)
                                1 Sentence



Annotation Quality
• Annotation is highly variable
       – E.g. automated vs. manual
• Current approaches rely upon specific
  database structure/features
       – Ontology
       – Evidence Codes
• Can we develop a metric based on free text?


Why UniProtKB?
• UniProtKB is well known and established
• A number of technical reasons:
       – UniProtKB is composed of TrEMBL and Swiss-Prot
       – Historical versions are available
       – Covers many species
• Lack of gold standard



Applying Power Laws




Investigating Word Occurrences
• Extract word occurrence from all annotation




       1. Protein
       2. Proteins
       3. Chains
       4. Chain
       5. Sequence
       6. Enzyme
       7. Complex
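
The word-occurrence extraction above can be sketched in a few lines. This is only an illustration and not the authors' pipeline: it assumes a UniProtKB flat file in which free-text annotation sits on lines starting with `CC`, it skips the dashed copyright block, and it drops the `-!- TOPIC:` prefix. The function name `count_comment_words` and the tokenising regular expression are hypothetical choices.

```python
import re
from collections import Counter

# Assumption: a "word" is a letter followed by letters, digits, ' or -.
WORD_RE = re.compile(r"[A-Za-z][A-Za-z0-9'-]*")


def count_comment_words(lines):
    """Count lower-cased words on the CC (comment) lines of a flat file."""
    counts = Counter()
    in_copyright = False
    for line in lines:
        if not line.startswith("CC"):
            continue  # only comment lines carry free-text annotation
        text = line[2:].strip()
        if text.startswith("-----"):
            # Dashed rules open and close the copyright block.
            in_copyright = not in_copyright
            continue
        if in_copyright:
            continue
        if text.startswith("-!-"):
            # Drop the block marker and the topic name (e.g. "FUNCTION:").
            text = text[3:].strip()
            _topic, sep, rest = text.partition(":")
            if sep:
                text = rest
        counts.update(word.lower() for word in WORD_RE.findall(text))
    return counts
```

Calling `count_comment_words(open(path))` on a flat file (with `path` supplied by the reader) returns a `Counter`, and `counts.most_common(7)` gives a ranked list comparable to the one above.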




Word Occurrences in Wikipedia




                                     Taken from: http://en.wikipedia.org/wiki/File:Wikipedia-n-zipf.png

Zipf’s Principle of Least Effort
• Take word occurrences and relate them to Zipf’s
  Principle of Least Effort
• It is human nature to take the path of least effort
  when achieving a goal
 α value        Examples in literature                             Least effort for
 α < 1.6        Advanced schizophrenia, young children             -
 1.6 < α < 2    Military combat texts, Wikipedia, web pages        Annotator
                listed on the Open Directory Project
 α = 2          Single-author texts                                Equal
 2 < α < 2.4    Multi-author texts                                 Audience
 α > 2.4        Fragmented discourse schizophrenia                 -
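
To make the α values in the table concrete, here is a minimal sketch of estimating α from a list of word-occurrence counts and reading the result against the table. It is not the fitting procedure used in the talk, which is not specified here. Instead it swaps in the widely used closed-form approximation to the discrete power-law maximum-likelihood estimate, α ≈ 1 + n / Σ ln(x_i / (x_min − 0.5)), which is known to be rough when `xmin` is small. The names `estimate_alpha` and `least_effort_for` are hypothetical.

```python
import math


def estimate_alpha(occurrences, xmin=1):
    """Approximate the power-law exponent for counts >= xmin.

    Uses the closed-form approximation to the discrete maximum-likelihood
    estimate; a numerical fit is more accurate, especially for small xmin.
    """
    tail = [x for x in occurrences if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


def least_effort_for(alpha):
    """Map an alpha value to the 'least effort for' column of the table.

    Returns None outside the labelled ranges, including the boundary
    values 1.6 and 2.4, which the table leaves unassigned.
    """
    if 1.6 < alpha < 2:
        return "annotator"
    if alpha == 2:
        return "equal"
    if 2 < alpha < 2.4:
        return "audience"
    return None
```

For example, `least_effort_for(estimate_alpha(counts.values()))` would classify a `Counter` of word occurrences, assuming the approximation is acceptable for the data at hand.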


Data Extraction




The Model & Resulting Graphs
• Power Law Distribution
• Logarithmic scales
• X-axis – Size (number of times a word occurs)
• Y-axis – Probability
• A point represents the probability that a word will occur X or more times
• E.g. upper-left-most point:
       – Probability that a word occurs one or more times = 10^0 = 1
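
The points on such a graph can be computed directly from the word counts. The sketch below is an illustration with a hypothetical function name, `empirical_ccdf`. For each distinct occurrence count x it returns the fraction of words that occur x or more times, which is what each plotted point represents before the axes are put on logarithmic scales.

```python
from collections import Counter


def empirical_ccdf(occurrences):
    """Return (x, P(X >= x)) pairs for each distinct count x, ascending."""
    total = len(occurrences)
    how_many_words = Counter(occurrences)  # count value -> number of words
    points = []
    remaining = total  # words with a count of at least the current x
    for x in sorted(how_many_words):
        points.append((x, remaining / total))
        remaining -= how_many_words[x]
    return points
```

Every word in the data occurs at least as often as the smallest count, so the first point always has probability 1 (10^0), matching the upper-left-most point described above.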
Does UniProtKB obey a power-law?
• Broadly, yes. However, distinct structure?




The removal of copyright statements
• Development of two slopes
       – As seen in mature resources




Quality of Biological Knowledge?
• How does automated annotation compare to
  manual annotation?
       – i.e. TrEMBL vs. Swiss-Prot
• Assume Swiss-Prot acts as a more mature
  resource than TrEMBL
• Analyse this by comparing annotations at
  equivalent points in time


Viewing over time




Viewing over time
• Show just the α values
• Annotation appears to be becoming optimised (least effort) for the annotator


Annotation Maturity
• Does this decrease happen because entries
  are, on average, getting older?




Annotation Maturity
• Want to abstract from size and analyse how
  individual records are maturing
• Need essentially a set of records which relate
  to a defined set of proteins
• Therefore extract entries common in both
  Swiss-Prot V9 and UniProtKB V15
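
One way to obtain such a fixed set of records is to intersect the accession numbers found in the two releases. The sketch below is an assumption about how this could be done, not a description of the authors' method. It takes the first accession on the first `AC` line of each entry as the primary accession and treats `//` as the entry terminator, as in current UniProtKB flat files. Older releases may be formatted differently, and accessions that were merged or replaced between releases are not handled. The function names are hypothetical.

```python
def primary_accessions(lines):
    """Collect the primary accession of every entry in a flat file."""
    accessions = set()
    expecting_primary = True
    for line in lines:
        if line.startswith("AC ") and expecting_primary:
            # The first identifier on the first AC line is the primary one.
            first = line[2:].strip().split(";")[0].strip()
            if first:
                accessions.add(first)
            expecting_primary = False
        elif line.startswith("//"):
            expecting_primary = True  # the next entry starts after this line
    return accessions


def common_entries(old_release_lines, new_release_lines):
    """Accessions present in both releases."""
    return primary_accessions(old_release_lines) & primary_accessions(new_release_lines)
```

The annotation of the resulting accessions can then be extracted from each release separately, so that any change in α reflects how the same records matured rather than growth of the database.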



Annotation maturity




Analysing new annotations
• α values for mature entries are decreasing
• How are new annotations impacted?
• Take annotations from entries that appear for
  the first time in a given database version
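
Selecting entries that appear for the first time in a given version amounts to a running set difference over the releases in chronological order. A minimal sketch follows, assuming each release has already been reduced to a set of accession numbers. The function name `first_appearances` and the input shape are hypothetical.

```python
def first_appearances(releases):
    """Map each release name to the accessions not seen in any earlier one.

    `releases` is an iterable of (name, accessions) pairs, oldest first.
    """
    seen = set()
    new_by_release = {}
    for name, accessions in releases:
        current = set(accessions)
        new_by_release[name] = current - seen
        seen |= current
    return new_by_release
```

Note that under this definition every entry in the oldest release counts as new, so that release may need to be excluded from a comparison.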




The impact of new annotations




Explanation for the decrease?
• Annotation curation involves identifying
  similar entries
• Annotations between these entries are
  standardised
• Is this standardisation changing the way
  entries are annotated?
       – Subsequently placing the least effort onto the
         annotator?

Conclusions
• Approach acting as a quality measure
       – Detection of artefacts
       – Distinction between TrEMBL and Swiss-Prot
• Annotations in UniProtKB are becoming
  optimised for the annotator rather than the
  reader
       – Constant increase of data & pressure on curators
       – Also true for existing and new annotations

Summary
• The biological community lacks a generic quality
  metric that allows biological annotation to be
  quantitatively assessed and compared.
• Here we investigated word reuse within bulk
  textual annotation and related it to Zipf's
  Principle of Least Effort.
• Straightforward approach once the data are extracted
• Holds promise of being useful for curators and
  end users
Thank You!

Michael Bell, Colin Gillespie, Daniel Swan and Phillip Lord
m.j.bell1@ncl.ac.uk
www.michaeljbell.co.uk

Many thanks go to:
  Allyson Lister (1)
  Daniel Barrell (2)
  UniProt Helpdesk
(1) Newcastle University, UK
(2) EBI

Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 

An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB

  • 1. An approach to describe and analyse bulk annotation quality Michael J Bell*, Colin Gillespie, Daniel Swan and Phillip Lord *m.j.bell1@ncl.ac.uk www.michaeljbell.co.uk
  • 2. Talk Outline • Annotation Quality? Why UniProtKB? • Data extraction • Applying power laws • Analysing Swiss-Prot and TrEMBL annotation • Discussion and Conclusion • Questions Michael J Bell @mj_bell Newcastle University 2 m.j.bell1@ncl.ac.uk
  • 3. Annotation Quality in UniProtKB
  • 4. ID PAX6_RAT Reviewed; 422 AA. CC -!- FUNCTION: Transcription factor with important functions in the DR GO; GO:0000790; C:nuclear chromatin; IDA:BHF-UCL. AC P63016; A1A5N7; P32117; P70601; Q62222; Q64037; Q6QHS5; Q701Q8; CC development of the eye, nose, central nervous system and pancreas. DR GO; GO:0003680; F:AT DNA binding; IDA:RGD. DT 31-AUG-2004, integrated into UniProtKB/Swiss-Prot. CC Required for the differentiation of pancreatic islet alpha cells. DR GO; GO:0003690; F:double-stranded DNA binding; IDA:RGD. DT 31-AUG-2004, sequence version 1. CC Competes with PAX4 in binding to a common element in the glucagon, DR GO; GO:0000979; F:RNA polymerase II core promoter sequence-specific DNA binding; IC:BHF-UCL. DT 11-JUL-2012, entry version 74. CC insulin and somatostatin promoters (By similarity). Regulates DR GO; GO:0000981; F:sequence-specific DNA binding RNA polymerase II transcription factor activity; IC:BHF-U DE RecName: Full=Paired box protein Pax-6; CC specification of the ventral neuron subtypes by establishing the DR GO; GO:0004842; F:ubiquitin-protein ligase activity; ISS:UniProtKB. DE AltName: Full=Oculorhombin; CC correct progenitor domains. DR GO; GO:0030902; P:hindbrain development; IDA:RGD. GN Name=Pax6; Synonyms=Pax-6, Sey; CC -!- SUBUNIT: Interacts with MAF and MAFB (By similarity). Interacts DR GO; GO:0050768; P:negative regulation of neurogenesis; ISS:UniProtKB. OS Rattus norvegicus (Rat). CC with TRIM11; this interaction leads to ubiquitination and DR GO; GO:0001764; P:neuron migration; IMP:RGD. OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; CC proteasomal degradation, as well as inhibition of transactivation, DR GO; GO:0003322; P:pancreatic A cell development; IMP:BHF-UCL. OC Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; CC possibly in part by preventing PAX6 binding to consensus DNA DR GO; GO:0042660; P:positive regulation of cell fate specification; IMP:RGD. 
OC Muroidea; Muridae; Murinae; Rattus. CC sequences (By similarity). DR GO; GO:0045893; P:positive regulation of transcription, DNA-dependent; IC:BHF-UCL. OX NCBI_TaxID=10116; CC -!- SUBCELLULAR LOCATION: Nucleus (By similarity). DR GO; GO:0050678; P:regulation of epithelial cell proliferation; IMP:RGD. RN [1] CC -!- ALTERNATIVE PRODUCTS: DR GO; GO:0045664; P:regulation of neuron differentiation; IDA:RGD. RP NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 1). CC Event=Alternative splicing; Named isoforms=2; DR Gene3D; G3DSA:1.10.10.60; Homeodomain-rel; 1. RA Gimlich R., Arnold G.S., Wawersik S., Maas R., Wong G.; CC Name=1; DR Gene3D; G3DSA:1.10.10.10; Wing_hlx_DNA_bd; 2. RT "Pax-6 is required for pancreatic islet development."; CC IsoId=P63016-1; Sequence=Displayed; DR InterPro; IPR017970; Homeobox_CS. RL Submitted (SEP-1996) to the EMBL/GenBank/DDBJ databases. CC Name=5a; Synonyms=Pax6-5a; DR InterPro; IPR001356; Homeodomain. RN [2] CC IsoId=P63016-2; Sequence=VSP_011531; DR InterPro; IPR009057; Homeodomain-like. RP NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 5A). CC -!- PTM: Ubiquitinated by TRIM11, leading to ubiquitination and DR InterPro; IPR001523; Paired_box_N. RC STRAIN=New England Deaconess Hospital, and Sprague-Dawley; CC proteasomal degradation (By similarity). DR InterPro; IPR011991; WHTH_trsnscrt_rep_DNA-bd. RA Karkour A., Wolf G.M., Walther R.; CC -!- DISEASE: Note=Defects in Pax6 are the cause of a condition known DR Pfam; PF00046; Homeobox; 1. RL Submitted (FEB-2004) to the EMBL/GenBank/DDBJ databases. CC as small eye (Sey) which results in the complete lack of eyes and DR Pfam; PF00292; PAX; 1. RN [3] CC nasal primordia. DR PRINTS; PR00027; PAIREDBOX. RP NUCLEOTIDE SEQUENCE [MRNA] (ISOFORM 5A). CC -!- SIMILARITY: Belongs to the paired homeobox family. DR SMART; SM00389; HOX; 1. RC STRAIN=Sprague-Dawley; TISSUE=Brain; CC -!- SIMILARITY: Contains 1 homeobox DNA-binding domain. DR SMART; SM00351; PAX; 1. RA Wei F.; CC -!- SIMILARITY: Contains 1 paired domain. 
DR SUPFAM; SSF46689; Homeodomain_like; 2. RT "Cloning the homologic isoform gene pax6 5a in the rat."; CC ----------------------------------------------------------------------- DR PROSITE; PS00027; HOMEOBOX_1; 1. RL Submitted (FEB-2004) to the EMBL/GenBank/DDBJ databases. CC Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms DR PROSITE; PS50071; HOMEOBOX_2; 1. RN [4] CC Distributed under the Creative Commons Attribution-NoDerivs License DR PROSITE; PS00034; PAIRED_1; 1. RP NUCLEOTIDE SEQUENCE [LARGE SCALE MRNA] (ISOFORM 1). CC ----------------------------------------------------------------------- DR PROSITE; PS51057; PAIRED_2; 1. RC TISSUE=Heart; DR EMBL; U69644; AAB09042.1; -; mRNA. PE 2: Evidence at transcript level; RX PubMed=15489334; DOI=10.1101/gr.2596504; DR EMBL; AY540905; AAS48919.1; -; mRNA. KW Alternative splicing; Complete proteome; Developmental protein; RG The MGC Project Team; DR EMBL; AY540906; AAS48920.1; -; mRNA. KW Differentiation; DNA-binding; Homeobox; Nucleus; Paired box; RT "The status, quality, and expansion of the NIH full-length cDNA DR EMBL; AJ627631; CAF29075.1; -; mRNA. KW Reference proteome; Transcription; Transcription regulation; RT project: the Mammalian Gene Collection (MGC)."; DR EMBL; BC128741; AAI28742.1; -; mRNA. KW Ubl conjugation. RL Genome Res. 14:2121-2127(2004). DR EMBL; S74393; AAB32671.1; ALT_TERM; mRNA. FT CHAIN 1 422 Paired box protein Pax-6. RN [5] DR IPI; IPI00231698; -. FT /FTId=PRO_0000050187. RP PARTIAL NUCLEOTIDE SEQUENCE [MRNA], AND INVOLVEMENT IN SEY. DR IPI; IPI00464480; -. FT DOMAIN 4 130 Paired. RC STRAIN=Sprague-Dawley; TISSUE=Embryo; DR PIR; S36166; S36166. FT DNA_BIND 210 269 Homeobox. RX MEDLINE=95072652; PubMed=7981749; DOI=10.1038/ng0493-299; DR RefSeq; NP_037133.1; NM_013001.2. FT COMPBIAS 131 209 Gln/Gly-rich. RA Matsuo T., Osumi-Yamashita N., Noji S., Ohuchi H., Koyama E., DR UniGene; Rn.89724; -. FT COMPBIAS 279 422 Pro/Ser/Thr-rich. 
RA Myokai F., Matsuo N., Taniguchi S., Doi H., Iseki S., Ninomiya Y., DR ProteinModelPortal; P63016; -. FT VAR_SEQ 47 47 Q -> QTHADAKVQVLDSEN (in isoform 5a). RA Fujiwara M., Wantanabe T., Eto K.; DR SMR; P63016; 4-136, 211-278. FT /FTId=VSP_011531. RT "A mutation in the Pax-6 gene in rat small eye is associated with DR STRING; P63016; -. FT CONFLICT 159 159 R -> C (in Ref. 3; CAF29075). RT impaired migration of midbrain crest cells."; DR Ensembl; ENSRNOT00000005882; ENSRNOP00000005882; ENSRNOG00000004410. FT CONFLICT 183 183 Q -> G (in Ref. 5; AAB32671). RL Nat. Genet. 3:299-304(1993). DR Ensembl; ENSRNOT00000006302; ENSRNOP00000006302; ENSRNOG00000004410. SQ SEQUENCE 422 AA; 46754 MW; B0B2E5C176A518FE CRC64; RN [6] DR GeneID; 25509; -. MQNSHSGVNQ LGGVFVNGRP LPDSTRQKIV ELAHSGARPC DISRILQVSN GCVSKILGRY RP FUNCTION. DR KEGG; rno:25509; -. YETGSIRPRA IGGSKPRVAT PEVVSKIAQY KRECPSIFAW EIRDRLLSEG VCTNDNIPSV RX MEDLINE=21869997; PubMed=11880342; DR UCSC; RGD:3258; rat. SSINRVLRNL ASEKQQMGAD GMYDKLRMLN GQTGSWGTRP GWYPGTSVPG QPTQDGCQQQ RA Takahashi M., Osumi N.; DR CTD; 5080; -. EGQGENTNSI SSNGEDSDEA QMRLQLKRKL QRNRTSFTQE QIEALEKEFE RTHYPDVFAR RT "Pax6 regulates specification of ventral neurone subtypes in the DR RGD; 3258; Pax6. ERLAAKIDLP EARIQVWFSN RRAKWRREEK LRNQRRQASN TPSHIPISSS FSTSVYQPIP RT hindbrain by establishing progenitor domains."; DR eggNOG; NOG326044; -. QPTTPVSSFT SGSMLGRTDT ALTNTYSALP PMPSFTMANN LPMQPPVPSQ TSSYSCMLPT RL Development 129:1327-1338(2002). DR GeneTree; ENSGT00650000093130; -. SPSVNGRSYD TYTPPHMQTH MNSQPMGTSG TTSTGLISPG VSVPVQVPGS EPDMSQYWPR DR HOVERGEN; HBG009115; -. LQ DR KO; K08031; -. //
  • 7. Functional Annotation • Annotation is overloaded: – Here we mean “high level” • Knowledge associated with the data • Aimed at the human reader
  • 9. Swiss-Prot Entry P26367 – PAX6_HUMAN (Homo sapiens) 43 Sentences
  • 11. TrEMBL Entry A4PBK5 – A4PBK5_9METZ (Ephydatia fluviatilis) 1 Sentence
  • 12. Annotation Quality • Annotation is highly variable – E.g. Automated Vs. Manual • Current approaches rely upon specific database structure/features – Ontology – Evidence Codes • Can we develop a metric based on free text?
  • 13. Why UniProtKB? • UniProtKB is well known and established • Number of technical reasons: – UniProtKB composed of TrEMBL and Swiss-Prot – Historical version – Cross species • Lack of gold standard
  • 14. Applying Power Laws
  • 15. Investigating Word Occurrences • Extract word occurrence from all annotation
  • 16. Investigating Word Occurrences • Extract word occurrence from all annotation 1. Protein 2. Proteins 3. Chains 4. Chain 5. Sequence 6. Enzyme 7. Complex
  • 17. Word Occurrences in Wikipedia Taken from: http://en.wikipedia.org/wiki/File:Wikipedia-n-zipf.png
  • 18. Zipf’s Principle of Least Effort • Take word occurrences and apply to Zipf’s Principle of Least Effort • Human nature to take path of least effort when achieving a goal
        α value      | Examples in literature                                                           | Least effort for
        α < 1.6      | Advanced schizophrenia, young children                                           | -
        1.6 < α < 2  | Military combat texts, Wikipedia, web pages listed on the open directory project | Annotator
        α = 2        | Single author texts                                                              | Equal
        2 < α < 2.4  | Multi author texts                                                               | Audience
        α > 2.4      | Fragmented discourse schizophrenia                                               | -
  • 19. Data Extraction
  • 20. The Model & Resulting Graphs • Power Law Distribution • Logarithmic scales • X-axis: Size • Y-axis: Probability • A point represents the probability that a word will occur X or more times • E.g. upper left most point: – Probability that a word occurs at least once = 10^0
  • 21. Does UniProtKB obey a power-law? • Broadly, yes. However, distinct structure?
  • 22. The removal of copyright • Development of two slopes – As seen in mature resources
  • 23. Quality of Biological Knowledge? • How does automated annotation compare to manual annotation? – i.e. TrEMBL Vs. Swiss-Prot • Assume Swiss-Prot acts as a more mature resource than TrEMBL • Analyse this by comparing annotations at equivalent points in time
  • 24. Viewing over time
  • 25. Viewing over time • Show just alpha values • Appears to be becoming optimised (least effort) for annotator
  • 26. Annotation Maturity • Does this decrease happen because entries are, on average, getting older?
  • 27. Annotation Maturity • Want to abstract from size and analyse how individual records are maturing • Need essentially a set of records which relate to a defined set of proteins • Therefore extract entries common in both Swiss-Prot V9 and UniProtKB V15
  • 28. Annotation maturity
  • 29. Analysing new annotations • Mature entries are decreasing • How are new annotations impacted? • Take annotations from entries that appear for the first time in a given database version
  • 30. The impact of new annotations
  • 31. Explanation for the decrease? • Annotation curation involves identifying similar entries • Annotations between these entries are standardised • Is this standardisation changing the way entries are annotated? – Subsequently placing the least effort onto the annotator?
  • 32. Conclusions • Approach acting as a quality measure – Detection of artefacts – Distinction between TrEMBL and Swiss-Prot • Annotations in UniProtKB are becoming optimised for the annotator rather than the reader – Constant increase of data & pressure on curators – Also true for existing and new annotations
  • 33. Summary • The biological community lacks a generic quality metric that allows biological annotation to be quantitatively assessed and compared. • Here we investigated word reuse within bulk textual annotation and related it to Zipf's Principle of Least Effort. • Straightforward approach once data extracted • Holds promise of being useful for curators and end users
  • 34. Thank You! Colin Gillespie, Daniel Swan and Phillip Lord • Many thanks go to: Allyson Lister (1), Daniel Barrell (2), UniProt Helpdesk • (1) Newcastle University, UK; (2) EBI • Michael Bell m.j.bell1@ncl.ac.uk www.michaeljbell.co.uk
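
The cumulative plot described on slide 20 can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function names `ccdf` and `estimate_alpha` and the toy word counts are hypothetical. The exponent estimate uses a commonly cited closed-form approximation to the maximum-likelihood estimate for discrete power laws, which may differ from the fitting procedure actually used in the talk (the speaker notes refer to a regression line) and tends to be rough when `x_min` is small.

```python
import math
from collections import Counter


def ccdf(word_counts):
    """Return (x, P(X >= x)) pairs, one per distinct occurrence count x."""
    sizes = list(word_counts.values())
    n = len(sizes)
    size_freq = Counter(sizes)
    points = []
    remaining = n  # number of words that occur at least x times
    for x in sorted(size_freq):
        points.append((x, remaining / n))
        remaining -= size_freq[x]
    return points


def estimate_alpha(word_counts, x_min=1):
    """Approximate power-law exponent for counts >= x_min.

    Assumption: alpha ~ 1 + n / sum(ln(x_i / (x_min - 0.5))), an
    approximation for discrete data, not the talk's own fitting method.
    """
    tail = [c for c in word_counts.values() if c >= x_min]
    if not tail:
        raise ValueError("no word occurs x_min or more times")
    return 1.0 + len(tail) / sum(math.log(c / (x_min - 0.5)) for c in tail)


# Hypothetical toy counts, only to show the shape of the output.
toy_counts = {"protein": 5, "chain": 2, "enzyme": 1, "complex": 1}
for x, p in ccdf(toy_counts):
    print(f"P(X >= {x}) = {p:.2f}")
```

Plotting the `ccdf` points on logarithmic axes gives the kind of graph the slides describe: the upper-left point always has probability 1, because every word in the corpus occurs at least once.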

Editor's notes

  1. For example this is an analysis over Wikipedia. And we find that word occurrence size ranked by the word broadly obeys a power law. Taken from - http://en.wikipedia.org/wiki/File:Wikipedia-n-zipf.png
  2. We can relate these power laws to Zipf's principle of least effort. This states that... Point about reader and author. Different texts resolve this in different ways. By taking the exponent of the regression line (alpha), we can see that for Wikipedia the least effort is placed on the curator.
  3. The first step of our approach is to extract the necessary data from UniProtKB. Our extraction process consists of four key steps. First, we obtain each version of Swiss-Prot and TrEMBL and extract just those lines that hold comments. We then extract all the words from this data and remove topic and block headings. Finally, we count how frequently each word occurs; the output is a list of all words and their occurrence counts.
  4. We can then fit a power-law distribution to this data. The result is a graph, as shown here. The graph is a cumulative distribution function, shown on logarithmic scales. Along the X axis we have the size of a word, that is, how frequently it occurs, whilst along the Y axis we have the probability of a word occurring X or more times. This graph isn't straightforward, so as an example: the top-left point shows that the probability of a word occurring once or more is 1, as only words that occur within the corpus are used. Conversely, the point at the bottom right shows that the probability of a word occurring over 100,000 times is very small. We can now apply this approach, initially to Swiss-Prot.
  5. The first question to ask is: does Swiss-Prot obey a power law? It does broadly appear to, yes. However, there is a distinct structure, or kink, in the tail of the graph in a number of versions. So the question here is, what is this kink?
  6. Copyright statement added to every entry in a version. Therefore we see these statements here. Sort out graphs. It turns out to be copyright statements. This shows that using this approach we can detect the introduction of data with no biological significance. It also shows that our approach is acting as a measure of quality, albeit for detecting artefacts.
  7. Although Swiss-Prot obeys a power law, does it relate to the quality of biological knowledge? One way we can address this question is to compare automated and manual annotation. As shown previously, we can make the assumption that Swiss-Prot acts as a more mature resource than TrEMBL. So by analysing annotations at similar points in time between the two resources, we would expect Swiss-Prot to look like the more mature of the two.
  8. By overlaying the graphs for TrEMBL and Swiss-Prot we can more clearly see how they mature over time. It is clear from this slide that they appear to diverge over time, with TrEMBL showing higher levels of re-use and Swiss-Prot showing a richer use of vocabulary. So it does indeed suggest our approach is acting as a measure of quality. However, the main analytical value of these graphs comes from the alpha value.
  9. So we can show the alpha value over time for both Swiss-Prot and TrEMBL. This also provides a clearer image of effort over time. We can see how both databases show a decrease over time, that is, they appear to be becoming least effort for the annotator, although this progression is much more irregular in TrEMBL. This view shows two major disjuncts in TrEMBL, which appear to coincide with changes to the underlying annotation process in TrEMBL. One possible explanation for this decrease is the age of entries.
  10. So one possibility for this decrease is that entries are, on average, getting older as the database gets older. This isn't the case, however: the average entry age is actually decreasing over time. This is mainly due to new records being added exponentially, outnumbering the old records.
  11. So here we want to abstract from the size of the database and ask: how are individual records maturing? This isn't straightforward; essentially we need a set of records which relate to a defined set of proteins. Therefore, we extract those entries that are common to Swiss-Prot version 9 and UniProtKB/Swiss-Prot version 15, providing a span of over 20 years.
  12. Again, highlight the slope of the graph we are looking at, and that we are looking at the subsets. As with the database as a whole, we again see a decrease. However, this isn't as low as the remainder of the database.
  13. Re-iterate the earlier question, and that this is another way to look at it. All annotations are approximately the same age, as they are new.
  14. As we see a decrease in the mature entries, how do the new entries fare? Similarly, they decrease over time, which is the same pattern as all other graphs. Why do we see all of these decreases...?
  15. The protocol was recently published (2011) and again shows the advantage of using UniProtKB, as it is well documented. Need to be careful we don't say “standardisation == poor quality”; it is just something that can explain it, rather than a definitive answer. Rather, it is more likely that trying to be consistent has lost some of the more “personal” annotations to entries, which have thus become more generic. TOO WORDY
  16. Try to finish on a high here. Give a very quick and brief recap of the main idea and points, and how it is “easy” and could be useful for both curators and end users alike.
  17. Word Cloud is from UniProtKB/Swiss-Prot Version 15
  18. A number of competing models were considered. However, the power-law distribution gives a good balance between model parsimony and fit. We only deal with the discrete power-law distribution here, which has the probability mass function described. To fit the power-law distribution we followed a Bayesian paradigm. Xmin, determined using the BIC criterion, was set to 50 throughout.
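
The extraction steps in note 3 can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes the UniProtKB flat-file layout seen in the PAX6_RAT example (comment lines start with `CC`, topic headings look like `-!- FUNCTION:`), and the function name, the tokenising regular expression and the lower-casing are all assumptions made for the example. Block headings are not handled here.

```python
import re
from collections import Counter

# Assumed shape of a topic heading at the start of a comment line, e.g. "-!- FUNCTION: "
TOPIC_HEADING = re.compile(r"^-!-\s+[A-Z][A-Z ]*:\s*")
# Assumed definition of a "word": letters/digits, optionally joined by - or '
WORD = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")


def count_comment_words(lines):
    """Count word occurrences in the CC (comment) lines of a flat-file dump."""
    counts = Counter()
    for line in lines:
        if not line.startswith("CC"):
            continue  # keep only the lines that hold comments
        text = line[2:].strip()
        text = TOPIC_HEADING.sub("", text)  # drop the topic heading, keep the prose
        counts.update(word.lower() for word in WORD.findall(text))
    return counts


example = [
    "ID   PAX6_RAT             Reviewed;       422 AA.",
    "CC   -!- FUNCTION: Transcription factor with important functions in the",
    "CC       development of the eye, nose, central nervous system and pancreas.",
    "DR   GO; GO:0000790; C:nuclear chromatin; IDA:BHF-UCL.",
]
word_counts = count_comment_words(example)
```

The resulting `Counter` maps each word to its number of occurrences, which is the "list of all words and their occurrence" that the later steps work from.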
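
The graph described in note 4 plots, for each word size x, the probability that a word occurs x or more times. A small sketch of how those points could be computed from a word-count table is given below; the function name and the toy counts are hypothetical, and the plotting itself (log-log axes) is left out.

```python
from collections import Counter


def empirical_ccdf(word_counts):
    """Return (x, P(size >= x)) pairs, where size is a word's occurrence count."""
    sizes = Counter(word_counts.values())  # size -> number of words with that size
    total = sum(sizes.values())            # number of distinct words in the corpus
    points = []
    remaining = total                      # words whose size is >= current x
    for x in sorted(sizes):
        points.append((x, remaining / total))
        remaining -= sizes[x]
    return points


# Toy corpus: two words seen once, one seen twice, one seen five times.
toy_counts = {"kinase": 1, "nuclear": 1, "binding": 2, "protein": 5}
ccdf = empirical_ccdf(toy_counts)
```

The first point always has probability 1, matching the remark in note 4 that only words that occur in the corpus are counted.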
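
Note 11 relies on finding the entries shared by two database versions. One way this might be done is sketched below, under stated assumptions: entries are matched when they share any accession number on their `AC` lines, and entries end with a `//` line. The notes do not say how the matching was actually performed, so both function names and the matching rule are illustrative only.

```python
def accessions_by_entry(lines):
    """Yield the set of accession numbers for each entry in a flat-file dump."""
    current = set()
    for line in lines:
        if line.startswith("AC"):
            # AC lines hold semicolon-separated accession numbers
            current.update(a.strip() for a in line[2:].split(";") if a.strip())
        elif line.startswith("//"):
            # assumed end-of-entry marker
            if current:
                yield frozenset(current)
            current = set()
    if current:
        yield frozenset(current)


def common_entries(old_lines, new_lines):
    """Entries of the newer version that share an accession with the older one."""
    old_accessions = set().union(*accessions_by_entry(old_lines))
    return [accs for accs in accessions_by_entry(new_lines) if accs & old_accessions]


old_version = ["ID   A", "AC   P32117;", "//", "ID   B", "AC   P00001;", "//"]
new_version = ["AC   P63016; A1A5N7; P32117;", "//", "AC   Q99999;", "//"]
shared = common_entries(old_version, new_version)
```

Matching on any shared accession, rather than only the first one listed, is a design choice made here so that an entry whose primary accession changed between versions is still picked up.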
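
The probability mass function mentioned in note 18 is not reproduced in these notes. For reference, the standard form of the discrete power-law distribution is given below; it is an assumption that this is the form shown on the slide. Here $\alpha$ is the exponent, $x_{\min}$ is the lower cut-off (set to 50 in note 18), and $\zeta$ is the Hurwitz zeta function acting as the normalising constant.

```latex
\Pr(X = x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})},
\qquad x = x_{\min},\, x_{\min}+1,\, \dots,
\qquad
\zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha}.
```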