The annotation of Plant Proteins in UniProtKB
Michel Schneider

Plant protein annotation program, Swiss-Prot group
Swiss Institute of Bioinformatics
Geneva, Switzerland
Michel.Schneider@isb-sib.ch

1. The UniProt consortium and its products

2. Content of an entry in UniProtKB and manual curation

3. Complete proteomes and reference proteomes

4. Synchronization between UniProtKB and TAIR

5. Some statistics




        “Pioneers at the Heart of Science” 1998 – 2008
                         PAG XX, San Diego, January 15, 2012
The UniProt consortium




The missions of the UniProt consortium
Provide the scientific community with a resource of protein
sequence and functional annotation which has to be …


 comprehensive

 high quality

 and freely accessible


Four components to fulfill specific demands

UniProtKB – Protein Knowledgebase
   UniProtKB/Swiss-Prot: reviewed, manual curation (533’657 entries)
   UniProtKB/TrEMBL: unreviewed, automated annotation (19 million entries)

UniRef – Sequence clusters: UniRef100, UniRef90, UniRef50

UniMES – Metagenomic and environmental sample sequences

UniParc – Sequence archive; contains current and obsolete sequences (29.6 million sequences)

UniProtKB, the expertly curated
component of UniProt


 The high-quality curated protein knowledge database

     where data becomes structured knowledge




(Image: Shigeo Fukuda)

Protein sequence (one gene - one species), surrounded by:

 Protein and gene names
 Taxonomic information
 General annotation: function, subcellular location, catalytic activity, tissue specificity, disruption phenotype…
 Sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide…
 References
 Keywords and Gene Ontology
 Cross-references (~130 databases)

© 2009 SIB
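
As a rough illustration of the entry layout above, the sections can be modelled as a small record type. This is a hypothetical sketch, not the UniProtKB schema: the class name `EntrySketch` and all of its field names are assumptions. The accession, entry name, protein name and gene come from the QPCT_ARATH example shown in this talk; the sequence is deliberately left empty rather than reproduced.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EntrySketch:
    """Hypothetical model of one entry: all products of one gene in one species."""
    accession: str
    entry_name: str
    protein_name: str
    gene_names: List[str]
    organism: str  # taxonomic information
    sequence: str  # canonical protein sequence
    # General annotation, e.g. {"function": "...", "subcellular location": "..."}
    general_annotation: Dict[str, str] = field(default_factory=dict)
    # Sequence annotation as (feature type, start, end), e.g. ("signal peptide", 1, 20)
    sequence_annotation: List[Tuple[str, int, int]] = field(default_factory=list)
    references: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    go_terms: List[str] = field(default_factory=list)
    # Cross-references as {database name: [identifiers]}
    cross_references: Dict[str, List[str]] = field(default_factory=dict)


entry = EntrySketch(
    accession="Q84WV9",
    entry_name="QPCT_ARATH",
    protein_name="Glutaminyl-peptide cyclotransferase",
    gene_names=["At4g25720"],
    organism="Arabidopsis thaliana",
    sequence="",  # left empty: the real sequence is not reproduced here
)
```

Keeping each annotation type in its own field mirrors the way the slide separates general annotation from position-specific sequence annotation.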
Origin of the sequences in UniProtKB


 International Nucleotide Sequence Database Collaboration
  (INSDC)
 Ensembl or EnsemblGenomes
 RefSeq
 Direct submissions (protein sequences)
 Literature
 Protein Data Bank


The process of manual sequence curation
    1. Select entry/gene (priorities)

    2. Identify entries from same gene and homologs
       using BLAST against UniProtKB

    3. Merge entries from the same gene and same
       species into a single record

    4. Select a canonical sequence
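
Steps 3 and 4 above can be sketched as a grouping operation. This is a minimal illustration under stated assumptions, not the Swiss-Prot curation pipeline: the `RawEntry` type, the `merge_by_gene` function, the accessions and the sequences in the example are all hypothetical, and picking the longest sequence as canonical is only a simple stand-in for the curator's decision.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class RawEntry(NamedTuple):
    accession: str  # hypothetical identifier
    gene: str
    species: str
    sequence: str


def merge_by_gene(entries: List[RawEntry]) -> Dict[Tuple[str, str], dict]:
    """Group entries by (gene, species) and build one merged record per group."""
    groups: Dict[Tuple[str, str], List[RawEntry]] = defaultdict(list)
    for e in entries:
        groups[(e.gene, e.species)].append(e)

    merged = {}
    for key, members in groups.items():
        # Assumption: the longest distinct sequence is taken as canonical;
        # ties are broken alphabetically so the result is deterministic.
        distinct = sorted({m.sequence for m in members}, key=lambda s: (-len(s), s))
        merged[key] = {
            "accessions": sorted(m.accession for m in members),
            "canonical": distinct[0],
            "other_sequences": distinct[1:],
        }
    return merged


# Made-up accessions and sequences, for illustration only.
entries = [
    RawEntry("X00001", "At4g25720", "Arabidopsis thaliana", "MKTAYIAK"),
    RawEntry("X00002", "At4g25720", "Arabidopsis thaliana", "MKTAYIAKQR"),
    RawEntry("X00003", "At1g01010", "Arabidopsis thaliana", "MEDQVGFG"),
]
records = merge_by_gene(entries)
```

Here the two entries for At4g25720 collapse into one record that lists both accessions, while the non-canonical sequence is kept alongside for the curator to review.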


Critical analysis and report of sequence discrepancies
QPCT_ARATH (Q84WV9) Glutaminyl-peptide cyclotransferase (At4g25720)




Literature-based curation
 Identify relevant papers by searching literature
  databases

 Read the full text of the papers, then extract and summarize
  the relevant information




Controlled vocabularies
• Keywords provide a summary of the entry content
• We annotate using the Gene Ontology (GO)




UniProtKB, complete proteome
sequence sets
  • Genome completely sequenced

  • Proteins mapped to the genome

  2’902 complete proteomes

  Fully manually reviewed (e.g. S. cerevisiae)
  Partially manually reviewed (e.g. A. thaliana)
  Unreviewed (e.g. Chlorella variabilis)
UniProtKB, reference proteome
sequence sets
A reference proteome is the complete proteome of a
representative, well-studied model organism or an organism
of interest for biomedical research.

509 reference proteomes




Arabidopsis thaliana



The building of the complete proteome sequence set:

• Based on the re-annotation of the complete genome by TAIR:

  27’416 protein coding genes



UniProtKB – TAIR synchronization

Starting point (release 2011_03 - Mar 08, 2011):
  • cDNAs, ESTs, genomic sequences → Nucleic acid databases
  • UniProtKB/TrEMBL, unreviewed (40’574 entries)
  • UniProtKB/Swiss-Prot, reviewed (10’340 entries)

Synchronization:
  1. Genome re-annotation (35’386 gene products) → Temporary TrEMBL set (33’341 entries)
  2. 11’508 sequences: compare translations from the same gene, merge if 100 %
     identical, report sequence discrepancies, align with orthologs and paralogs
  3. Correct gene models or add new isoforms (283 corrections)
  4. Feedback to TAIR (90 gene models)
  5. Cleaned set of new TrEMBL entries (21’656 entries)

Result (release 2011_12 - Dec 14, 2011):
  • UniProtKB/TrEMBL, unreviewed (44’628 entries)
  • UniProtKB/Swiss-Prot, reviewed (10’875 entries)
  • Arabidopsis thaliana, cv. Columbia, complete proteome: 32’521 entries
    = cleaned set of new TrEMBL entries (21’656 entries)
    + UniProtKB/Swiss-Prot, reviewed (10’865 entries)
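
The comparison step of the synchronization (merge if 100 % identical, otherwise report sequence discrepancies) can be illustrated with a small function. This is only a sketch: `compare_translations` is a hypothetical helper, the sequences are made up, and it compares residues position by position without any alignment, which a real comparison of translations containing insertions or deletions would require.

```python
from typing import List, Tuple


def compare_translations(a: str, b: str) -> Tuple[bool, List[str]]:
    """Return (identical, discrepancies) for two translations of the same gene."""
    if a == b:
        return True, []  # 100 % identical: the two entries can be merged
    notes = []
    # Position-by-position comparison over the shared length (1-based positions).
    for pos, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            notes.append(f"{pos}: {x} -> {y}")
    if len(a) != len(b):
        notes.append(f"length: {len(a)} vs {len(b)}")
    return False, notes


# Made-up sequences differing at one position.
identical, notes = compare_translations("MKTAYIAK", "MKTGYIAK")
```

In this example `identical` is `False` and `notes` holds a single discrepancy at position 4, which is the kind of difference that would be reported rather than silently merged.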
1001 Arabidopsis genomes

• Deposited to INSDC?

• Fully annotated? With CDS?

• Should we still merge all the identical sequences together?

• If they are not merged but kept separate, how do we get
  relevant BLAST results?
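
One possible way to frame the last two questions, offered only as an illustration and not as UniProt's answer: identical sequences can be collapsed into a single searchable representative while every source label is retained, so a hit to the representative can still be traced back to each genome it came from. The function name `collapse_identical`, the labels and the sequences below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def collapse_identical(records: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each distinct sequence to the labels of all records that carry it."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for label, sequence in records:
        clusters[sequence].append(label)
    return dict(clusters)


# Made-up labels and sequences, for illustration only.
strains = [
    ("strain_A", "MKTAYIAK"),
    ("strain_B", "MKTAYIAK"),
    ("strain_C", "MKTGYIAK"),
]
clusters = collapse_identical(strains)
```

Searching only the distinct sequences keeps the result list short, and the label lists record which genomes share each sequence.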


Some UniProtKB/Swiss-Prot Statistics
concerning plant entries
(UniProt release 2011_12 - Dec 14, 2011)


• 31’959 entries of Viridiplantae
• from 1’924 species
• 10’875 entries from Arabidopsis thaliana (with 1’219 isoforms)
• 2’823 entries from Oryza sativa subsp. japonica
• 11’897 plant entries with an EC number
• 966 different complete EC numbers
• 5’744 putative transporters or proteins involved in transport
Summary
UniProtKB/Swiss-Prot, the manually curated knowledgebase:

• Protein sequence database covering all kingdoms of life (533’657
  sequence entries; 12’664 species)
• Manually annotated
• Non-redundant: all products of one gene in one species in a single entry
• Highly cross-referenced (links to ~130 databases).

Plant protein annotation:

• Complete proteome for Arabidopsis thaliana

• Synchronization with TAIR

We need your feedback and your collaboration!

                   help@uniprot.org




Acknowledgements
SIB
Ioannis Xenarios, Lydie Bougueleret, Andrea Auchincloss, Kristian Axelsen, Delphine Baratin, Marie-Claude Blatter,
Brigitte Boeckmann, Jerven Bolleman, Laurent Bollondi, Emmanuel Boutet, Lionel Breuza, Alan Bridge, Edouard de
Castro, Lorenzo Cerutti, Elisabeth Coudert, Béatrice Cuche, Mikael Doche, Dolnide Dornevil, Severine Duvaud, Anne
Estreicher, Livia Famiglietti, Marc Feuermann, Sebastien Gehant, Elisabeth Gasteiger, Vivienne Gerritsen, Arnaud Gos,
Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Nicolas Hulo, Janet James, Florence Jungo, Guillaume Keller,
Vicente Lara, Philippe Lemercier, Damien Lieberherr, Xavier Martin, Patrick Masson, Anne Morgat, Salvo Paesano, Ivo
Pedruzzi, Sandrine Pilbout, Sylvain Poux, Monica Pozzato, Manuela Pruess, Nicole Redaschi, Catherine Rivoire, Bernd
Roechert, Michel Schneider, Christian Sigrist, Karin Sonesson, Sylvie Staehli, Eleanor Stanley, André Stutz, Shyamala
Sundaram, Michael Tognolli, Laure Verbregue and Anne-Lise Veuthey

EBI
Rolf Apweiler, Maria Jesus Martin, Claire O'Donovan, Michele Magrane, Yasmin Alam-Faruque, Ricardo Antunes,
Benoit Bely, Mark Bingley, David Binns, Lawrence Bower, Wei Mun Chan, Emily Dimmer, Francesco Fazzini, Alexander
Fedotov, John Garavelli, Leyla Garcia Castro, Rachael Huntley, Julius Jacobsen, Michael Kleen, Duncan Legge, Wudong
Liu, Jie Luo, Sandra Orchard, Samuel Patient, Klemens Pichler, Diego Poggioli, Nikolas Pontikos, Steven Rosanoff, Tony
Sawford, Harminder Sehra, Edward Turner, Matt Corbett, Mike Donnelly and Pieter van Rensburg

PIR
Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Winona C. Barker, Chuming Chen, Yongxing Chen, Pratibha Dubey,
Hongzhan Huang, Kati Laiho, Raja Mazumder, Peter McGarvey, Darren A. Natale, Thanemozhi G. Natarajan, Jules
Nchoutmboube, Natalia V. Roberts, Baris E. Suzek, Uzoamaka Ugochukwu, C. R. Vinayaka, Qinghua Wang, Yuqi Wang,
Lai-Su Yeh and Jian Zhang




                                      www.uniprot.org
UniProt is mainly supported by the National Institutes of
Health (NIH) grant 1 U41 HG006104-01. Additional support for
the EBI's involvement in UniProt comes from the NIH grant
2P41 HG02273-07. Swiss-Prot activities at the SIB are
supported by the Swiss Federal Government through the
Federal Office of Education and Science and the European
Commission contracts SLING (226073), Gen2Phen (200754)
and MICROME (222886). PIR activities are also supported by
the NIH grants 5R01GM080646-04, 3R01GM080646-04S2,
1G08LM010720-01, and 3P20RR016472-09S2, and NSF grant
DBI-0850319.



[BuildWithAI] Introduction to Gemini.pdf
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

The annotation of plant proteins in UniProtKB

  • 1. The annotation of Plant Proteins in UniProtKB. Michel Schneider, Plant protein annotation program, Swiss-Prot group, Swiss Institute of Bioinformatics, Geneva, Switzerland. Michel.Schneider@isb-sib.ch
  • 2. Outline: 1. The UniProt consortium and its products; 2. Content of an entry in UniProtKB and manual curation; 3. Complete proteomes and reference proteomes; 4. Synchronization between UniProtKB and TAIR; 5. Some statistics. “Pioneers at the Heart of Science” 1998 – 2008. PAG XX, San Diego, January 15, 2012
  • 3. The UniProt consortium
  • 4. The missions of the UniProt consortium: provide the scientific community with a resource of protein sequence and functional annotation which has to be comprehensive, high quality, and freely accessible.
  • 5. Four components to fulfill specific demands. UniProtKB (Protein Knowledgebase): UniProtKB/Swiss-Prot, reviewed, manual curation (533’657 entries); UniProtKB/TrEMBL, unreviewed, automated annotation (19 million entries). UniRef (sequence clusters): UniRef100, UniRef90, UniRef50. UniMES: metagenomic and environmental sample sequences. UniParc: sequence archive, contains current and obsolete sequences (29.6 million sequences).
  • 6. UniProtKB, the expertly curated component of UniProt: the high-quality curated protein knowledge database where data becomes structured knowledge.
  • 7. UniProtKB, the expertly curated component of UniProt. Shigeo Fukuda
  • 8. Protein sequence. One gene - One species. © 2009 SIB
  • 9. Protein and gene names. Taxonomic information. Protein sequence. One gene - One species.
  • 10. Protein and gene names. Taxonomic information. Sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide… Protein sequence. One gene - One species.
  • 11. Protein and gene names. Taxonomic information. General annotation: function, subcellular location, catalytic activity, tissue specificity, disruption phenotype… Sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide… Protein sequence. One gene - One species.
  • 12. Protein and gene names. Taxonomic information. General annotation: function, subcellular location, catalytic activity, tissue specificity, disruption phenotype… Sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide… References. Protein sequence. One gene - One species.
  • 13. Protein and gene names. Taxonomic information. General annotation: function, subcellular location, catalytic activity, tissue specificity, disruption phenotype… Sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide… References. Keywords - Gene Ontology. Protein sequence. One gene - One species.
  • 14. Protein and gene names. Taxonomic information. General annotation: function, subcellular location, catalytic activity, tissue specificity, disruption phenotype… Sequence annotation: PTMs, alternative splicing products, mutagenesis, transmembrane domains, signal peptide… References. Keywords - Gene Ontology. Cross-references (~130 databases). Protein sequence. One gene - One species.
  • 15. Origin of the sequences in UniProtKB: International Nucleotide Sequence Database Collaboration (INSDC); Ensembl or EnsemblGenomes; RefSeq; direct submissions (protein sequences); literature; Protein Data Bank.
  • 16. The process of manual sequence curation: 1. Select entry/gene (priorities); 2. Identify entries from the same gene and homologs using BLAST against UniProtKB; 3. Merge entries from the same gene and same species into a single record; 4. Select a canonical sequence.
  • 17. Critical analysis and report of sequence discrepancies: QPCT_ARATH (Q84WV9), Glutaminyl-peptide cyclotransferase (At4g25720).
  • 18. Critical analysis and report of sequence discrepancies: QPCT_ARATH (Q84WV9), Glutaminyl-peptide cyclotransferase (At4g25720).
  • 20. Literature-based curation: identify relevant papers through searching literature databases; read the full text of papers and extract and summarize relevant information.
  • 21. Literature-based curation
  • 22. Literature-based curation
  • 23. Literature-based curation
  • 24. Controlled vocabularies: keywords provide a summary of the entry content; we annotate using the Gene Ontology (GO).
  • 25. UniProtKB, complete proteome sequence sets: genome completely sequenced; proteins mapped to the genome. 2’902 complete proteomes: fully manually reviewed (e.g. S. cerevisiae), partially manually reviewed (e.g. A. thaliana), unreviewed (e.g. Chlorella variabilis).
  • 26. UniProtKB, reference proteome sequence sets: a reference proteome is the complete proteome of a representative, well-studied model organism or an organism of interest for biomedical research. 509 reference proteomes.
  • 27. UniProtKB, complete proteome sequence sets
  • 28. Arabidopsis thaliana. The building of the complete proteome sequence set: based on the re-annotation of the complete genome by TAIR (27’416 protein coding genes).
  • 29. UniProtKB – TAIR synchronization. cDNAs, ESTs, genomic sequences. Nucleic acid databases. UniProtKB/TrEMBL, unreviewed (40’574 entries). UniProtKB/Swiss-Prot, reviewed (10’340 entries). Release 2011_03 - Mar 08, 2011.
  • 30. UniProtKB – TAIR synchronization. cDNAs, ESTs, genomic sequences. Nucleic acid databases. Genome re-annotation: 35’386 gene products. Temporary TrEMBL set: 33’341 entries. UniProtKB/TrEMBL, unreviewed (40’574 entries). UniProtKB/Swiss-Prot, reviewed (10’340 entries).
  • 31. UniProtKB – TAIR synchronization. cDNAs, ESTs, genomic sequences. Nucleic acid databases. Genome re-annotation: 35’386 gene products. Temporary TrEMBL set: 33’341 entries. UniProtKB/TrEMBL, unreviewed (40’574 entries). 11’508 sequences. UniProtKB/Swiss-Prot, reviewed (10’340 entries). Compare translations from the same gene, merge if 100 % identical, report sequence discrepancies, align with orthologs and paralogs.
  • 32. UniProtKB – TAIR synchronization. cDNAs, ESTs, genomic sequences. Nucleic acid databases. Genome re-annotation. Temporary TrEMBL set. UniProtKB/TrEMBL, unreviewed. UniProtKB/Swiss-Prot, reviewed. Compare translations from the same gene, merge if 100 % identical, report sequence discrepancies, align with orthologs and paralogs. Feedback to TAIR: 90 gene models; correct gene models or add new isoforms; 283 corrections.
  • 33. UniProtKB – TAIR synchronization. cDNAs, ESTs, genomic sequences. Nucleic acid databases. Genome re-annotation. Temporary TrEMBL set. UniProtKB/TrEMBL, unreviewed. Cleaned set of new TrEMBL entries (21’656 entries). UniProtKB/Swiss-Prot, reviewed.
  • 34. UniProtKB – TAIR synchronization. cDNAs, ESTs, genomic sequences. Nucleic acid databases. Genome re-annotation. Temporary TrEMBL set. UniProtKB/TrEMBL, unreviewed (44’628 entries). UniProtKB/Swiss-Prot, reviewed (10’875 entries). Cleaned set of new TrEMBL entries (21’656 entries) + UniProtKB/Swiss-Prot, reviewed (10’865 entries). Release 2011_12 - Dec 14, 2011. Arabidopsis thaliana, cv. Columbia. Complete proteome: 32’521 entries.
  • 35. 1001 Arabidopsis genomes: Deposited to INSDC? Fully annotated? With CDS? Should we still merge all the identical sequences together? If they are not merged but kept separate, how do we get relevant BLAST results?
  • 36. Some UniProtKB/Swiss-Prot statistics concerning plant entries (UniProt release 2011_12 - Dec 14, 2011): 31,959 entries of Viridiplantae from 1,924 species; 10’875 entries from Arabidopsis thaliana (with 1,219 isoforms); 2,823 entries from Oryza sativa subsp. japonica; 11,897 plant entries with an EC number; 966 different complete EC numbers; 5,744 putative transporters or proteins involved in transport.
  • 37. Summary. UniProtKB/Swiss-Prot, the manually curated knowledgebase: protein sequence database covering all kingdoms of life (533’657 sequence entries; 12’664 species); manually annotated; non-redundant (all products of one gene in one species in a single entry); highly cross-referenced (links to ~130 databases). Plant protein annotation: complete proteome for Arabidopsis thaliana; synchronization with TAIR.
  • 38. We need your feedback and your collaboration! help@uniprot.org
  • 39. Acknowledgements. SIB: Ioannis Xenarios, Lydie Bougueleret, Andrea Auchincloss, Kristian Axelsen, Delphine Baratin, Marie-Claude Blatter, Brigitte Boeckmann, Jerven Bolleman, Laurent Bollondi, Emmanuel Boutet, Lionel Breuza, Alan Bridge, Edouard de Castro, Lorenzo Cerutti, Elisabeth Coudert, Béatrice Cuche, Mikael Doche, Dolnide Dornevil, Severine Duvaud, Anne Estreicher, Livia Famiglietti, Marc Feuermann, Sebastien Gehant, Elisabeth Gasteiger, Vivienne Gerritsen, Arnaud Gos, Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Nicolas Hulo, Janet James, Florence Jungo, Guillaume Keller, Vicente Lara, Philippe Lemercier, Damien Lieberherr, Xavier Martin, Patrick Masson, Anne Morgat, Salvo Paesano, Ivo Pedruzzi, Sandrine Pilbout, Sylvain Poux, Monica Pozzato, Manuela Pruess, Nicole Redaschi, Catherine Rivoire, Bernd Roechert, Michel Schneider, Christian Sigrist, Karin Sonesson, Sylvie Staehli, Eleanor Stanley, André Stutz, Shyamala Sundaram, Michael Tognolli, Laure Verbregue and Anne-Lise Veuthey. EBI: Rolf Apweiler, Maria Jesus Martin, Claire O'Donovan, Michele Magrane, Yasmin Alam-Faruque, Ricardo Antunes, Benoit Bely, Mark Bingley, David Binns, Lawrence Bower, Wei Mun Chan, Emily Dimmer, Francesco Fazzini, Alexander Fedotov, John Garavelli, Leyla Garcia Castro, Rachael Huntley, Julius Jacobsen, Michael Kleen, Duncan Legge, Wudong Liu, Jie Luo, Sandra Orchard, Samuel Patient, Klemens Pichler, Diego Poggioli, Nikolas Pontikos, Steven Rosanoff, Tony Sawford, Harminder Sehra, Edward Turner, Matt Corbett, Mike Donnelly and Pieter van Rensburg. PIR: Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Winona C. Barker, Chuming Chen, Yongxing Chen, Pratibha Dubey, Hongzhan Huang, Kati Laiho, Raja Mazumder, Peter McGarvey, Darren A. Natale, Thanemozhi G. Natarajan, Jules Nchoutmboube, Natalia V. Roberts, Baris E. Suzek, Uzoamaka Ugochukwu, C. R. Vinayaka, Qinghua Wang, Yuqi Wang, Lai-Su Yeh and Jian Zhang. www.uniprot.org
  • 40. UniProt is mainly supported by the National Institutes of Health (NIH) grant 1 U41 HG006104-01. Additional support for the EBI's involvement in UniProt comes from the NIH grant 2P41 HG02273-07. Swiss-Prot activities at the SIB are supported by the Swiss Federal Government through the Federal Office of Education and Science and the European Commission contracts SLING (226073), Gen2Phen (200754) and MICROME (222886). PIR activities are also supported by the NIH grants 5R01GM080646-04, 3R01GM080646-04S2, 1G08LM010720-01, and 3P20RR016472-09S2, and NSF grant DBI-0850319.
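
The merge rule described on slides 16 and 31 (group translations from the same gene, merge them if they are 100 % identical, otherwise report a sequence discrepancy) can be illustrated with a minimal Python sketch. This is not the UniProt curation pipeline: the function name `merge_by_gene`, the tuple layout, and the toy identifiers and sequences are all hypothetical, and real curation also involves alignment with orthologs and paralogs and expert review.

```python
from collections import defaultdict


def merge_by_gene(records):
    """Group records by gene and merge those whose translations are identical.

    records: iterable of (gene_id, accession, sequence) tuples (hypothetical layout).
    Returns (merged, discrepancies):
      merged         maps gene_id -> (sequence, sorted accessions) when every
                     translation of that gene is 100 % identical;
      discrepancies  maps gene_id -> {sequence: sorted accessions} when the
                     translations differ and need manual inspection.
    """
    by_gene = defaultdict(lambda: defaultdict(list))
    for gene_id, accession, sequence in records:
        by_gene[gene_id][sequence].append(accession)

    merged, discrepancies = {}, {}
    for gene_id, seq_to_acc in by_gene.items():
        if len(seq_to_acc) == 1:
            # Only one distinct translation: safe to merge into a single record.
            (sequence, accessions), = seq_to_acc.items()
            merged[gene_id] = (sequence, sorted(accessions))
        else:
            # Several distinct translations: report instead of merging.
            discrepancies[gene_id] = {
                seq: sorted(acc) for seq, acc in seq_to_acc.items()
            }
    return merged, discrepancies


# Toy input with made-up identifiers and sequences.
toy_records = [
    ("GENE_A", "ACC1", "MKT"),
    ("GENE_A", "ACC2", "MKT"),
    ("GENE_B", "ACC3", "MAA"),
    ("GENE_B", "ACC4", "MAV"),
]
merged, discrepancies = merge_by_gene(toy_records)
```

In this toy input, GENE_A ends up in `merged` with both accessions attached, while GENE_B lands in `discrepancies` because its two translations differ by one residue.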

Editor's notes

  1. Alignment of sequences deduced from 2 genomic DNAs, one cDNA and one EST. Annotation of erroneous gene model predictions.
  2. Annotation of isoforms.
  3. Information about how to reconstruct all isoforms. Access to the sequences of all isoforms. Can apply various tools.
  4. The sequencing of 1001 Arabidopsis genomes raises several questions, and we have to find new solutions. If the sequences are not merged, one solution for BLAST is to use UniRef, but this is only valid for functional annotation and not for finding out whether a homologous protein is already known in a given species.
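
The trade-off in note 4 can be made concrete with a small Python sketch that keeps per-strain entries separate but clusters exactly identical sequences, in the spirit of a UniRef100-style cluster. This is a simplification under stated assumptions: the function names, the record layout, and the strain labels are hypothetical, and only exact identity is handled (no sub-fragments and no lower identity thresholds such as those used for UniRef90 or UniRef50).

```python
from collections import defaultdict


def cluster_identical(entries):
    """Cluster entries that share exactly the same sequence.

    entries: iterable of (accession, strain, sequence) tuples (hypothetical layout).
    Returns a list of clusters; the first member seen acts as representative.
    """
    clusters = defaultdict(list)
    for accession, strain, sequence in entries:
        clusters[sequence].append((accession, strain))
    return [
        {"representative": members[0][0], "sequence": seq, "members": members}
        for seq, members in clusters.items()
    ]


def strains_with_sequence(clusters, sequence):
    """Return the strains in which this exact sequence is already known."""
    for cluster in clusters:
        if cluster["sequence"] == sequence:
            return sorted({strain for _, strain in cluster["members"]})
    return []


# Toy data with made-up accessions, strain labels and sequences.
toy_entries = [
    ("ACC1", "strain_1", "MKTL"),
    ("ACC2", "strain_2", "MKTL"),
    ("ACC3", "strain_3", "MKTV"),
]
toy_clusters = cluster_identical(toy_entries)
known_in = strains_with_sequence(toy_clusters, "MKTL")
```

A search against the representatives alone returns one hit per cluster, which keeps results short for functional annotation. Answering whether the protein is already known in a given strain requires the member list, which is why the sketch keeps it alongside each representative.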