PomBase Community Curation: A Fast Track to Capture Expert Knowledge, Antonia Lock, Kim Rutherford, Midori Harris, Mark McDowall, Paul Kersey, Stephen Oliver, Jurg Bahler and Valerie Wood.
Presented at the 5th International Biocuration Conference, hosted by PIR in Washington, DC, April 2-4, 2012.
2. The S. pombe Community
• Medium-sized research community
  • >200 labs, 1300 subscribe to mailing list
  • Close-knit
• GeneDB S. pombe model organism database set up in 2004
  • Maintained by one person (V. Wood)
  • Mainly GO annotation
• Problem:
  • Needed to support additional types of data
  • Too many publications to curate considering the available man-power
4. The Community Curation Initiative
• Pilot study in 2009
  • Highly successful
  • 29/44 responded (no follow-up for non-responders)
  • ~360 new annotations
  • Annotations were generally of high quality – errors were easy to spot
  • Enabled a dialogue between authors and curators
• Process must be simplified
  • Need for a simple tool in which to do the curation, instead of a complicated Word document
• 2010 – Wellcome Trust grant
  • To develop and implement a community curation tool
  • Also to develop a new fission yeast database, ‘PomBase’, which will support a range of additional data types not previously captured in GeneDB
5. Data captured in GeneDB vs. PomBase
Data type                                              | Ontology                                | GeneDB  | PomBase
Function/Process/Component                             | GO                                      | ✔       | ✔
Protein modifications                                  | Protein Modification Ontology           | -       | ✔
Phenotypes                                             | FYPO (Fission Yeast Phenotype Ontology) | Some    | ✔
Interactions                                           | BioGRID                                 | BioGRID | ✔
Gene expression                                        | In-house vocabulary                     | -       | ✔
Misc features (disease associations, complementation…) | In-house vocabulary                     | ✔       | ✔
The increased breadth makes community curation even more important
6. Phenotype Ontology
• User survey 2007: phenotypes were identified as the single most desirable information type not supported by GeneDB S. pombe.
• Need for a pre-composed Fission Yeast Phenotype Ontology
  • Ease for community curation
  • Needed greater specificity of terms than that offered by existing phenotype ontologies
• Term is accompanied by two types of information:
  • Allele description – deletion, overexpression or mutation
  • Experimental conditions where appropriate
• Combination of different ontologies used to create formal definitions
  • E.g. PATO, ChEBI, GO
  • Example: PATO "resistance to" + ChEBI "thiabendazole" gives FYPO "resistance to thiabendazole"
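As a rough illustration of the points above, a phenotype annotation could be modelled as a pre-composed term plus an allele description and optional conditions. This is only a sketch: the class, field and function names, the gene symbol and the condition text are all hypothetical and are not taken from FYPO or PomBase.

```python
from dataclasses import dataclass

# Allele descriptions named on the slide.
ALLELE_TYPES = {"deletion", "overexpression", "mutation"}


def compose_term(quality: str, chemical: str) -> str:
    """Join a PATO-style quality and a ChEBI-style chemical name into one label."""
    return f"{quality} {chemical}"


@dataclass(frozen=True)
class PhenotypeAnnotation:
    gene: str               # gene symbol (placeholder below)
    term_name: str          # pre-composed phenotype term name
    allele_type: str        # one of ALLELE_TYPES
    conditions: tuple = ()  # experimental conditions, where appropriate

    def __post_init__(self):
        # Reject allele descriptions outside the three listed types.
        if self.allele_type not in ALLELE_TYPES:
            raise ValueError(f"unknown allele type: {self.allele_type}")


term = compose_term("resistance to", "thiabendazole")
annotation = PhenotypeAnnotation(
    gene="geneX",                          # hypothetical gene symbol
    term_name=term,
    allele_type="deletion",
    conditions=("hypothetical condition",),
)
```

Because the term is pre-composed, the curator picks one label rather than assembling the quality and the chemical separately.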
7. GO Term Extensions
GO ID: GO:0004674
Term: protein serine/threonine kinase activity
Extension           | Evidence | With/From | Source
has_substrate pom1  | IDA      |           | Yoon HJ et al. (2006)
has_substrate rum1  | IDA      |           | Noguchi E et al. (2002)
has_substrate rbp80 | IDA      |           | Holig K et al. (2009)
has_substrate sin1  | IDA      |           | Jang YJ et al. (1997)
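One way to picture a term extension is as a relation and a target attached to an ordinary GO annotation. The sketch below uses the four rows from this slide; the class and field names are hypothetical, and the `relation(target)` string form is assumed here to follow the style of the GAF annotation-extension column rather than anything shown on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtendedAnnotation:
    go_id: str
    term: str
    relation: str   # e.g. has_substrate
    target: str     # the gene product the relation points at
    evidence: str   # GO evidence code
    source: str     # supporting publication

    def extension(self) -> str:
        """Render the extension as relation(target)."""
        return f"{self.relation}({self.target})"


# (target, source) pairs as listed on the slide.
ROWS = [
    ("pom1", "Yoon HJ et al. (2006)"),
    ("rum1", "Noguchi E et al. (2002)"),
    ("rbp80", "Holig K et al. (2009)"),
    ("sin1", "Jang YJ et al. (1997)"),
]

annotations = [
    ExtendedAnnotation(
        go_id="GO:0004674",
        term="protein serine/threonine kinase activity",
        relation="has_substrate",
        target=target,
        evidence="IDA",
        source=source,
    )
    for target, source in ROWS
]

# The extension makes the substrates queryable, not just readable.
substrates = [a.target for a in annotations]
```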
8. Why Not a Wiki?
• Traditionally biologists would study one gene/protein
  • Individual text-based gene pages were an ideal format
• Many techniques used today generate gene lists
  • Enrichment analyses identify patterns in the data set, e.g. are certain processes common to the group of genes?
  • Need annotations to controlled vocabularies to make efficient, computerized comparisons
  • A wiki, essentially free text, does not provide this
• All annotations are supported by evidence
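To make the contrast with free text concrete, the sketch below shows the kind of comparison that controlled-vocabulary annotations allow across a gene list. The gene names, term identifiers and the threshold are hypothetical toy data, and the function only counts shared terms; it is not a statistical enrichment test.

```python
from collections import Counter

# Hypothetical toy annotations: gene -> set of controlled-vocabulary terms.
ANNOTATIONS = {
    "geneA": {"term:1", "term:2"},
    "geneB": {"term:1"},
    "geneC": {"term:1", "term:3"},
    "geneD": {"term:3"},
}


def term_counts(genes, annotations):
    """Count how many genes in the list carry each term."""
    counts = Counter()
    for gene in genes:
        counts.update(annotations.get(gene, set()))
    return counts


def shared_terms(genes, annotations, min_fraction=0.75):
    """Return terms annotated to at least min_fraction of the gene list."""
    genes = list(genes)
    counts = term_counts(genes, annotations)
    return sorted(t for t, n in counts.items() if n / len(genes) >= min_fraction)


hits = shared_terms(["geneA", "geneB", "geneC"], ANNOTATIONS)
```

The same question asked of wiki pages would require reading each page, since free-text descriptions of the same process rarely match string for string.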
9. What Will the Community Curate?
• Data that can be captured by the formal vocabularies used in PomBase
  • GO (including extensions)
  • Protein modifications (including residue information)
  • Phenotypes (including alleles and conditions)
  • Interactions
• Mostly pre-composed terms
• Extensions will be captured by prompting where relevant
  • I.e. the community will not be expected to know when to use these
10. The Community Annotation Tool - CANTO
• Final stages of development
• Developed by Kim Rutherford
• Already in use by the PomBase curators
• We are involving the community at this stage through review of curated (recent) publications
• Provides a web-based interface
• Can be used as a stand-alone application (provides annotations in GAFs)
• Pipelines are in place for direct loading into Chado
  • Chado (GMOD project) is a database schema for handling biological data
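For readers unfamiliar with GAF, the sketch below writes one annotation as a tab-separated line with the 17 columns of GAF 2.0 as I understand that format. It is not the tool's actual exporter: the column key names, the gene identifier, the reference, the date and the taxon value are placeholders or assumptions, and a real file would normally use database identifiers rather than a gene symbol inside the extension.

```python
# Column order assumed from the GAF 2.0 layout; the key names are my own.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by", "annotation_extension",
    "gene_product_form_id",
]

GAF_HEADER = "!gaf-version: 2.0"


def gaf_line(record: dict) -> str:
    """Serialise one annotation; columns missing from the record stay empty."""
    return "\t".join(record.get(col, "") for col in GAF_COLUMNS)


record = {
    "db": "PomBase",
    "db_object_id": "GENE_ID_PLACEHOLDER",   # placeholder, not a real identifier
    "db_object_symbol": "geneX",             # hypothetical gene symbol
    "go_id": "GO:0004674",
    "db_reference": "PMID:PLACEHOLDER",      # placeholder reference
    "evidence_code": "IDA",
    "aspect": "F",                           # F = molecular function
    "db_object_type": "protein",
    "taxon": "taxon:4896",                   # assumed taxon ID for S. pombe
    "date": "20120402",                      # placeholder date (YYYYMMDD)
    "assigned_by": "PomBase",
    "annotation_extension": "has_substrate(pom1)",
}

line = gaf_line(record)
output = "\n".join([GAF_HEADER, line])
```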
11. 5 Easy Steps to Broad Curation of Data - A Walk-through
22. Quality Control and Consistency Checking
• Professional curators are needed not just for curation support, but also for quality control and consistency checking.
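Some consistency checks can be automated before a curator looks at a submission. The sketch below is one possible set of such checks, not the checks the PomBase curators actually run: the function name, the dictionary keys and the choice of rules are assumptions, and the evidence-code set is only a subset of the GO evidence codes.

```python
import re

# GO identifiers are "GO:" followed by seven digits.
GO_ID_PATTERN = re.compile(r"GO:\d{7}")

# A subset of GO evidence codes, for illustration only.
KNOWN_EVIDENCE = {"IDA", "IMP", "IGI", "IPI", "IEP", "ISS", "TAS", "NAS", "IC", "ND", "IEA"}


def check_annotation(annotation: dict) -> list:
    """Return a list of problems found in one submitted annotation."""
    problems = []
    if not GO_ID_PATTERN.fullmatch(annotation.get("go_id", "")):
        problems.append("malformed GO ID")
    if annotation.get("evidence") not in KNOWN_EVIDENCE:
        problems.append("unknown evidence code")
    if not annotation.get("source"):
        problems.append("missing source")
    return problems


good = {"go_id": "GO:0004674", "evidence": "IDA", "source": "Yoon HJ et al. (2006)"}
bad = {"go_id": "GO:004674", "evidence": "IDA", "source": "Yoon HJ et al. (2006)"}

good_problems = check_annotation(good)
bad_problems = check_annotation(bad)
```

Checks like these catch formatting slips; judging whether the chosen term actually matches the experiment still needs a professional curator.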
26. Benefits of Community Curation
• Researchers can curate ‘from home’ immediately following publication
• First-pass annotations quickly obtained – data will quickly appear in the database
• Expert knowledge, coupled with quality control by curators, makes for powerful, accurate annotations
• Controlled annotations can be loaded from the tool directly into our database
• Bottleneck is how quickly professional curators can check annotations, not how fast we can obtain them
• Frees up time for us to clear the backlog of papers
27. Benefits to the Researcher
• Greater visibility of publication
  • Annotations propagated to GO, BioGRID, Ensembl, NCBI, UniProt…
  • Increased citation index?
• A greater understanding of ontologies
  • Will be able to use them better to support their research
28. Future Directions
• ~3 months until official launch of CANTO
  • Multi-gene phenotypes
  • Extensions (restricted usage for specific terms and relations)
  • More help features and descriptive boxes
• Longer term
  • Making the tool easily configurable for other organisms
  • Making the tool available to other communities
29. Acknowledgements
• The PomBase team:
  • Val Wood
  • Midori Harris
  • Kim Rutherford
  • Mark McDowall
  • Antonia Lock
• PIs:
  • Jurg Bahler (UCL)
  • Steve Oliver (Cambridge)
  • Paul Kersey (EBI Hinxton)
• Funded by the Wellcome Trust