Presented at the ACS in Dallas: ChEBI is a database and ontology of chemical entities of biological interest, organised into a structure-based and role-based classification hierarchy. Each entry is extensively annotated with a name, definition and synonyms, other metadata such as cross-references, and chemical structure information where appropriate. In addition to the classification hierarchy, the ontology also contains diverse chemical and ontological relationships. While ChEBI is primarily manually maintained, recent developments have focused on improvements in curation through partial automation of common tasks. We will describe a pipeline we have developed for structure-based classification of chemicals into the ChEBI structural classification. The pipeline connects class-level structural knowledge encoded in Web Ontology Language (OWL) axioms as an extension to the ontology, and structural information specified in standard MOLfiles. We make use of the Chemistry Development Kit, the OWL API and the OWLTools library. Harnessing the pipeline, we are able to suggest the best structural classes for the classification of novel structures within the ChEBI ontology.
Pipeline for automated structure-based classification in the ChEBI ontology
1. Pipeline for automated structure-based classification in the ChEBI ontology
Janna Hastings
Coordinator,
Cheminformatics and Metabolism
www.ebi.ac.uk/chebi
ACS Symposium on Chemical Ontologies,
Taxonomies and Schemas. Dallas, 16 March 2014
2. Chemical Entities of Biological Interest
• Freely available online, available for download in full
• Low molecular weight, i.e. no proteins
• Definitions, relationships, hierarchy
• E.g. metabolites, drugs, pesticides
• 38,215 entries in the last release
3. What does ChEBI provide?
• Chemical structures and visualisations
• Names and synonyms: caffeine; 1,3,7-trimethylxanthine; methyltheobromine
• Chemical data: Formula: C8H10N4O2; Charge: 0; Mass: 194.19
• Ontology classifications: metabolite; CNS stimulant; trimethylxanthines
• Links to more information: MSDchem: CFF; KEGG DRUG: D00528; PubMed citations
• Chemical Informatics:
InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
SMILES CN1C(=O)N(C)c2ncn(C)c2C1=O
8. Challenges with manual classification
• May be incomplete
• May be inconsistent
• Difficult to maintain (even with extensive use of
computationally expensive automatic validations)
• Blocks automatic loading of otherwise high-quality
externally annotated chemical data into ChEBI
(as no classification available)
9. SOCO (SMARTS, OWL)
Leonid Chepelev, Michel Dumontier, collaborators
• Given a training set of classified molecules, examine
structures for consensus features across all (using
fragmentation and feature detection)
• Capture features hierarchically
• Use OWL to classify
Chepelev et al. BMC Bioinformatics 2012 13:3 doi:10.1186/1471-2105-13-3
10. Limitations of SOCO
• No support for negation
• Only “min” (at least) counting supported, not max or
exact. Thus, dicarboxylic acid is_a monocarboxylic acid
(Every two-legged human is also a one-legged human in the sense
that they have at least one leg…)
• SMARTS is powerful – but not very human-readable.
ChEBI is for human biologist and chemist consumption.
E.g. SMARTS for the class of aliphatic amines:
[$([NH2][CX4]),$([NH]([CX4])[CX4]),$([NX3]([CX4])([CX4])[CX4])]
Can we do better at making definitions accessible?
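The counting limitation on this slide can be illustrated with a small sketch. It is not SOCO code: the `CountRestriction` class, the `carboxy_group` part name and the part-count dictionary are all hypothetical, chosen only to show why "at least one" makes every dicarboxylic acid also a monocarboxylic acid, while an exact count does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountRestriction:
    """A cardinality restriction on how often a part occurs (hypothetical helper)."""
    part: str
    kind: str  # "min", "max" or "exact"
    n: int

    def holds(self, part_counts: dict) -> bool:
        count = part_counts.get(self.part, 0)
        if self.kind == "min":
            return count >= self.n
        if self.kind == "max":
            return count <= self.n
        if self.kind == "exact":
            return count == self.n
        raise ValueError(f"unknown restriction kind: {self.kind}")


# Assumed part counts for some dicarboxylic acid (illustrative only).
dicarboxylic = {"carboxy_group": 2}

mono_min = CountRestriction("carboxy_group", "min", 1)      # "at least one"
mono_exact = CountRestriction("carboxy_group", "exact", 1)  # "exactly one"
di_exact = CountRestriction("carboxy_group", "exact", 2)    # "exactly two"

print(mono_min.holds(dicarboxylic))    # True: min-only counting misclassifies
print(mono_exact.holds(dicarboxylic))  # False: exact counting separates the classes
print(di_exact.holds(dicarboxylic))    # True
```

With only the "min" form available, the first result is unavoidable; the exact form is what allows the two classes to be kept apart.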
11. A new pipeline for automated structure-based ontology classification in ChEBI
• Inputs: Definitions (OWL), ChEBI structures, novel structure
• OWL Parser => logical cheminformatics definitions
• Matching => candidate classes
• Ranking => best classes: save is_a relations
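The flow on this slide (matching a novel structure against parsed definitions, then ranking the candidate classes) might be sketched as below. This is an illustration under stated assumptions, not the ChEBI implementation: the slide does not say how ranking works, so the sketch assumes that the "best" classes are the most specific candidates, i.e. those with no matched descendant. The `match` and `rank` functions, the feature dictionary and the three-class hierarchy are hypothetical.

```python
from typing import Callable, Dict, List, Set

Structure = Dict[str, int]                # assumed feature summary of a structure
Definition = Callable[[Structure], bool]  # a parsed class definition as a predicate


def match(structure: Structure, definitions: Dict[str, Definition]) -> Set[str]:
    """Return every class whose definition the structure satisfies."""
    return {name for name, test in definitions.items() if test(structure)}


def ancestors(cls: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect all transitive is_a parents of a class."""
    seen: Set[str] = set()
    stack = list(parents.get(cls, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(parents.get(parent, []))
    return seen


def rank(candidates: Set[str], parents: Dict[str, List[str]]) -> List[str]:
    """Keep only candidates that are not an ancestor of another candidate.

    Assumption: 'best' means most specific; the source does not define it.
    """
    covered: Set[str] = set()
    for cls in candidates:
        covered |= ancestors(cls, parents)
    return sorted(candidates - covered)


# Illustrative definitions and hierarchy (not taken from the ontology itself).
definitions: Dict[str, Definition] = {
    "molecular entity": lambda s: True,
    "cyclic compound": lambda s: s.get("cycles", 0) >= 1,
    "monocyclic compound": lambda s: s.get("cycles", 0) == 1,
}
parents = {
    "cyclic compound": ["molecular entity"],
    "monocyclic compound": ["cyclic compound"],
}

novel = {"cycles": 1}
candidates = match(novel, definitions)
best = rank(candidates, parents)
print(best)  # ['monocyclic compound']
```

Saving only the most specific classes as is_a relations avoids asserting parents that the hierarchy already implies.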
12. Human-readable definitions, mapped to structures in ChEBI knowledgebase
thiadiazoles: molecular_entity and has_part some ( 1,2,3-thiadiazole or 1,2,4-thiadiazole or 1,2,5-thiadiazole or 1,3,4-thiadiazole )
diterpenoid: organic_molecular_entity and has_part exactly 2 terpenoid
organic ion: organic_molecular_entity and ( has_charge some int[>0] or has_charge some int[<0] )
monocyclic compound: molecular_entity and has_cycles value "1"^^int
Constructs used in the definitions:
• Logical operators
• Counts (min, max and exact)
• Properties
• Parts
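To show how the four kinds of construct on this slide (logical operators, counts, properties, parts) combine, here is a minimal sketch that re-expresses the four example definitions as composable Python predicates. It is not the OWL parser from the pipeline: the `Entity` record and all helper functions are hypothetical, and the definitions are evaluated directly against known facts about one structure rather than by an OWL reasoner.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Assumed summary of one structure: its types, parts, charge and ring count."""
    types: set
    parts: Counter = field(default_factory=Counter)
    charge: int = 0
    cycles: int = 0


# Logical operators
def all_of(*tests):
    return lambda e: all(t(e) for t in tests)

def any_of(*tests):
    return lambda e: any(t(e) for t in tests)

def is_a(type_name):
    return lambda e: type_name in e.types

# Parts and counts
def has_part_some(*parts):
    return lambda e: any(e.parts[p] > 0 for p in parts)

def has_part_exactly(n, part):
    return lambda e: e.parts[part] == n

# Properties
def has_charge(pred):
    return lambda e: pred(e.charge)

def has_cycles_value(n):
    return lambda e: e.cycles == n


thiadiazoles = all_of(
    is_a("molecular_entity"),
    has_part_some("1,2,3-thiadiazole", "1,2,4-thiadiazole",
                  "1,2,5-thiadiazole", "1,3,4-thiadiazole"),
)
diterpenoid = all_of(is_a("organic_molecular_entity"),
                     has_part_exactly(2, "terpenoid"))
organic_ion = all_of(
    is_a("organic_molecular_entity"),
    any_of(has_charge(lambda c: c > 0), has_charge(lambda c: c < 0)),
)
monocyclic_compound = all_of(is_a("molecular_entity"), has_cycles_value(1))

# Hypothetical structure: a singly charged anion with one 1,2,4-thiadiazole ring.
example = Entity(
    types={"molecular_entity", "organic_molecular_entity"},
    parts=Counter({"1,2,4-thiadiazole": 1}),
    charge=-1,
    cycles=1,
)

print(thiadiazoles(example))         # True
print(diterpenoid(example))          # False
print(organic_ion(example))          # True
print(monocyclic_compound(example))  # True
```

Each predicate reads close to the corresponding definition text, which is the accessibility goal the slide describes.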
13. Planned integration into ChEBI tools
• ChEBI internal data loader and bulk submissions
• ChEBI online submission tool
Pre-population of matched classes
14. Acknowledgements – Thanks!
ChEBI team:
Christoph Steinbeck
Gareth Owen
Adriano Dekker
Namrata Kale
Steve Turner
Venkatesh Muthukrishnan
Collaborators:
Colin Batchelor, RSC
Lian Duan, ETH
Leonid Chepelev, Ottawa
Michel Dumontier, Stanford
Despoina Magka, Oxford
Ilinca Tudose and John May, EBI
Funding:
BBSRC “Continued development of ChEBI towards better usability for the systems biology and metabolic modelling communities” BB/K019783/1
15. Questions?
Thank you for listening!
chebi-help@ebi.ac.uk
Editor's notes
ChEBI is a database and ontology of chemical entities of biological
interest. As of October 2013, it contains more than 35,000 entries,
organised into a structure-based and role-based classification hierarchy.
Each entry is extensively annotated with a name, definition and
synonyms, other metadata such as cross-references, and chemical
structure information where appropriate. In addition to the
classification hierarchy, the ontology also contains diverse chemical
and ontological relationships. While ChEBI is primarily manually
maintained, recent developments have focused on improvements in
curation through partial automation of common tasks. We will describe
a pipeline we have developed for structure-based classification of
chemicals into the ChEBI structural classification. The pipeline
connects class-level structural knowledge encoded in Web Ontology
Language (OWL) axioms as an extension to the ontology, and structural
information specified in standard MOLfiles. We make use of the
Chemistry Development Kit, the OWL API and the OWLTools library.
Harnessing the pipeline, we are able to suggest the best structural classes
for the classification of novel structures within the ChEBI ontology.