Subliminal: exploiting semantic annotations in the reconstruction of metabolic networks
1. This work has been supported by the BBSRC/EPSRC grant: the Manchester Centre for Integrative Systems Biology
Neil Swainston
Manchester Centre for Integrative Systems Biology, University of Manchester, Manchester M1 7ND, UK
Introduction
Metabolic network reconstructions have been developed at an increasing rate in recent years. They now cover a range of organisms and have been
applied to a number of research topics including metabolic engineering, genome annotation, evolutionary studies, network property analysis, and
interpretation of omics datasets1.
The process of developing such reconstructions is now well defined and is recognised as being time-consuming2. While many of the steps associated
with generating a high-quality reconstruction require manual curation, some are amenable to automation, making it possible to generate a draft
reconstruction automatically for use in subsequent manual curation3.
The importance of using standard representations such as SBML4 and the MIRIAM standard5 has been recognised6, with the development of
reconstructions in which all components are semantically annotated with unambiguous database identifiers greatly facilitating their use by third
parties.
However, to date, the use of semantic annotations has focused on the usability of the reconstruction after publication. Subliminal
is a toolbox that exploits semantic annotations during the reconstruction process, using libAnnotationSBML7 and web service
interfaces to external databases such as ChEBI8 and UniProt9 to retrieve chemical and protein data. These data can be used to automate
chemical protonation state determination, reaction mass / charge balancing and enzyme (and reaction) localisation.
Initial pre-draft pathways: KEGG2SBML and other sources
Initial pre-draft pathways for a given organism are generated using the existing KEGG2SBML10 tool. KEGG2SBML
generates SBML files representing individual metabolic pathways, which are then enhanced by the addition of semantic
annotations, such as references to ChEBI and UniProt ids for metabolites and enzymes respectively, and EC terms.
Subsequent work will focus on generating additional pathways from MetaCyc11 and genome sequences.
Model merging: pre-draft reconstruction
As each of the initial pre-draft pathways, irrespective of its source, is semantically annotated with comparable terms,
the pathways can be merged automatically to generate a pre-draft reconstruction in which duplicate metabolites, enzymes and
reactions are removed.
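The merging step can be illustrated with a minimal sketch. It assumes each pathway has already been reduced to dictionaries keyed by annotation identifier; the `Pathway` class, the `merge_pathways` function and the placeholder ids (`met:1`, `enz:1`, ...) are hypothetical and are not part of Subliminal or libAnnotationSBML.

```python
from dataclasses import dataclass, field


@dataclass
class Pathway:
    """Hypothetical, simplified stand-in for an annotated SBML pathway."""
    # metabolite annotation id (e.g. a ChEBI id) -> display name
    metabolites: dict = field(default_factory=dict)
    # enzyme annotation id (e.g. a UniProt id) -> display name
    enzymes: dict = field(default_factory=dict)
    # each reaction: (frozenset of substrate ids, frozenset of product ids)
    reactions: set = field(default_factory=set)


def merge_pathways(pathways):
    """Merge pathways, treating entries with the same annotation id as duplicates."""
    merged = Pathway()
    for pathway in pathways:
        for key, name in pathway.metabolites.items():
            merged.metabolites.setdefault(key, name)  # first name seen is kept
        for key, name in pathway.enzymes.items():
            merged.enzymes.setdefault(key, name)
        merged.reactions |= pathway.reactions  # identical reactions collapse
    return merged


# Two toy pathways sharing one metabolite (met:2) and one reaction.
step = (frozenset({"met:1"}), frozenset({"met:2"}))
first = Pathway({"met:1": "glucose", "met:2": "G6P"}, {"enz:1": "hexokinase"}, {step})
second = Pathway(
    {"met:2": "glucose 6-phosphate", "met:3": "6PGL"},
    {"enz:2": "G6PDH"},
    {step, (frozenset({"met:2"}), frozenset({"met:3"}))},
)
merged = merge_pathways([first, second])
```

Because identity is decided by the annotation rather than by the display name, "G6P" and "glucose 6-phosphate" are recognised as the same metabolite in this sketch.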
Protonation state prediction
Automated acquisition from the ChEBI database of the InChI12 (or SMILES) string representing each metabolite allows
the protonation state of the metabolite at a given pH to be predicted using cheminformatic resources such as the Chemistry
Development Kit (CDK)13.
Reaction mass / charge balancing
By acquiring the chemical formula and charge of each metabolite from the ChEBI database, each reaction can be
represented as a matrix, A, containing the elemental composition and charge of each reactant and product. The vector, b,
represents the stoichiometric coefficients of each reactant and product. Mixed integer linear programming can be applied to
solve Ab = 0, producing a vector of stoichiometric coefficients to be applied to each reactant and product. Commonly absent
species, such as water, protons and CO2, can also be considered, allowing previously unbalanceable reactions (for example,
from KEGG) to be balanced automatically.
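The balancing problem can be sketched as follows. This is not the mixed integer linear programming formulation itself: for brevity, an exhaustive search over small integer coefficients is swapped in, minimising the sum of absolute coefficients. The function names, the sign convention (reactants negative, products positive) and the `max_coeff` bound are assumptions made for illustration.

```python
from itertools import product

# Commonly absent species that may be added to either side: (formula, charge).
OPTIONAL = {
    "H2O": ({"H": 2, "O": 1}, 0),
    "H+": ({"H": 1}, 1),
    "CO2": ({"C": 1, "O": 2}, 0),
}


def build_matrix(species):
    """Build A: one row per element plus a final charge row, one column per species."""
    elements = sorted({e for formula, _ in species for e in formula})
    rows = [[formula.get(e, 0) for formula, _ in species] for e in elements]
    rows.append([charge for _, charge in species])
    return rows


def balance(reactants, products, max_coeff=3):
    """Return non-zero signed coefficients b with A b = 0, or None if none is found.

    Assumed convention: reactants are negative, products positive; optional
    species may take either sign or be left out (coefficient zero).
    """
    optional = {n: s for n, s in OPTIONAL.items()
                if n not in reactants and n not in products}
    names = list(reactants) + list(products) + list(optional)
    species = [*reactants.values(), *products.values(), *optional.values()]
    matrix = build_matrix(species)
    ranges = ([range(-max_coeff, 0)] * len(reactants)
              + [range(1, max_coeff + 1)] * len(products)
              + [range(-max_coeff, max_coeff + 1)] * len(optional))
    best = None
    for b in product(*ranges):  # brute force stand-in for the MILP solver
        if all(sum(a * x for a, x in zip(row, b)) == 0 for row in matrix):
            if best is None or sum(map(abs, b)) < sum(map(abs, best)):
                best = b
    if best is None:
        return None
    return {n: c for n, c in zip(names, best) if c != 0}


# Pyruvate decarboxylation written without its proton and CO2.
result = balance(
    {"pyruvate": ({"C": 3, "H": 3, "O": 3}, -1)},
    {"acetaldehyde": ({"C": 2, "H": 4, "O": 1}, 0)},
)
print(result)  # {'pyruvate': -1, 'acetaldehyde': 1, 'H+': -1, 'CO2': 1}
```

In this example the search adds a proton on the reactant side and CO2 on the product side, which is the kind of correction described above for incompletely specified reactions. An exhaustive search scales poorly, so a real solver would be needed for reactions with many species.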
Protein localisation
With each enzyme being annotated with UniProt terms, the UniProt web services can be queried to automatically acquire
each protein sequence. These sequences can be passed to protein cellular location prediction algorithms such as PSORT14 in
order to predict the subcellular location of the enzyme and, by implication, of the reaction(s) that it catalyses.
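Sequence retrieval can be sketched with the Python standard library. The URL pattern is that of the present-day UniProt REST service and is an assumption here; the web service interface available when this work was carried out may have differed. The helper names are hypothetical, and the localisation prediction step itself is not shown.

```python
from urllib.request import urlopen

# Assumption: current UniProt REST endpoint for a single entry in FASTA format.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    return lines[0][1:], "".join(lines[1:])


def fetch_sequence(accession, timeout=30):
    """Download the sequence for one UniProt accession (requires network access)."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urlopen(url, timeout=timeout) as response:
        return parse_fasta(response.read().decode("utf-8"))[1]
```

A call such as `fetch_sequence(accession)` for each UniProt id annotated on an enzyme would return the sequence string, which could then be submitted to a localisation predictor.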
Future directions
While individual steps in the reconstruction process are amenable to automation, it is recognised that gap-filling, manual curation and validation
are essential steps in generating a high-quality reconstruction. Semantic annotations can further aid the validation process through automated
harvesting of chemical synonyms which can be fed to text-mining tools such as PathText15 in order to simplify the arduous, but necessary, task
of finding evidence for present (and missing) reactions in the literature.
1Applications of genome-scale metabolic reconstructions. Oberhardt MA, Palsson BØ, Papin JA. Mol Syst Biol. (2009) 5:320.
2A protocol for generating a high-quality genome-scale metabolic reconstruction. Thiele I, Palsson BØ. Nat Protoc. (2010) 5, 93-121.
3High-throughput generation, optimization and analysis of genome-scale metabolic models. Henry CS, DeJongh M, et al. Nat Biotechnol. (2010) 28, 977-82.
4The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Hucka M, Finney A, et al. Bioinformatics. (2003) 19, 524-31.
5Minimum information requested in the annotation of biochemical models (MIRIAM). Le Novère N, Finney A, et al. Nat Biotechnol. (2005) 23, 1509-15.
6A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Herrgård MJ, Swainston N, et al. Nat Biotechnol. (2008) 26, 1155-60.
7libAnnotationSBML: a library for exploiting SBML annotations. Swainston N, Mendes P. Bioinformatics. (2009) 25, 2292-3.
8ChEBI: a database and ontology for chemical entities of biological interest. Degtyarenko K, de Matos P, et al. Nucleic Acids Res. (2008) 36, D344-50.
9The Universal Protein Resource (UniProt) in 2010. UniProt Consortium. Nucleic Acids Res. (2010) 38, D142-8.
10http://sbml.org/Software/KEGG2SBML/
11The EcoCyc and MetaCyc databases. Karp PD, Riley M, et al. Nucleic Acids Res. (2000) 28, 56-9.
12http://www.iupac.org/inchi/
13The Chemistry Development Kit (CDK): an open-source Java library for Chemo- and Bioinformatics. Steinbeck C, Han Y, et al. J Chem Inf Comput Sci. (2003) 43, 493-500.
14PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Nakai K, Horton P. Trends Biochem Sci. (1999) 24, 34-6.
15PathText: a text mining integrator for biological pathway visualizations. Kemper B, Matsuzaki T, et al. Bioinformatics. (2010) 26, i374-81.