This is a poster for the UK ELIXIR meeting in Birmingham, UK, Nov 2018. It summarises a blog-post https://cdsouthan.blogspot.com/2018/08/an-initial-look-at-elixir-chemistry.html that assesses chemistry <> protein <> papers connectivity (C-P-P) for five ELIXIR resources.
Looking at chemistry - protein - papers connectivity in ELIXIR
1. Chemistry counting across databases
Chemistry totals counting in UniChem
1 Centre for Discovery Brain Sciences, University of Edinburgh, Edinburgh, UK. 2 (currently) TW2Informatics Ltd, Göteborg, Sweden, cdsouthan@gmail.com
Assessing chemistry <> proteins <> papers
connectivity between ELIXIR resources
Introduction
As we know, the utility of ELIXIR is largely determined by connectivity and
interoperability. This can be expressed in different ways including the ability to
computationally query across the same entities between resources and the simple
provision of cross-pointers as live URLs for users to manually navigate between entity
records from different databases.
So how is ELIXIR doing in this respect? This has been addressed in a blog post
https://cdsouthan.blogspot.com/2018/08/an-initial-look-at-elixir-chemistry.html that
assesses chemistry <> protein <> papers connectivity (C-P-P). The post should be
consulted for details since only an outline can be presented in this poster. The starting
point was our own UK ELIXIR resource, the IUPHAR/BPS Guide to PHARMACOLOGY
(GtoPdb), which includes C-P-P capture (see poster by Harding et al. and
http://www.guidetopharmacology.org/). We offer users outlinks and intersects of our
proteins via UniProt cross-references and update our chemistry in PubChem and
UniChem. However, entity overlaps with other ELIXIR resources offer crucial
complementarity for users. Those compared for curated C-P-P are GtoPdb, ChEMBL,
ChEBI, PDBe and, most recently, BRENDA (excepting ChEBI, which auto-maps C-P).
From the pre-computed chemistry intersects that UniChem generates at each release
one can plot informative comparative overlaps. The blog-post shows all five of these
but the example for GtoPdb is shown above. The pattern of overlaps has been
described in our NAR paper (PMID: 29149325). Note the overlap is highest for PubChem
because we are a submitting source, although minor chemistry rule differences remain.
Protein intersects
The easiest way to intersect proteins is via the UniProt cross-references, although
these are not available for ChEBI. The Venn diagram above shows selections of
Human Swiss-Prot x-refs for the other four sources. Some of the divergence is
explicable (e.g. the other three sources do not curate PDB proteins that have no
reported chemical interactions). Note also the mappings are not all for small molecules
(e.g. the ChEMBL and GtoPdb x-refs include antibody and large peptide interactions).
Unique or 2-way overlaps can be cross-curation opportunities to increase coverage.
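The Venn counts described above can be reproduced from plain accession lists. The following is a minimal sketch, assuming each source's Human Swiss-Prot cross-references have already been downloaded as a set of UniProt accessions. The function name `venn_regions` and the accession values are hypothetical placeholders, not data from the poster.

```python
from collections import Counter

def venn_regions(sources):
    """Count accessions in each exclusive Venn region.

    `sources` maps a source name to a set of accessions. The result is a
    Counter keyed by the frozenset of source names that share an accession,
    so each accession is counted in exactly one region.
    """
    membership = {}
    for name, accessions in sources.items():
        for acc in accessions:
            membership.setdefault(acc, set()).add(name)
    return Counter(frozenset(names) for names in membership.values())

# Made-up placeholder accessions, purely for illustration
xrefs = {
    "GtoPdb": {"acc1", "acc2", "acc3"},
    "ChEMBL": {"acc2", "acc3", "acc4"},
    "PDBe": {"acc3", "acc5"},
    "BRENDA": {"acc4", "acc6"},
}

regions = venn_regions(xrefs)
for names, count in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
    print(" & ".join(sorted(names)), count)
```

Regions keyed by a single source name are the unique sets, and those keyed by two names are the 2-way overlaps that could be followed up as cross-curation candidates.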
Christopher Southan1,2
Publication intersects in Europe PubMed Central (EPMC)
For curated C-P-P resources it is useful to compare which papers have been selected
for chemistry extraction (even though it is more difficult to discern “why”). In EPMC the
Data Links and Data Citations queries (HAS_CHEMBL:y) and (HAS_PDB:y) worked
cleanly. However, there was some ambiguity for (HAS_CHEBI:y). It turns out,
unfortunately, that these are papers with a term match to ChEBI entries rather than
papers that ChEBI curated to extract chemical entries from. Neither GtoPdb nor
BRENDA is a current data link (GtoPdb intend to address this but in the interim lists of
papers they have curated chemistry from can be obtained via PubMed > PubMed). The
curation selectivity underlying the capture divergence is worthy of further investigation.
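The query flags quoted above can also be combined programmatically. This is a minimal sketch, assuming the Europe PMC REST search endpoint and its `query`, `format` and `pageSize` parameters. The helper names `epmc_query_url` and `both` are hypothetical, and the block only builds the URL rather than calling the service.

```python
from urllib.parse import urlencode

# Europe PMC REST search endpoint (assumed to be current; check the EPMC docs)
EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def both(*flags):
    """Combine data-link flags so that only papers carrying all of them match."""
    return " AND ".join(f"({flag})" for flag in flags)

def epmc_query_url(query, page_size=1):
    """Build a search URL; a small page size suffices when only counts are needed."""
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{EPMC_SEARCH}?{urlencode(params)}"

# Papers linked to both ChEMBL and PDB records
url = epmc_query_url(both("HAS_CHEMBL:y", "HAS_PDB:y"))
print(url)
```

Fetching the URL should return JSON in which, as far as we are aware, a `hitCount` field gives the size of the intersect. The same caveat applies as in the text: a (HAS_CHEBI:y) count would reflect term matches, not curated papers.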
Chemistry intersects in PubChem
PubChem offers powerful “slice ‘n dice” options to compare 600+ sources. Of our five,
BRENDA and PDBe are not submitters but we can use the NCBI Structure (ligands
extracted from PDB) to substitute for the latter (n.b. 4-way Venn intersects are difficult
from the interface so only a 3-way is shown). Reasons for the wide divergence of
ELIXIR chemistry seen above can be partially but not entirely explained (see blog-post).
Conclusions
• This intra-ELIXIR comparative analysis was more difficult than it should have been
• One reason is that these databases have independently diverged over decades into
their utility niches with little (pre-ELIXIR) consideration of interoperability
• The exercise turned out to be peculiarly “gapped” in that it was not possible to do
standardized C-P-P x-mappings between all five, there was always at least one odd-
man-out
• Some of this could be easily addressed, for example by having the PMIDs of the papers
that GtoPdb, ChEBI and BRENDA curated/extracted C-P from indexed in EPMC
• Another enhancement would be to harmonise chemistry submissions to both
UniChem and PubChem (e.g. for PDBe ligands and BRENDA compounds)
• The 37% unique chemistry in BRENDA may represent valuable capture but this
needs to be checked
• More technical dialogue between ELIXIR resources with entities-in-common would
be valuable (e.g. to cogitate on causes of divergent capture, pragmatic
interoperability assessments, collaborative curation and future RDF cross-testing)
• The C-P-P analysis is extendable (e.g. to the new ELIXIR 3D-BioInfo initiative)
• While ELIXIR Training is progressing and resources have good Help and FAQ these
results indicate an unmet need for “comparative exploitation guides” even for just C-
P-P. For example users need to know not only “what's in one but not t’other and
why?” but also “which permutations of these five, and/or others, should I use for
what?” (for chemistry see PMID: 29451740)
The EBI UniChem database provides
chemical structure cross-indexing between
39 sources that include the five compared
here. For comparison PubChem,
SureChEMBL (patents) and Human
Metabolites (HMDB) are shown on the right.
Counts refer to InChIKeys. The % unique
values are for that source, out of the 128 million
InChIKeys in the 11 Nov release that includes PubChem
(some are slightly different from the August
blog-post). This unique content is significant
for BRENDA, HMDB and PDBe.
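For readers who want to recompute a "% unique" figure from their own downloads, the following is a minimal sketch. It assumes each source is available as a set of InChIKeys and treats a key as unique when no other compared source contains it. The function name `percent_unique` and the key values are hypothetical placeholders, and this is not necessarily how UniChem derives its own statistics.

```python
from collections import Counter

def percent_unique(sources):
    """Return, per source, the percentage of its keys found in no other source."""
    # Number of sources in which each key occurs
    seen = Counter(key for keys in sources.values() for key in set(keys))
    result = {}
    for name, keys in sources.items():
        keys = set(keys)
        unique = sum(1 for key in keys if seen[key] == 1)
        result[name] = 100.0 * unique / len(keys) if keys else 0.0
    return result

# Made-up placeholder keys, purely for illustration
inchikeys = {
    "SourceA": {"key1", "key2", "key3", "key4"},
    "SourceB": {"key3", "key4"},
    "SourceC": {"key4", "key5"},
}

unique_pct = percent_unique(inchikeys)
print(unique_pct)
```

Note that the result depends on which sources are included in the comparison, which is why the figures quoted here are tied to a specific release that includes PubChem.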