
CREDO: A comprehensive resource for Structural
Interactomics and Drug Discovery
Adrian Schreyer
Department of Biochemistry, University of Cambridge

Outline of the talk
1 Introduction

Introduction
What is CREDO?
(Very) brief summary
Contains the interactions between all molecules found in
experimentally-determined biological assemblies
Also contains intramolecular interactions of these molecules
Contacts are represented as Structural Interaction Fingerprints (SIFts)
Contains a sequence-to-structure mapping to integrate protein
sequence data
External resources are integrated to annotate data in CREDO
Complete cheminformatics toolkits (OpenEye, RDKit)
Python Application-Programming Interface (API)

Introduction
Database statistics
From CREDO release 2013.1.2
86,903 PDB entries
128,776 biological assemblies
607,505 protein-ligand interactions (not the total number of small
molecules)
266,062 protein-protein interfaces, 17,793 protein-nucleic acid grooves
20 carbohydrate chains!
1,166,380,424 contacts

Structural interactions Structural Interaction Fingerprints (SIFts)
Outline
2 Structural interactions
Structural Interaction Fingerprints (SIFts)
Aromatic ring interactions
Ligand-ligand interactions
Data Validation

Atom and contact types
Atom types are identified using SMARTS patterns
Contact types are assigned based on a combination of atom types and
geometrical constraints which have to be fulfilled
Charges (ionisation states) are not required to determine ionic
contacts
Multiple contact types possible but at least one type must be present
12 interatomic interaction types
9 ring-ring interaction geometries
4 ring-atom interaction types
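As a rough illustration of how atom types and geometrical constraints could be combined, the sketch below assigns contact types to a pair of atoms. The atom type labels, the contact type names, the distance cut-offs and the fallback "proximal" type are all assumptions chosen for illustration and are not CREDO's actual definitions. Ionic contacts are derived from ionisable atom types, so no explicit charges are needed.

```python
import math

# Hypothetical distance cut-offs in angstroms (illustrative values only).
HBOND_MAX = 3.5
IONIC_MAX = 4.0
HYDROPHOBIC_MAX = 4.5
CONTACT_MAX = 4.5


def contact_types(atom_a, atom_b):
    """Return the set of contact types formed by two atoms.

    Each atom is a dict with 'xyz' (a 3-tuple) and 'types' (a set of
    atom type labels, e.g. assigned beforehand by SMARTS matching).
    """
    d = math.dist(atom_a["xyz"], atom_b["xyz"])
    if d > CONTACT_MAX:
        return set()  # too far apart to be a contact at all

    ta, tb = atom_a["types"], atom_b["types"]

    def pair(x, y):
        # True if one atom carries type x and the other carries type y.
        return (x in ta and y in tb) or (y in ta and x in tb)

    found = set()
    if d <= HBOND_MAX and pair("hbond_donor", "hbond_acceptor"):
        found.add("hbond")
    if d <= IONIC_MAX and pair("pos_ionisable", "neg_ionisable"):
        found.add("ionic")
    if d <= HYDROPHOBIC_MAX and "hydrophobe" in ta and "hydrophobe" in tb:
        found.add("hydrophobic")
    if not found:
        found.add("proximal")  # every contact gets at least one type
    return found
```

A pair of atoms can receive several types at once, for example a hydrogen bond that is also an ionic contact.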

Structural interactions Aromatic ring interactions

Aromatic ring interaction geometries

Atom-aromatic ring interactions
π-electrons as atom type
Delocalised π-electron cloud of aromatic ring systems creates negative
charge on both faces
Can act as hydrogen bond acceptor and negatively ionisable group
Distance- and geometry-dependent
Interaction types
π-donor: with hydrogen bond donors
π-cation: with positively ionisable groups
π-carbon: with weak hydrogen bond donors
π-halogen: weak hydrogen bonds with halogens in a head-on
orientation
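The distance and geometry dependence can be sketched as follows: the atom must lie close to the ring centroid and roughly above the ring face, which is measured by the angle between the ring normal and the centroid-to-atom vector. The cut-offs (4.5 Å, 30°) and the atom type labels are assumptions for illustration, not the values used in CREDO, and the ring is assumed to be planar.

```python
import math

# Hypothetical mapping from atom type label to atom-ring interaction type.
PI_TYPES = {
    "hbond_donor": "pi-donor",
    "pos_ionisable": "pi-cation",
    "weak_hbond_donor": "pi-carbon",
    "halogen": "pi-halogen",
}


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(sum(x * x for x in a))


def ring_geometry(ring_atoms, atom_xyz):
    """Return (distance to ring centroid, angle to ring normal in degrees)."""
    n = len(ring_atoms)
    centroid = tuple(sum(c[i] for c in ring_atoms) / n for i in range(3))
    # Normal of a planar ring from two centroid-to-atom vectors.
    normal = _cross(_sub(ring_atoms[0], centroid), _sub(ring_atoms[1], centroid))
    v = _sub(atom_xyz, centroid)
    dist = _norm(v)
    cos_theta = abs(sum(p * q for p, q in zip(normal, v))) / (_norm(normal) * dist)
    # abs() folds both ring faces onto the range 0-90 degrees.
    theta = math.degrees(math.acos(min(1.0, cos_theta)))
    return dist, theta


def atom_ring_interactions(ring_atoms, atom_xyz, atom_types,
                           max_dist=4.5, max_theta=30.0):
    """Return the set of atom-ring interaction types, empty if geometry fails."""
    dist, theta = ring_geometry(ring_atoms, atom_xyz)
    if dist > max_dist or theta > max_theta:
        return set()
    return {PI_TYPES[t] for t in atom_types if t in PI_TYPES}
```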

Pi-donor example from a drug-target interaction
Human aldose reductase mutant V47I complexed with fidarestat (PDB entry: 2PD9)

Structural interactions Ligand-ligand interactions

Inhibition of Quinone Reductase by Imatinib
The structure of the leukemia drug imatinib bound to human quinone reductase 2 (PDB entry:
3FW1)

Small molecule dimer blocking the p53-MDM2 interaction
Structure of hDM2 with Dimer-Inducing Indolyl Hydantoin RO-2443 (PDB entry: 3VBG)

Structural interactions Data Validation

Validation of structural properties
Structural properties
All atomic data is retained (B-factors, occupancies)
Boolean flags to identify missing/disordered/clashing residues and
atoms
Boolean flags to identify non-standard, modified and mutated amino
acids
Additional properties from mmCIF: resolution, R-factor, R-free, pH
Ligand geometry (angles) can be problematic

Precision of atomic coordinates
Diffraction-component precision index (DPI)
Introduced by Cruickshank to estimate the uncertainty of atomic
coordinates obtained by structural refinement of protein diffraction
data
Introduced to the virtual screening community by Goto
Goto’s formula to calculate DPI
$$\sigma(r, B_{avg}) = 2.2\, N_{atoms}^{1/2}\, V_a^{1/3}\, N_{obs}^{-5/6}\, R_{free}$$
Goto’s formula to calculate theoretical DPI limit
$$\sigma(r, B_{avg}) = 0.22\, (1 + s)^{1/2}\, V_m^{-1/2}\, C^{-5/6}\, R_{free}\, d_{min}^{5/2}$$
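A minimal sketch of both DPI expressions as plain functions. The first uses the form with the asymmetric unit volume raised to the power 1/3, which makes the result a length in Å. The parameter names are assumptions that mirror the symbols in the formulas: the number of atoms in the asymmetric unit, the asymmetric unit volume in Å³, the number of observed reflections, R-free, and for the limit s, V_m, the completeness C and the resolution d_min.

```python
def dpi(n_atoms, v_asu, n_obs, r_free):
    """DPI estimate of coordinate uncertainty (angstroms).

    n_atoms: number of atoms in the asymmetric unit
    v_asu:   volume of the asymmetric unit in cubic angstroms
    n_obs:   number of observed reflections
    r_free:  R-free as a fraction (e.g. 0.25)
    """
    return 2.2 * n_atoms ** 0.5 * v_asu ** (1.0 / 3.0) * n_obs ** (-5.0 / 6.0) * r_free


def dpi_limit(s, v_m, completeness, r_free, d_min):
    """Theoretical DPI limit; arguments follow the symbols s, V_m, C, R_free, d_min."""
    return (0.22 * (1.0 + s) ** 0.5 * v_m ** -0.5
            * completeness ** (-5.0 / 6.0) * r_free * d_min ** 2.5)
```

Both values shrink as more reflections are observed (or completeness rises) and grow with R-free, so structures can be ranked or filtered by the expected precision of their coordinates.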

Missing regions of PDB residues
Visualisation of missing regions and a secondary structure fragment (PDB entry: 2P33)

Protein-ligand interactions Annotation of protein-ligand interactions
Outline
3 Protein-ligand interactions
Annotation of protein-ligand interactions
SIFt clustering

Annotating protein-ligand interactions
Metabolic pathways
EC information is mapped onto protein chains
KEGG data is used to identify metabolites and to link them to
enzymes
Ligands are labelled as substrate, product or cofactor (of the
enzyme)
Drug-target interactions
Approved drugs are identified as well as all other compounds in the
ChEMBL database
Biological target information (UniProt) is taken from ChEMBL and
DrugBank
Drug-target interactions are identified

Ligand affinities and efficiencies
Potency of ligands
Obtained from the latest version of the ChEMBL database
Identified through a combination of document (PubMed), target
(UniProt) and chemistry (UniChem) match
Binding activities and ligand efficiencies (pKd, BEI, SEI) are linked to
ligands where possible
6,848 unique activities for 6,505 unique ligands (28,943 pairs)
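For reference, the efficiency indices named above can be computed from a binding constant with a few lines of code. This sketch uses the commonly cited definitions (BEI as pKd per kDa of molecular weight, SEI as pKd per 100 Å² of polar surface area); the function names are illustrative and do not come from the CREDO API.

```python
import math


def p_kd(kd_molar):
    """Negative decadic logarithm of a dissociation constant given in mol/L."""
    return -math.log10(kd_molar)


def bei(pkd, mol_weight):
    """Binding efficiency index: pKd divided by molecular weight in kDa."""
    return pkd / (mol_weight / 1000.0)


def sei(pkd, polar_surface_area):
    """Surface efficiency index: pKd divided by PSA in units of 100 square angstroms."""
    return pkd / (polar_surface_area / 100.0)
```

For example, a 1 nM ligand (pKd of 9) with a molecular weight of 450 Da and a polar surface area of 90 Å² has a BEI of about 20 and an SEI of about 10.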

Protein-ligand interactions SIFt clustering

Clustering interaction fingerprints
Structural properties
SIFts can be aligned to a given sequence system such as UniProt (or
structural alignments)
These alignments can be used for hierarchical clustering to compare
interactions
In CREDO this is done for all ligands that interact with proteins
2D and 3D similarities are calculated for terminal (leaf) nodes
(always contain two ligands)
Integrated into the website and API, phylogenetic trees can be
visualised and browsed dynamically
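The clustering idea can be sketched with a small average-linkage procedure over aligned fingerprints. Here each SIFt is assumed to be a set of "on" features, for example (alignment position, contact type) pairs, and the distance is one minus the Tanimoto coefficient. Both choices are assumptions for illustration; the actual clustering method and distance used in CREDO may differ.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def cluster_sifts(fps):
    """Average-linkage hierarchical clustering.

    fps maps a ligand name to its set of on-bits (at least one entry).
    Returns the tree as nested 2-tuples with ligand names at the leaves.
    """
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in sorted(fps)]

    def dist(m1, m2):
        # Mean pairwise distance between the members of two clusters.
        total = sum(1.0 - tanimoto(fps[x], fps[y]) for x in m1 for y in m2)
        return total / (len(m1) * len(m2))

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge it.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: dist(clusters[p[0]][1], clusters[p[1]][1]),
        )
        (tree_i, mem_i), (tree_j, mem_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), mem_i + mem_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]
```

The innermost tuples of the returned tree pair up the most similar ligands, which is where 2D and 3D similarities between the two members could then be compared.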

The SIFt tree for CDK2

Protein sequences and variations Sequence-to-structure mapping
Outline
4 Protein sequences and variations
Sequence-to-structure mapping
Structural variations affecting PDB residues and their interactions
Binding site similarity searching

Mapping UniProt sequences to PDB chains
Structure integration with function, taxonomy and sequence
(SIFTS) initiative
Maps UniProt sequences onto PDB residue sequences
Provides further residue level annotation from the IntEnz, GO, Pfam,
InterPro, SCOP, CATH and Pubmed databases
Used to identify modified or mutated amino acids in protein chains
Contains secondary structure information for each residue
Transformed into relational format and linked to all residues in
CREDO

Protein Domains
Mapping protein domains onto protein chains
Protein domain classifications from Pfam, CATH and SCOP are
integrated into CREDO
Mapped to protein chains, ligand binding sites, protein-protein
interfaces etc.
Pfam has the largest coverage by far
5,724 unique Pfam domains

Secondary structure fragments
Implementing secondary structure fragments
The secondary structure information is used to create continuous
fragments of secondary structure elements (SSE) in protein chains
A new fragment starts at every change in secondary structure
along the sequence of a polypeptide chain
Tightly integrated with other CREDO entities
All SSEs interacting with a ligand or across a protein-protein
interface can be retrieved easily
Potential application in the context of peptidomimetic drugs and
biologics
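The fragment definition can be illustrated with a short function that splits a per-residue secondary structure string into continuous runs. The one-letter codes and the 0-based index convention are assumptions for this sketch; CREDO stores the fragments relationally.

```python
from itertools import groupby


def sse_fragments(sec_struct):
    """Split a per-residue secondary structure string into fragments.

    Returns a list of (sse_type, start, end) tuples with 0-based,
    inclusive residue indices. A new fragment starts at every change
    of the secondary structure code.
    """
    fragments = []
    pos = 0
    for sse_type, run in groupby(sec_struct):
        length = sum(1 for _ in run)
        fragments.append((sse_type, pos, pos + length - 1))
        pos += length
    return fragments
```

With the fragment boundaries in hand, residues that contact a ligand can be looked up by index to find the SSEs involved in binding.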

Protein sequences and variations Structural variations affecting PDB residues and their interactions

Structural Variations in CREDO
Identifying variations in protein structures
Mapped onto residues in CREDO through sequence-to-structure
mapping
Can be easily queried and combined with other parameters
Linked to Ensembl disease phenotypes
2,369 phenotypes can be linked to residues in CREDO
Source databases included in Ensembl Variation
dbSNP
Catalogue Of Somatic Mutations In Cancer (COSMIC)
Online Mendelian Inheritance in Man (OMIM)
1000 Genomes

Relevance: drug resistance in cancer
C-KIT tyrosine kinase in complex with Imatinib (PDB entry: 1T46) with T670I
Imatinib-resistant mutation.

Protein sequences and variations Binding site similarity searching

FuzCav: Binding site similarity
The FuzCav algorithm
Alignment-free and very easy to calculate
Based on pharmacophore triplet count to describe a ligand binding
site
Can detect local similarities between binding sites
Performed natively on the server-side with PostgreSQL using
numerical extension (pgeigen)
Various similarity metrics can be used
Calculated for all binding sites in CREDO
Journal of Chemical Information and Modeling 2010 50 (1), 123-135
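Because each binding site is reduced to a fixed-length vector of pharmacophore triplet counts, comparing two sites needs no alignment. The sketch below shows two possible metrics on such count vectors; both are illustrative choices and not necessarily the ones used in CREDO or in the FuzCav publication. The first normalises by the smaller site, so a small site contained in a larger one still scores highly, which is one way to pick up local similarities.

```python
def shared_bin_similarity(a, b):
    """Shared non-zero triplet bins divided by the smaller non-zero bin count."""
    nz_a = {i for i, count in enumerate(a) if count > 0}
    nz_b = {i for i, count in enumerate(b) if count > 0}
    smaller = min(len(nz_a), len(nz_b))
    return len(nz_a & nz_b) / smaller if smaller else 0.0


def count_tanimoto(a, b):
    """Tanimoto coefficient generalised to count vectors (sum of min over sum of max)."""
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0
```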

FuzCav: description of the algorithm

Chemistry and cheminformatics Molecular descriptors
Outline
5 Chemistry and cheminformatics
Molecular descriptors
RECAP fragmentation of chemical components
Cheminformatics

Calculation of physicochemical properties
Conformation-independent
Important to evaluate drug-likeness and filter molecules
Feature counts, tPSA, XLogP, QED, ...
Conformation-dependent
Calculated for all bound ligands and for up to 200 modelled
conformers of each
Solvent-excluded and polar/apolar/total solvent-accessible surface
areas
Radius of gyration, Number of internal contacts
Ultrafast-Shape Recognition (USR) moments as well as USRCAT
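As an example of a conformation-dependent descriptor, the radius of gyration can be computed directly from the atomic coordinates of one conformer. This sketch weights all atoms equally (for instance heavy atoms only), which is an assumption; a mass-weighted variant is also common.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the atoms from their centroid (unweighted)."""
    n = len(coords)
    centroid = tuple(sum(c[i] for c in coords) / n for i in range(3))
    mean_sq = sum(math.dist(c, centroid) ** 2 for c in coords) / n
    return math.sqrt(mean_sq)
```

Extended conformers give larger values than compact ones, so the descriptor changes between the conformers of a single ligand.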

Chemistry and cheminformatics RECAP fragmentation of chemical components

RECAP fragmentation of chemical components
Implementation of the algorithm
The Retrosynthetic Combinatorial Analysis Procedure (RECAP) uses
predefined bond types to cleave molecules into fragments
A hierarchical and exhaustive fragmentation implementation is used
in CREDO
Hierarchy stored in the database and linked to chemical components
New rules have been implemented to optimise fragmentation of
natural products and endogenous compounds
Existing rules have been extended (thioethers, thioesters,...)

Standard RECAP rules

RECAP fragments and ligands
Analysing fragment interactions
RECAP fragments are mapped back onto the ligands and their atoms
of the original chemical components
Therefore it is possible to analyse interactions on the fragment level
Fragments can easily be filtered by their interactions, e.g. contact
type or interactions with specific amino acids
CREDO currently contains two measures to assess the contribution of
a fragment to the interaction as a whole

Fragment Contact Density (FCD)
New measure to calculate fragment contributions
Do all ligand fragments form an equal number of contacts, or does a
single fragment dominate?
Ratio of the contacts per heavy atom of the fragment to the contacts
per heavy atom of the whole ligand
Number of contacts is simply the number of protein atoms within
4.5Å of the fragment
Simple formula to calculate the Fragment Contact Density
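$$\mathrm{FCD} = \frac{N^{\text{Fragment}}_{\text{Contacts}} \,/\, N^{\text{Fragment}}_{\text{Heavy atoms}}}{N^{\text{Ligand}}_{\text{Contacts}} \,/\, N^{\text{Ligand}}_{\text{Heavy atoms}}}$$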
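The ratio translates directly into code. The function and argument names below are illustrative; the contact counts are assumed to have been obtained beforehand, e.g. as the number of protein atoms within 4.5 Å of the fragment or the ligand.

```python
def fragment_contact_density(frag_contacts, frag_heavy_atoms,
                             lig_contacts, lig_heavy_atoms):
    """Contacts per heavy atom of a fragment relative to the whole ligand."""
    if frag_heavy_atoms <= 0 or lig_heavy_atoms <= 0:
        raise ValueError("heavy atom counts must be positive")
    if lig_contacts <= 0:
        raise ValueError("the ligand must form at least one contact")
    frag_density = frag_contacts / frag_heavy_atoms
    lig_density = lig_contacts / lig_heavy_atoms
    return frag_density / lig_density
```

A value above 1 means the fragment forms more contacts per heavy atom than the ligand does on average, and a value below 1 means it contributes less than its size would suggest.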

Visualisation of the FCD
Cysteine aspartyl protease-3 (caspase-3) in complex with a non-peptidic inhibitor (PDB entry:
1NMQ)

Chemistry and cheminformatics Cheminformatics

pgopeneye: database cartridge for cheminformatics
Cheminformatics extension based on the OpenEye toolkits
Implements commonly used cheminformatics routines
Substructure, topological similarity, SMARTS, Murcko scaffolds, etc.
Supports I/O of SMILES, SDF, OEB, IUPAC
Fingerprint similarity metrics use SSE (POPCNT)
Fingerprints can be indexed (GiST): 1.2M fingerprints, ordered result
in less than 100 ms
Very fast MCS search: 6,500 structures in under 100 ms (great with
ChEMBL)
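The reason population counts matter for fingerprint similarity can be shown in a few lines: with fingerprints stored as bit vectors, the Tanimoto coefficient needs only an AND, an OR and two bit counts, which is what a hardware POPCNT instruction accelerates. This is a pure-Python sketch using integers as bit vectors and is not the pgopeneye implementation.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto_bits(a, b):
    """Tanimoto coefficient of two fingerprints stored as integers."""
    union = popcount(a | b)
    # Two empty fingerprints are treated as identical by convention here.
    return popcount(a & b) / union if union else 1.0
```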

USRCAT: real-time USR with pharmacophoric constraints
USRCAT: an extension of USR
USRCAT is an extension of Ultrafast Shape Recognition (USR) that
includes pharmacophoric information into the moments
Outperforms USR significantly in a virtual screening benchmark
(using DUD-E)
Implemented natively into the database: can be used in any SQL
query (limit to specific family | include chemical graph similarity)
Average screening performance of 5.3M conformers (moments) per
second (including sorting)
Currently used with all PDB chemical components and ZINC
drug-like set (12M compounds, 200M+ conformers)
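To show why screening on moments is so fast, here is a minimal sketch of plain USR: each conformer is reduced to 12 numbers (three moments of the atomic distance distributions to four reference points), and two conformers are compared with a simple score over those numbers. The moment convention used here (mean, standard deviation, cube root of the third central moment) is one common choice and is an assumption. USRCAT additionally computes moments over pharmacophoric atom subsets, which is omitted.

```python
import math


def _moments(dists):
    """Mean, standard deviation and signed cube root of the third central moment."""
    n = len(dists)
    mean = sum(dists) / n
    var = sum((d - mean) ** 2 for d in dists) / n
    third = sum((d - mean) ** 3 for d in dists) / n
    return [mean, math.sqrt(var), math.copysign(abs(third) ** (1.0 / 3.0), third)]


def usr_moments(coords):
    """Return the 12 USR moments for a list of (x, y, z) atom coordinates."""
    n = len(coords)
    ctd = tuple(sum(c[i] for c in coords) / n for i in range(3))  # centroid
    cst = min(coords, key=lambda c: math.dist(c, ctd))  # atom closest to centroid
    fct = max(coords, key=lambda c: math.dist(c, ctd))  # atom farthest from centroid
    ftf = max(coords, key=lambda c: math.dist(c, fct))  # atom farthest from fct
    moments = []
    for ref in (ctd, cst, fct, ftf):
        moments.extend(_moments([math.dist(c, ref) for c in coords]))
    return moments


def usr_similarity(m1, m2):
    """Similarity in (0, 1]: inverse of one plus the mean absolute moment difference."""
    diff = sum(abs(a - b) for a, b in zip(m1, m2)) / len(m1)
    return 1.0 / (1.0 + diff)
```

Since the score only involves a handful of subtractions per conformer and no superposition, it can be evaluated over millions of precomputed moment vectors and sorted inside a database query.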

CREDO in the public domain
CREDO Web interface
Web interface
Can be used to browse and search data in CREDO
Biological assemblies can be visualised directly, including visualisation
of contacts and highlighting of mutations (WebGL)
Downloads of selected data sets, e.g. kinases
RESTful Web service
Most resources of the service can be queried programmatically through
GET or POST requests

CREDO in the public domain
CREDO on the web
More information and updates
Web interface: http://www-cryst.bioc.cam.ac.uk/credo
Blog: http://blog.adrianschreyer.com
Twitter: http://twitter.com/credodb
