Abstract
Pharos (https://pharos.nih.gov/) is an integrated web-based informatics platform for the analysis of data aggregated by the Illuminating the Druggable Genome (IDG) Knowledge Management Center, an NIH Common Fund initiative. The current version of Pharos (as of October 2019) spans 20,244 proteins in the human proteome, 19,880 disease and phenotype associations, and 226,829 ChEMBL compounds. This resource not only collates and analyzes data from over 60 high-quality resources to generate these types, but also uses text indexing to find less apparent connections between targets, and has recently begun to collaborate with institutions that generate data and resources. Proteins are ranked according to a knowledge-based classification system, which can help researchers to identify less studied “dark” targets that could be potentially further illuminated. This is an important process for both drug discovery and target validation, as more knowledge can accelerate target identification, and previously understudied proteins can serve as novel targets in drug discovery. In this webinar, Dr. Tudor Oprea will introduce how to use Pharos to find targets of interest for drug discovery.
The top 3 key questions that Pharos can answer:
1. What are the novel drug targets that may play a role in a specific disease?
2. What are the diseases that are related directly or indirectly to a drug target?
3. Which researchers are related directly or indirectly to a drug target?
Presenter: Tudor Oprea, MD, PhD, Professor of Medicine, Chief of Translational Informatics Division & Internal Medicine, University of New Mexico
dkNET Webinar Information: https://dknet.org/about/webinar
dkNET Webinar: Illuminating The Druggable Genome With Pharos 10/23/2020
1. Tudor I. Oprea
University of New Mexico, Albuquerque NM
10/23/2020
dkNET: Connecting Researchers to Resources
Via Zoom Funding: NIH U24 CA224370 & NIH U24 TR002278
http://druggablegenome.net/
http://datascience.unm.edu/
2. 75% of protein research still
focused on 10% genes known
before human genome was mapped
AM Edwards et al, Nature, 2011
This prompted NIH to start the
Illuminating the Druggable Genome
Initiative
3. Informatics, Data Science and
Machine Learning (“AI”) can be
used as follows:
Diseases: EMR processing,
nosology, ontology, & EMR-based ML
Targets: drug target selection &
validation, phenotype associations,
ML
Drugs: Identifying novel therapeutic
modalities using in silico methods
IDG is developing methods
applicable to each of these 3 areas
8/24/20 revision
Diseases image credit: Julie McMurry, Melissa Haendel (OHSU).
All other images credit: Nature Reviews Drug Discovery cover page
4. 2/4/20 revision. R. Santos et al., Nature Rev. Drug Discov. 2017, 16:19-34 link
We curated 667 human genome-derived
proteins and 226 pathogen-derived
biomolecules through which 1,578 US FDA-
approved drugs act.
This set included 1004 orally formulated
drugs as well as 530 injectable drugs
(approved through June 2016).
Data captured in DrugCentral (link)
5. 2/4/20 revision
RFA-RM-16-026
(DRGC)
GPCRs
U24 DK116195:
Bryan Roth, M.D., Ph.D. (UNC)
Brian Shoichet, Ph.D. (UCSF)
Ion
Channels
U24 DK116214:
Lily Jan, Ph.D. (UCSF)
Michael T. McManus, Ph.D. (UCSF)
Kinases
U24 DK116204:
Gary L. Johnson, Ph.D. (UNC)
RFA-RM-16-025
(RDOC)
Outreach
U24 TR002278:
Stephan C. Schürer, Ph.D. (UMiami)
Tudor Oprea, M.D., Ph.D. (UNM)
Larry A. Sklar, Ph.D. (UNM)
RFA-RM-16-024
(KMC) Data
U24 CA224260:
Avi Ma’ayan, Ph.D. (ISMMS)
U24 CA224370:
Tudor Oprea, M.D., Ph.D. (UNM)
RFA-RM-18-011
(CEIT)
Tools
U01 CA239106: N Kannan, PhD & KJ Kochut (UGA)
U01 CA239108: PN Robinson, MD PhD (JAX), CJ Mungall
(LBL), T Oprea (UNM)
U01 CA239069: G Wu, PhD (OHSU), PG D’Eustachio PhD
(NYU), Lincoln D Stein, PhD (OICR)
T. Oprea et al., Nature Rev.Drug Discov. 2018, 17:317-332 link
6. Most protein classification schemes are
based on structural and functional criteria.
For therapeutic development, it is useful to
understand how much and what types of
data are available for a given protein,
thereby highlighting well-studied and
understudied targets.
Tclin: Proteins annotated as drug targets
Tchem: Proteins for which potent small
molecules are known
Tbio: Proteins for which biology is better
understood
Tdark: These proteins lack antibodies,
publications or Gene RIFs
T. Oprea et al., Nature Rev.Drug Discov. 2018, 17:317-332 link 2/10/20 revision
2020 Update: Tdark 31.2%; Tbio 57.7%; Tchem 8%; Tclin 3.1%
9. Tclin proteins are associated
with drug Mechanism of Action
(MoA) – NRDD 2017
Tchem proteins have
bioactivities in ChEMBL and
DrugCentral, + human curation
for some targets
Kinases: <= 30nM
GPCRs: <= 100nM
Nuclear Receptors: <= 100nM
Ion Channels: <= 10μM
Non-IDG Family Targets: <= 1μM
10/19/16 revision
Bioactivities of approved drugs (by Target class)
ChEMBL: database of bioactive chemicals
https://www.ebi.ac.uk/chembl/
DrugCentral: online drug compendium
http://drugcentral.org/
R. Santos et al., Nature Rev.Drug Discov. 2017, 16:19-34 link
10. Tbio proteins lack small molecule annotation cf. Tchem criteria,
and satisfy one of these criteria:
protein is above the cutoff criteria for Tdark
protein is annotated with a GO Molecular Function or Biological Process
leaf term(s) with an Experimental Evidence code
protein has confirmed OMIM phenotype(s)
Tdark (“ignorome”) proteins have little information available, and satisfy
these criteria:
PubMed text-mining score from Jensen Lab < 5
<= 3 Gene RIFs
<= 50 Antibodies available according to antibodypedia.com
8/20/15 revision. T. Oprea et al., Nature Rev. Drug Discov. 2018, 17:317-332 link
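The four development levels can be read as a decision cascade: Tclin first, then Tchem, then Tbio, with Tdark as the remainder. The sketch below is one possible reading of the criteria on these slides, not the TCRD implementation. The record fields and function names are hypothetical, and `min_dark_criteria` (how many of the three Tdark conditions must hold) is an assumption exposed as a parameter, because the slide does not state it.

```python
from dataclasses import dataclass
from typing import Optional

# Tchem potency cutoffs in nM, as listed on the Tchem slide.
TCHEM_CUTOFF_NM = {
    "Kinase": 30.0,
    "GPCR": 100.0,
    "Nuclear Receptor": 100.0,
    "Ion Channel": 10_000.0,  # 10 uM
}
NON_IDG_CUTOFF_NM = 1_000.0  # 1 uM for non-IDG family targets


@dataclass
class ProteinEvidence:
    """Hypothetical record; field names are illustrative, not the TCRD schema."""
    family: str
    has_moa_drug: bool                 # annotated drug mechanism-of-action target
    best_activity_nm: Optional[float]  # most potent known small-molecule activity
    pubmed_score: float                # Jensen Lab PubMed text-mining score
    gene_rifs: int
    antibodies: int                    # count from antibodypedia.com
    has_experimental_go_leaf: bool = False
    has_omim_phenotype: bool = False


def dark_criteria_met(p: ProteinEvidence) -> int:
    """Count how many of the three Tdark conditions hold."""
    return sum([p.pubmed_score < 5, p.gene_rifs <= 3, p.antibodies <= 50])


def assign_tdl(p: ProteinEvidence, min_dark_criteria: int = 2) -> str:
    """Assign a target development level (min_dark_criteria is an assumption)."""
    if p.has_moa_drug:
        return "Tclin"
    cutoff = TCHEM_CUTOFF_NM.get(p.family, NON_IDG_CUTOFF_NM)
    if p.best_activity_nm is not None and p.best_activity_nm <= cutoff:
        return "Tchem"
    above_dark_cutoff = dark_criteria_met(p) < min_dark_criteria
    if above_dark_cutoff or p.has_experimental_go_leaf or p.has_omim_phenotype:
        return "Tbio"
    return "Tdark"


# Made-up example: a kinase with a 25 nM ligand and no approved drug.
example = ProteinEvidence("Kinase", False, 25.0, 120.0, 40, 300)
print(assign_tdl(example))
```

Ordering the checks from Tclin down to Tdark means a protein is always placed at the highest level whose evidence it satisfies.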
11. Tdark parameters differ from the other TDLs across the 4 external
metrics cf. Kruskal-Wallis post-hoc pairwise Dunn tests
2/23/18 revision. T. Oprea et al., Nature Rev. Drug Discov. 2018, 17:317-332 link
12. https://rpubs.com/cbologa/TDL7
Tdark:
9199 proteins in 2013
7658 proteins in 2016
6368 proteins in 2020
Tclin:
601 proteins in 2013
592 proteins in 2016
659 proteins in 2020
10/12/20 revision. T. Sheils, S.L. Mathias et al., Nucleic Acids Research 2021 doi:10.1093/nar/gkaa993
13. T. Sheils, S.L. Mathias et al., Nucleic Acids Research 2021 doi:10.1093/nar/gkaa993 10/12/20 revision
14. 2/4/20 revision. Haendel M, et al. Nature Rev. Drug Discov. 2020 19:77-78 link
We revised the number of RDs from ~7,000 to
10,393 using Disease Ontology, OrphaNet,
GARD, NCIT, OMIM and the Monarch
Initiative MONDO system
We also pointed out the lack of a uniform
definition for rare diseases, and called for
coordinated efforts to precisely define them
We surveyed therapeutic modalities
available to translate advances in the
scientific understanding of rare diseases into
therapies, and discussed overarching issues
in drug development for rare diseases.
15. Tambuyzer E, et al. Nature Rev.Drug Discov. 2020 19:93-111 link 2/4/20 revision
16. 6077 human proteins are associated
with at least one Rare Disease.
Sources: Disease Ontology (RD-slim),
eRAM and OrphaNet
~50% agreement (gene level)
Contrast: Tclin at 3% & Tchem at 7%
overall vs. RD subset: 6.94% Tclin and
14.1% for Tchem.
20% of the RD proteome is Tclin &
Tchem. This means hope for cures.
Potentially significant opportunities for
target & drug repurposing.
2/4/20 revision. Tambuyzer E, et al. Nature Rev. Drug Discov. 2020 19:93-111 link
17. 3/12/18 revision
~35% of the proteins remain
poorly described (Tdark)
~11% of the Proteome (Tclin & Tchem) are currently targeted by
small molecule probes
With help from rare disease patient advocacy groups, rare disease
research is likely to witness a significant increase in translation
18. IN GOD WE TRUST.
All others bring Data.
Quote attributed to W. Edwards Deming, controversial:
Other attributions: George A. Box and Robert W. Hayden.
Bernhard Fisher, MD has said this to a journalist
19. https://pharos.nih.gov/targets/KCNJ11
The IDG KMC tracks 11 information
channels for protein-disease
associations, accessible via the
Pharos portal.
Our challenge is to harmonize
disease concepts, and to enable
computational use: e.g., KCNJ11 with
ABCC8 form the Sulfonylurea 1
Kir6.2 receptor, MoA drug target for
glibenclamide (type 2 diabetes).
10/23/20 revision
The challenge for ML & AI: How to prioritize targets? i.e., which protein-
disease associations are clinically actionable?
(involved is not the same as committed)
22. 9/09/20 revision. G. KC, G Bocci et al., Nature Machine Intell 2020, submitted link
We used data from the NCATS COVID19
portal to develop a suite of ML models for
six assays related to SARS-CoV-2 activities:
• viral entry (Spike/ACE2 via AlphaLISA;
counterscreens TruHit & ACE2 inhibition)
• viral replication (3CL or Mpro)
• live virus infectivity (CPE & cytotoxicity)
REDIAL-2020 prediction workflow
Input: SMILES
Drug Name
PubChem CID
ML: Fingerprints
Pharmacophores
Phys-chem
based on:
RDKit
scikit-learn
External set predictions
a) CPE, 24 actives;
b) CPE, 14 actives;
c) 3CL, 6 actives.
http://drugcentral.org/Redial
23. 9/09/20 revision. G. KC, G Bocci et al., Nature Machine Intell 2020, submitted link
http://drugcentral.org/Redial
24. IDG KMC2 seeks knowledge gaps
across the five branches of the
“knowledge tree”:
Genotype; Phenotype; Interactions
& Pathways; Structure & Function;
and Expression, respectively.
We can use biological systems
network modeling to infer novel
relationships based on available
evidence, and infer new “function”
and “role in disease” data based
on other layers of evidence
Primary focus on Tdark & Tbio
O. Ursu,T Oprea et al., IDG2 KMC 2/01/18 revision
25. O. Ursu et al., manuscript in preparation
Data source | Data type | Data points
CCLE | Gene expression | 19,006,134
GTEx | Gene expression | 2,612,227
Protein Atlas | Gene & Protein expression | 949,199
Reactome | Biological pathways | 303,681
KEGG | Biological pathways | 27,683
StringDB | Protein-Protein interactions | 5,080,023
Gene ontology | Biological pathways & Gene function | 434,317
InterPro | Protein structure and function | 467,163
ClinVar | Human Gene - Disease/Phenotype associations | 881,357
GWAS | Gene - Disease/Phenotype associations | 54,360
OMIM | Human Gene - Disease/Phenotype associations | 25,557
UniProt Disease | Human Gene - Disease/Phenotype associations | 5,365
JensenLab DISEASE | Gene - Disease associations from text mining | 44,829
NCBI Homology | Homology mapping of human/mouse/rat genes | 70,922
IMPC | Mouse Gene - Phenotype associations | 2,153,999
RGD | Rat Gene - Phenotype associations | 117,606
LINCS | Drug induced gene signatures | 230,111,315
We developed automated
methods for data collection
(TCRD), visualization (Pharos)
and data aggregation.
These aggregated datasets
were used to build machine
learning models for 20+
diseases and 73 mouse
phenotypes.
Each knowledge graph
contains ~22,000 metapaths
and 284 million path instances.
10/07/18 revision
26. a meta-path is a path consisting of
a sequence of relations defined
between different object types
(i.e., structural paths at the meta
level)
Our metapaths encode type-
specific network topology
between the source node (e.g.,
Protein) and the destination node
(e.g., Disease).
This approach enables the
transformation of assertions/evidence
chains of heterogeneous
biological data types into a ML
ready format.
G. Fu et al., BMC Bioinformatics 2016, 17:160 is an early example for drug-target interactions 10/01/18 revision
Similar assertions or evidence form metapaths (white).
Instances of metapath (paths) are used to determine the strength of the
evidence linking a gene to disease/phenotype/function.
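To make the metapath idea concrete, here is a minimal sketch on a toy heterogeneous graph. Every node, edge and metapath in it is hypothetical, and a metapath is simplified to a sequence of node types (the real knowledge graph also distinguishes relation types). The sketch counts the path instances that follow each metapath from a protein to a disease and uses those counts as a feature vector.

```python
from collections import defaultdict

# Toy typed graph; all identifiers are made up for illustration.
node_type = {
    "P1": "Protein", "P2": "Protein", "P3": "Protein",
    "P4": "Protein", "P5": "Protein",
    "PW1": "Pathway", "D1": "Disease",
}
edges = [
    ("P1", "P2"),   # protein-protein interaction
    ("P2", "D1"),   # known disease association
    ("P1", "PW1"),  # pathway membership
    ("P3", "PW1"),
    ("P3", "D1"),
    ("P4", "P5"),
]

# Undirected adjacency list, for simplicity.
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

# Each metapath is a sequence of node types from source to destination.
METAPATHS = [
    ("Protein", "Protein", "Disease"),
    ("Protein", "Pathway", "Protein", "Disease"),
]


def count_instances(source, target, metapath):
    """Count simple paths from source to target whose node types match metapath."""
    def walk(node, depth, visited):
        if node_type[node] != metapath[depth]:
            return 0
        if depth == len(metapath) - 1:
            return int(node == target)
        return sum(
            walk(nb, depth + 1, visited | {node})
            for nb in adj[node] if nb not in visited
        )
    return walk(source, 0, frozenset())


def features(protein, disease):
    """One count per metapath: a row of the ML-ready data matrix."""
    return [count_instances(protein, disease, mp) for mp in METAPATHS]


print(features("P1", "D1"))  # evidence reaches D1 via P2 and via PW1/P3
print(features("P4", "D1"))  # no evidence chain reaches D1
```

Stacking one such row per protein, with a label for whether the protein is already associated with the disease, gives a matrix that a standard classifier can consume.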
27. One protein-disease
association at a time
O. Ursu,T Oprea et al., IDG2 KMC 2/01/18 revision
Genes associated with a disease/phenotype are positive examples, whereas genes lacking the same
association are negative examples. The Metapath approach transforms assertions/evidence chains into
classification problems that can be solved using suitably designed machine learning algorithms.
28. All datasets are merged, via R
scripts, into a PostgreSQL database.
Python under development.
Graph embedding transforms
evidence paths into vectors,
converting data into matrices.
Input genes are positive
labels. OMIM (not input) are
negative labels (we prefer true
negatives where possible).
XGBoost runs 100 models. The
“median model” (AUC, F1) is
then selected for analysis and
prediction to avoid overfitting.
10/15/19 revision. J.J. Yang, P. Kumar, D. Byrd et al., IDG2 KMC
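The "median model" selection can be sketched without any training code. The example below assumes each of the repeated runs has already been scored, and it picks the run closest to the median AUC and median F1. The slide does not say how the two metrics are combined, so summing the absolute distances is an assumption, and the scores shown are made up.

```python
import statistics


def pick_median_model(runs):
    """Return the run whose (AUC, F1) pair lies closest to the medians.

    runs: list of dicts with keys 'model_id', 'auc', 'f1'.
    Distance = |auc - median_auc| + |f1 - median_f1| (an assumed rule).
    """
    med_auc = statistics.median(r["auc"] for r in runs)
    med_f1 = statistics.median(r["f1"] for r in runs)
    return min(
        runs,
        key=lambda r: abs(r["auc"] - med_auc) + abs(r["f1"] - med_f1),
    )


# Hypothetical scores for five runs (the slide describes 100).
runs = [
    {"model_id": 0, "auc": 0.70, "f1": 0.40},
    {"model_id": 1, "auc": 0.75, "f1": 0.45},
    {"model_id": 2, "auc": 0.80, "f1": 0.50},
    {"model_id": 3, "auc": 0.90, "f1": 0.70},
    {"model_id": 4, "auc": 0.60, "f1": 0.30},
]
chosen = pick_median_model(runs)
print(chosen["model_id"])
```

Choosing a typical run rather than the best-scoring one avoids reporting a model that merely got a lucky split.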
29. A soccer match at RoboCup, Nagoya 2017
Image searching for “Bad AI”
30. Build data matrix from “Alzheimer’s disease” in
TCRD subset
protein knowledge graph along metapaths:
Protein – Protein Interactions
Pathways
GO terms
Gene expression
...
Training set: 53 genes associated with
Alzheimer’s disease (positives); 3,952 genes
associated with other pathologies from OMIM
were assumed to be negative
Test set: 23 genes associated with Alzheimer's
(positives) and 200 genes not associated with
Alzheimer's (negatives) from Text Mining
“Complete forest” binary classifier using
XGBoost & 5-fold cross-validation.
Weighted model is better than balanced model
2/14/18 revision. ML work by Oleg Ursu
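Balanced model, test set (rows: actual class; columns: predicted class)
Actual | Predicted Pos | Predicted Neg
Pos | 16 | 7
Neg | 94 | 106
Weighted model, test set (rows: actual class; columns: predicted class)
Actual | Predicted Pos | Predicted Neg
Pos | 20 | 3
Neg | 41 | 159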
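The comparison between the two models can be made explicit by deriving standard metrics from the two confusion matrices. The counts below are taken directly from the slide; the function and variable names are illustrative.

```python
def metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)          # sensitivity: positives recovered
    specificity = tn / (tn + fp)     # negatives correctly rejected
    precision = tp / (tp + fp)       # predicted positives that are real
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Counts from the Alzheimer's disease test set (23 positives, 200 negatives).
balanced = metrics(tp=16, fn=7, fp=94, tn=106)
weighted = metrics(tp=20, fn=3, fp=41, tn=159)

for name, m in (("balanced", balanced), ("weighted", weighted)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The weighted model recovers 20 of 23 positives against 16 of 23 for the balanced model, and it rejects 159 of 200 negatives against 106 of 200, so it is better on both sensitivity and specificity.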
31. The top most important features are interactions with
proteins mediating inflammatory processes (JAK2/Tclin,
IL10 & IL2 / Tchem), response to oxidative stress
(GSTP1/Tchem), nervous system development (BDNF/Tbio)
and glycolysis (GAPDH/Tchem).
LINCS drug-induced gene expression perturbations are
the largest category of features for these predictions.
Brain cortex expression is a necessary requirement.
One Reactome pathway (AU-rich mRNA elements binding
proteins) is also important.
The weighted approach showed better performance in the
test set for Alzheimer's Disease, Schizophrenia, and Dilated
Cardiomyopathy.
4/23/18 revision. ML work by Oleg Ursu
32. We tested the top 20 genes identified
by PKG/m-p/ML with a high-
throughput validation system by
measuring AD-relevant
hyperphosphorylated (at
S199/S202/T205) tau protein (AT8-Tau
and AT180-Tau) using a Cellomics®
high-content microscope; as well as
gene expression and
immunochemistry analysis via human
AD induced pluripotent stem cells
and human AD brain tissue
8/24/20 revision. AD validation work by Jessica Binder & Kiran Bhaskar, funded by U24CA224370-S2 supplement
33. 2/14/19 revision. AD validation work by Jessica Binder & Kiran Bhaskar, funded by U24CA224370-S2 supplement
SHSY5Y’s in vitro
siRNA knock-downs
measuring ∆pTau
(AT8) levels –
unbiased cellomics
qPCR gene
expression
Human induced
pluripotent stem
cells derived into
neurons –AD vs Ctrl
[Figure: bar chart of qPCR fold change (2^-∆∆Ct) relative to Ctrl for the 20 candidate genes; legend entries AX0018 and sAD2.1; asterisks mark statistical significance.]
35. 5/22/19 revision. AD validation work by Jessica Binder & Kiran Bhaskar, funded by U24CA224370-S2 supplement
Top 20 Genes
predicted by the
XGBoost/Metapath
model, clustered by
functional roles
36. 8/24/20 revision. AD validation work by Jessica Binder & Kiran Bhaskar, funded by U24CA224370-S2 supplement
We proposed to validate the ML model predictions for the top 20 genes:
AKNA, BCO2, CCNY, CRTAM, FAM92B, FOXP4, FRRS1, GRIN2C, IL17REL,
LILRA3, LMO4, NDRG2, PIBF1, RAB40A, SCGB3A1, SLC44A2, SPOP,
STARD3, TMEFF2, TXNDC12
The most obvious effects, based on the combined Cellomics & qPCR
of iPSNs & autopsy brains, suggest that AKNA, LILRA3, PIBF1 and
TXNDC12 significantly increased pTau (as tracked by two different
antibodies for T180, S202 and S205)
PIBF1, LILRA3 and CRTAM show the most significant effect on tau
phosphorylation; two (CRTAM and LILRA3) novel genes are
implicated in innate immune pathways
37. ML work by Tudor Oprea
Genes 51
Source https://omim.org/entry/125853
AUC 0.72±0.02
1/16/19 revision
First model: 51 OMIM genes
associated with T2D vs. 3,954
OMIM genes associated with
other pathologies. AUC = 0.72 ±
0.08.
VIP-ranked variables include
HFE & HMOX1, which relate to
hemochromatosis (80% leads to
T2D), and IL1B & IL10 (suggests
an immune component).
38. From: Mark McCarthy <mark.mccarthy@drl.ox.ac.uk>
Sent: Friday, December 7, 2018 11:10 AM
The general summary is that we don’t see any enrichment for T2D associations
in either exome or GWAS data from the predicted gene sets (however we slice
them up).
But having said that, we don’t really see anything in the TRAINING set either: No
association in the exomes, and a weak (just nominal) association in the GWAS
data.
To be honest, I think, now we’ve taken a look at it, we’d all question the
training set: I had missed that this came from OMIM, which is simply not a
reliable source of information in this regard.
1/3/19 revision
39. ML work by Tudor Oprea
Genes 54
Source Causal T2DM transcripts
AUC 0.79±0.01
1/16/19 revision
• Second model: 54 causal transcripts
provided by Anubha Mahajan & Mark
McCarthy vs. 3,954 OMIM genes.
AUC = 0.79 ± 0.01.
Genes confirmed by GWAS (9 in
top 24): C2CD4B, C2CD4A,
JAZF1, ADAMTS9, CRY2,
LINGO2, THADA, TMEM18 &
SEC16B. 4 genes have GO terms
for insulin secretion: CPLX1,
ADRA2A, SYT7 & SYTL4
Top 4 VIP-ranked variables include
2 PPI nodes: SLC30A8 (rs13266634)
and GIPR (rs8108269), which have
GWAS-T2D associations.
41. Mackmyra tasked Microsoft and Fourkind to create novel
whisky recipes using AI
From input of 75 recipes,“AI” could generate 70 million
combinations.
Nr 36 on the AI ranked combinations was approved by
humans
https://www.geekwire.com/2019/microsoft-got-creation-worlds-first-whisky-formulated-ai/ 9/22/19 revision
42. How long does it take to move from “natural” language processing
to AI-driven large-dataset mining? Klingon, anyone? tlhIngan, vay'?
9/25/19 revision
Tomáš Mikolov (Google) developed an efficient algorithm to compute the
distributed representation of words, Word2Vec. It’s currently used for automatic
translation, spam filtering and speech recognition. Word2vec encodes words
using a distribution of weights across 100s of elements that compose the vectors.
Each element contributes to many words.
T. Mikolov et al., ICLR 2013
10/10/19 revision
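A distributed representation can be illustrated with toy vectors: each word is a list of weights, and related words end up with similar weight patterns. The four-dimensional vectors below are invented for illustration (real Word2Vec vectors have hundreds of elements and are learned from text); cosine similarity is a common way to compare them.

```python
import math

# Hypothetical 4-element embeddings; values are made up, not learned.
vectors = {
    "kinase":      [0.9, 0.1, 0.3, 0.0],
    "phosphatase": [0.8, 0.2, 0.4, 0.1],
    "whisky":      [0.0, 0.9, 0.1, 0.7],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


related = cosine(vectors["kinase"], vectors["phosphatase"])
unrelated = cosine(vectors["kinase"], vectors["whisky"])
print(round(related, 3), round(unrelated, 3))
```

Because every element contributes to many words, similarity is read from the whole vector rather than from any single position.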
43. Alexahealth™: Given today’s health status and my calorie budget,
what food should I shop/prepare today?
Expanding on current models, IDG KMC could use AI/ML to integrate context-
specific computational reasoning tools (“AMI”) with real-time -omics,
biomarker and biomedical literature data.
These could be plugged into hospital / EMR data to improve patient services.
10/10/19 revision
44. 8/24/20 revision
Predictivity between different models for the same disease (even
using the same ML methods) may differ due to input variations
High quality data is really hard to obtain
Weakest components:
‘Ground Truth’ (true negatives) and Domain Expertise