Presented at the American Chemical Society Middle Atlantic Regional Meeting (MARM) 2021 (June 10, 2021).
==== Abstract ====
With the emergence of the age of big data and artificial intelligence, biomedical research communities have a great interest in exploiting the massive amount of chemical and biological data available in the public domain. PubChem (https://pubchem.ncbi.nlm.nih.gov) is one of the largest sources of publicly available chemical information, with more than 270 million substance descriptions, more than 110 million unique compounds, and more than 285 million bioactivity outcomes from more than one million biological assay experiments. PubChem provides a wide range of chemical information, including structure, pharmacology, toxicology, drug target, metabolism, chemical vendors, patents, regulations, clinical trials, and many others. These contents can be accessed interactively through web browsers as well as programmatically using computer scripts. They can also be downloaded in bulk through the PubChem File Transfer Protocol (FTP) site. PubChem data has been used in many studies for developing bioactivity and toxicity prediction models, discovering polypharmacologic (multi-target) ligands, and identifying new macromolecule targets of compounds (for drug-repurposing or off-target side effect prediction). This presentation provides an overview of PubChem data, tools, and services useful for drug discovery.
3. ▪ https://pubchem.ncbi.nlm.nih.gov
▪ Public chemical database at NIH.
▪ Contains information on various chemical entities:
• Small molecules
• siRNAs & miRNAs
• Carbohydrates
• Lipids
• Peptides
• Chemically modified macromolecules
• ……
PubChem is a Public Chemical Information Resource
4. PubChem is a Data Aggregator
PubChem Sources: https://pubchem.ncbi.nlm.nih.gov/sources
790+ data sources:
o Gov’t agencies
o Academic institutions
o Publishers
o Pharma companies
o Chemical vendors
o Scientific databases
Users:
o Biomedical Researchers
• Chemical biology
• Medicinal chemistry
• Drug design & discovery
• Cheminformatics
o Data scientists
o Patent agents/examiners
o Chemical safety officers
o Educators/librarians
o Students
35. Gene/Protein Target Page
➢ Suppose that you want to:
o Retrieve ALL active compounds
against a given protein/gene target
(e.g., HMGCR=3-hydroxy-3-methylglutaryl-CoA reductase).
• To identify common chemical scaffolds responsible for bioactivity.
• To build a quantitative structure-activity relationship (QSAR) model.
→Gene/Protein Target page
• Provides a target-centric view of PubChem data.
• Organizes all data available in PubChem for a given gene/protein.
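The Gene/Protein Target page is an interactive view. As a rough programmatic counterpart, the sketch below builds PUG-REST URLs that list the assays annotated with a gene symbol and then the active compounds in each assay. The URL patterns reflect my reading of the PUG-REST documentation and should be checked against it before use; the helper names (`assays_for_gene_url`, `active_cids_url`, `parse_id_list`, `merge_actives`) are hypothetical. No network request is made here.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def assays_for_gene_url(gene_symbol):
    """URL listing assay IDs (AIDs) whose target is the given gene symbol (assumed pattern)."""
    return f"{PUG_REST_BASE}/assay/target/genesymbol/{quote(gene_symbol)}/aids/TXT"


def active_cids_url(aid):
    """URL listing compound IDs (CIDs) tested active in one assay (assumed pattern)."""
    return f"{PUG_REST_BASE}/assay/aid/{int(aid)}/cids/TXT?cids_type=active"


def parse_id_list(text):
    """Parse a TXT response body (one integer ID per line) into a list of ints."""
    return [int(token) for token in text.split()]


def merge_actives(per_assay_cids):
    """Union of the active CIDs found across several assays, sorted."""
    merged = set()
    for cids in per_assay_cids:
        merged.update(cids)
    return sorted(merged)


if __name__ == "__main__":
    print(assays_for_gene_url("HMGCR"))
    print(active_cids_url(1159531))
```

A compound that is active in an assay annotated with a gene is not necessarily the same set the Target page shows, so the result is a starting point for scaffold analysis or QSAR work rather than a replacement for the curated page.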
43. Patent View Page
➢ Suppose that you want to:
o Retrieve ALL chemicals mentioned in a given patent document.
→Patent View page
• Provides a list of chemicals “mentioned” in the patent application/grant.
• Gives no information on why they are mentioned
(e.g., as subject matter or as prior art).
• Other information, including:
- Title, abstract, date, inventor, …
- International patent classification (IPC) codes
53. ➢ PubChem users have very diverse
backgrounds/interests.
➢ PubChem’s web interfaces are optimized
to perform commonly requested tasks
interactively.
➢ Everything you can do with PubChem
through the web browser can be
automated through PubChem’s
programmatic interfaces.
➢ Programmatic access enables one to do
much more complicated tasks that cannot
be done through the web browser.
54. ➢ Multiple programmatic access routes
➢ Two major programmatic access methods
o PUG-REST (primarily for computed properties).
https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest
o PUG-View (primarily for text information).
https://pubchemdocs.ncbi.nlm.nih.gov/pug-view
➢ Request volume limitation:
o No more than 5 requests per second
(See more at: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access$_RequestVolumeLimitations)
o Violators/abusers may be blocked for a certain period of time.
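To make the 5-requests-per-second limit concrete, here is a minimal sketch of a client-side throttle together with a PUG-REST URL builder for computed properties. The class and function names (`Throttle`, `property_url`, `fetch_properties`) are illustrative assumptions, not part of any PubChem library; only the base URL and the compound/cid/property path come from PUG-REST itself.

```python
import json
import time
from urllib.request import urlopen

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cid, properties, output="JSON"):
    """Build a PUG-REST URL requesting computed properties for one CID."""
    return f"{PUG_REST_BASE}/compound/cid/{int(cid)}/property/{','.join(properties)}/{output}"


class Throttle:
    """Space successive calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_properties(cid, properties, throttle):
    """Fetch properties for one CID, waiting first so the rate limit is respected."""
    throttle.wait()
    with urlopen(property_url(cid, properties), timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    throttle = Throttle(max_per_second=5)
    for cid in (2244, 3672):
        print(fetch_properties(cid, ["MolecularFormula", "MolecularWeight"], throttle))
```

The clock and sleep functions are injectable so the spacing logic can be checked without real delays or network access.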
Programmatic access routes (diagram):
o Entrez Utilities (E-Utils)
o Power User Gateway (PUG)
o PUG-SOAP
o PUG-REST
o PubChem RDF REST
o PUG-View
55. ➢ Bulk Download
o PubChem FTP Site
ftp://ftp.ncbi.nlm.nih.gov/pubchem
o PubChem RDF (Resource Description Framework)
https://pubchemdocs.ncbi.nlm.nih.gov/rdf
59. Data sets
Set          Source                      Compounds   Actives   Inactives
Training     Tox21 (AID 1159531), 90%    4,916       471       4,445
Test         Tox21 (AID 1159531), 10%    547         53        494
External 1   ChEMBL (45 assays)          222         205       17
External 2   NCGC (2 assays)             489         20        469
• Tox21: quantitative HTS (qHTS) data for 10K compounds; predominantly inactive. Preprocessed, then split 90%/10% into training and test sets.
• ChEMBL: data extracted from journal articles; predominantly active. Preprocessed before use.
• NCGC: qHTS data; predominantly inactive; some overlap w/ Tox21. Preprocessed before use.
• All data are available in PubChem.
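The active and inactive counts on this slide show how strongly the class balance differs between sets. The short sketch below computes the active fraction and the inactive-to-active ratio from those counts; the counts are taken from the slide, while the function name `class_balance` and the dictionary layout are illustrative assumptions.

```python
# (actives, inactives) per data set, as listed on the slide
DATA_SETS = {
    "Training (Tox21)": (471, 4445),
    "Test (Tox21)": (53, 494),
    "External 1 (ChEMBL)": (205, 17),
    "External 2 (NCGC)": (20, 469),
}


def class_balance(actives, inactives):
    """Return the set size, the fraction of actives, and the inactive-to-active ratio."""
    total = actives + inactives
    return {
        "total": total,
        "active_fraction": actives / total,
        "inactive_to_active": inactives / actives,
    }


if __name__ == "__main__":
    for name, (actives, inactives) in DATA_SETS.items():
        stats = class_balance(actives, inactives)
        print(
            f"{name}: n={stats['total']}, "
            f"active fraction={stats['active_fraction']:.3f}, "
            f"inactive/active={stats['inactive_to_active']:.2f}"
        )
```

The training and test sets have roughly one active per nine or ten inactives, whereas the ChEMBL set is dominated by actives, which is why a balance-aware metric matters when comparing results across sets.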
64. Model Building
❑ Molecular descriptors
• Generated using PaDEL [Yap CW (2011). J. Comput. Chem., 32 (7): 1466-1474]
Abbreviation   Name                         Length
AP             AtomPairs 2D fingerprint     780
ESTAT          Estate fingerprint           79
EXTFP*         CDK extended fingerprint     1,024
FP*            CDK fingerprint              1,024
GOFP*          CDK graph only fingerprint   1,024
KR             Klekota-Roth fingerprint     4,860
MACCS          MACCS fingerprint            166
PUB            PubChem fingerprint          881
SUB            Substructure fingerprint     307
* Hashed fingerprints
65. Model Building
❑ Machine-learning algorithms (implemented in scikit-learn)
Abbreviation   Name                     Hyperparameters optimized
NB             Naïve Bayes              (10^-10 ~ 1)
DT             Decision tree            max_depth_range (3 ~ 7)
                                        min_samples_split_range (3 ~ 7)
                                        min_samples_leaf_range (2 ~ 6)
kNN            K-nearest neighbors      weights (uniform, minkowski, jaccard)
                                        n_neighbors (1 ~ 25)
RF             Random forest            n_estimators (10 ~ 200)
SVM            Support vector machine   C (2^-10 ~ 2^10); (2^-10 ~ 2^10)
NN             Neural network           solver (lbfgs or adam); (10^-7 ~ 10^7)
▪ 10-fold cross-validation was used for hyperparameter optimization.
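As an illustration of the tuning step, the sketch below runs a 10-fold cross-validated grid search over `n_estimators` for a random forest in scikit-learn, scored by ROC AUC. It is not the presenters' code: the grid values, the random seed, and the synthetic data standing in for a fingerprint matrix are all assumptions, and the helper name `tune_random_forest` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def tune_random_forest(X, y, n_estimators_grid=(10, 50, 100, 200), seed=0):
    """Grid-search n_estimators with 10-fold CV, selecting by ROC AUC."""
    search = GridSearchCV(
        estimator=RandomForestClassifier(random_state=seed),
        param_grid={"n_estimators": list(n_estimators_grid)},
        scoring="roc_auc",
        cv=10,  # for classifiers, scikit-learn uses stratified folds here
    )
    return search.fit(X, y)


if __name__ == "__main__":
    # Synthetic, imbalanced stand-in for fingerprint features (about 10% positives).
    X, y = make_classification(
        n_samples=300, n_features=30, weights=[0.9, 0.1], random_state=0
    )
    search = tune_random_forest(X, y, n_estimators_grid=(10, 50))
    print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to the other algorithms in the table by swapping the estimator and the parameter grid.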
66. Model Performance Evaluation
▪ Area under the receiver operating characteristic curve (AUC)
→ Used for hyperparameter optimization.
▪ Balanced accuracy: $\mathrm{BACC} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right) = \frac{1}{2}\left(\mathrm{SENS} + \mathrm{SPEC}\right)$
▪ Sensitivity: $\mathrm{SENS} = \frac{TP}{TP+FN}$
▪ Specificity: $\mathrm{SPEC} = \frac{TN}{TN+FP}$
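The three formulas above translate directly into code. The sketch below computes them from confusion-matrix counts; the function names are illustrative, and the counts in the demo are hypothetical numbers chosen only to match the size of the test set (53 actives, 494 inactives), not results from the study.

```python
def sensitivity(tp, fn):
    """SENS = TP / (TP + FN): fraction of actives predicted active."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """SPEC = TN / (TN + FP): fraction of inactives predicted inactive."""
    return tn / (tn + fp)


def balanced_accuracy(tp, fn, tn, fp):
    """BACC = (SENS + SPEC) / 2."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))


def accuracy(tp, fn, tn, fp):
    """Plain accuracy, shown only for comparison with BACC."""
    return (tp + tn) / (tp + fn + tn + fp)


if __name__ == "__main__":
    # Hypothetical confusion counts for a 53-active / 494-inactive set.
    print(balanced_accuracy(tp=40, fn=13, tn=400, fp=94))
    # A model that predicts every compound inactive:
    print(accuracy(tp=0, fn=53, tn=494, fp=0))
    print(balanced_accuracy(tp=0, fn=53, tn=494, fp=0))
```

On a set this imbalanced, a model that calls everything inactive reaches a plain accuracy of about 0.90 while its BACC is 0.5, which is why BACC is reported alongside AUC.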
67. ❑ Performance of the models
▪ AUC scores of 0.7 were observed for models
developed using:
PubChem/MACCS/CDK-FP with
NN/SVM/RF/kNN
▪ Maximum AUC score (0.77):
PubChem fingerprint with RF
▪ A similar trend was observed for the
performance in terms of BACC scores
(not shown here).
Area under ROC curve (AUC)
Model Performance Evaluation
68. General applicability of the models
[Figure: area under ROC curve (AUC) on the NCGC and ChEMBL external sets; inactive-to-active ratio = 1]
69. • PubChem is the largest source of publicly available chemical
information, collected from hundreds of data sources.
• In addition to bioactivity data generated through high-throughput
screenings, PubChem contains a substantial amount of bioactivity
information extracted from scientific articles.
• Chemical vendor and patent information for compounds in
PubChem helps prioritize hit compounds for further screening.
Summary
70. • PubChem supports programmatic access to its data, which makes it
possible to build automated virtual screening pipelines.
• PubChemRDF allows users to download PubChem data to a
local computing facility and integrate it with their own data.
• PubChem data can be used for developing computational
prediction models for bioactivity or toxicity of molecules.
Summary
71. Acknowledgements
▪ The PubChem Team
Evan Bolton, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Ben Shoemaker, Zhi Sun, Paul Thiessen, Bo Yu, Leonid Zaslavsky, Jian Zhang
▪ Funding
Intramural Research Program of the National Library of Medicine
72. Thank you for your attention.
Questions?
Sunghwan Kim (kimsungh@ncbi.nlm.nih.gov)
73. ❑ References
▪ Getting the most out of PubChem for virtual screening
S. Kim, Expert Opin. Drug Discov. 2016, 11(9):843-855.
▪ PubChem in 2021: new data content and improved web interfaces
S. Kim et al., Nucleic Acids Res. 2021, 49(D1):D1388-D1395.
▪ An update on PUG-REST: RESTful interface for programmatic access to PubChem
S. Kim et al., Nucleic Acids Res. 2018, 46(W1):W563-W570.
▪ PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem
S. Kim et al., Nucleic Acids Res. 2015, 43(W1):W605-W611.