PubChem is a public repository maintained by the NIH that contains over 243 million substance descriptions, 97 million unique chemical structures, and over 264 million biological activity test results. It serves as both a large data archive and knowledgebase. Programmatic interfaces allow for automated retrieval and integration of PubChem data into virtual screening pipelines. PubChemRDF encodes the data as RDF triples, enabling local storage and integration with other datasets using semantic web technologies.
4. Lehigh Univ., Apr. 26, 2019
PubChem (https://pubchem.ncbi.nlm.nih.gov)
A “public” repository of information on small molecules
and their biological activities, developed and
maintained by the U.S. National Institutes of Health
(NIH).
Launched in 2004 as a part of the Molecular Libraries
Roadmap initiatives.
A key chemical information resource for researchers
in cheminformatics, chemical biology, medicinal
chemistry, and many other fields.
5. Lehigh Univ., Apr. 26, 2019
PubChem Usage Statistics
[Figure: Number of monthly unique users, in millions, by month (Jan 2015 - Mar 2019, interactive users only)]
3.5 million unique users per month at peak (October 2018)
6. Lehigh Univ., Apr. 26, 2019
Top 5 Chemistry Websites
1. acs.org
2. rsc.org
3. sigmaaldrich.com
4. pubchem.ncbi.nlm.nih.gov
5. cas.org
Source: https://www.alexa.com/topsites/category/Top/Science/Chemistry
PubChem is the only public website among them.
7. Lehigh Univ., Apr. 26, 2019
Dual role of PubChem
>600 data sources → PubChem → the public
Data archive/repository:
Collects/maintains chemical information submitted by data
contributors.
Knowledgebase:
Provides (high-quality) data to the public.
8. Lehigh Univ., Apr. 26, 2019
Data organization in PubChem
Data contributors → substance deposition → depositor-provided substance descriptions
Depositor-provided substance descriptions → unique chemical structure extraction through standardization → unique chemical structures
Data contributors → assay deposition → depositor-provided bioactivity test results
Depositor-provided bioactivity test results → activity of tested “substances” → activity of “compounds” derived from associated “substances”
9. Lehigh Univ., Apr. 26, 2019
PubChem (https://pubchem.ncbi.nlm.nih.gov)
PubChem contains:
• >243.9 million substance descriptions,
• >97.6 million unique chemical structures,
• >264.8 million biological activity test results,
• >1.3 million biological assays, covering >10,000 unique
protein sequence targets.
(Arguably) the largest corpus of
publicly available chemical information
from 600+ data sources.
(as of April 24, 2016)
10. Lehigh Univ., Apr. 26, 2019
Exploring PubChem
using the web interfaces
11. Lehigh Univ., Apr. 26, 2019
Note
(Almost) all tasks you can do using PubChem’s web
interfaces can be automated using its programmatic
interfaces.
12. Lehigh Univ., Apr. 26, 2019
PubChem Web Interfaces
Text Search
Structure Search
ID List Upload
Classification Browser
PubChem Docs
37. Lehigh Univ., Apr. 26, 2019
PubChem’s chemical space
Lead compounds → modification → drug candidates
Congreve’s rule of 3 (lead-likeness), for lead compounds:
• Mol. Wt. < 300
• H-Bond Donor ≤ 3
• H-Bond Acceptor ≤ 3
• LogP ≤ 3
• Rotatable Bond ≤ 3
• PSA ≤ 60
Lipinski’s rule of 5 (drug-likeness), for drug candidates:
• Mol. Wt. < 500
• H-Bond Donor ≤ 5
• H-Bond Acceptor ≤ 10
• LogP ≤ 5
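The two rule sets listed above can be applied as simple filters once the molecular properties are known. The following is a minimal sketch, assuming the properties have already been computed or retrieved elsewhere; the `Props` class and its field names are illustrative and not part of any PubChem interface.

```python
from dataclasses import dataclass


@dataclass
class Props:
    """Precomputed molecular properties (field names are illustrative)."""
    mol_wt: float
    hbd: int        # hydrogen-bond donors
    hba: int        # hydrogen-bond acceptors
    logp: float
    rot_bonds: int  # rotatable bonds
    psa: float      # polar surface area


def is_drug_like(p: Props) -> bool:
    """Lipinski's rule of 5, using the thresholds from the slide."""
    return p.mol_wt < 500 and p.hbd <= 5 and p.hba <= 10 and p.logp <= 5


def is_lead_like(p: Props) -> bool:
    """Congreve's rule of 3, including the rotatable-bond and PSA limits."""
    return (p.mol_wt < 300 and p.hbd <= 3 and p.hba <= 3 and p.logp <= 3
            and p.rot_bonds <= 3 and p.psa <= 60)


# Made-up property values, for illustration only.
example = Props(mol_wt=350.0, hbd=2, hba=5, logp=2.5, rot_bonds=6, psa=80.0)
print(is_drug_like(example), is_lead_like(example))
```

With these made-up values the compound passes the drug-likeness filter but not the stricter lead-likeness filter, which is the relationship the slide illustrates.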
38. Lehigh Univ., Apr. 26, 2019
PubChem’s chemical space
All compounds: 97.6 million (100%)
Drug-like: 73.3 million (75%)
Lead-like: 11.2 million (11%)
39. Lehigh Univ., Apr. 26, 2019
PubChem’s chemical space
Ro5: 73.3 million (75%)
Ro5-1: 15.0 million (15%)
Ro5-2: 6.5 million (7%)
Ro5-3: 1.3 million (1%)
Ro5-4: 0.3 million (~0%)
Ro5 + Ro5-1 = 90%
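A breakdown like the one above can be produced by counting rule violations per compound. The sketch below assumes that each category after the first groups compounds by the number of Ro5 criteria they violate (one to four); this reading of the labels is an assumption, and the function names and toy property tuples are hypothetical.

```python
from collections import Counter


def ro5_violations(mol_wt, hbd, hba, logp):
    """Count how many of the four Ro5 criteria a compound violates."""
    return sum([mol_wt >= 500, hbd > 5, hba > 10, logp > 5])


def ro5_histogram(compounds):
    """Map the number of violations (0 to 4) to the number of compounds."""
    return Counter(ro5_violations(*c) for c in compounds)


# Toy (mol_wt, hbd, hba, logp) tuples, for illustration only.
toy = [(350.0, 2, 5, 2.5), (620.0, 2, 5, 2.5), (620.0, 7, 12, 6.1)]
hist = ro5_histogram(toy)
at_most_one = (hist[0] + hist[1]) / len(toy)
print(dict(hist), at_most_one)
```

The `at_most_one` fraction corresponds to the combined share of compounds with zero or one violation.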
40. Lehigh Univ., Apr. 26, 2019
Bioactivity Data in PubChem
All compounds: 97.6 million (100.00%)
• Not tested: 94.2 million (96.51%)
• Tested: 3.4 million (3.50%)
o Active (AC ≤ 1 nM): 62 thousand (0.06%)
o Active (1 nM < AC ≤ 1 µM): 713 thousand (0.73%)
o Active (others): 465 thousand (0.47%)
o Inactive: 2.1 million (1.34%)
43. Lehigh Univ., Apr. 26, 2019
Bioactivity Data in PubChem
High-throughput screening data
• From the Molecular Libraries Program and other HTS
projects
• Many inactives
• False hits (e.g., aggregators, autofluorescent compounds)
• Often measured at a single concentration
Literature-extracted data
• From manual curation or data mining
• No (or few) inactives
• Provided by various PubChem depositors, including
ChEMBL, PDBbind, BindingDB, and Guide to Pharmacology
44. Lehigh Univ., Apr. 26, 2019
Literature-extracted bioactivity data
ChEMBL
• Manual extraction from peer-reviewed papers in medicinal
chemistry and natural products journals
PDBbind
• Experimental binding affinity data for biomolecular
complexes in the PDB
BindingDB
• Binding affinities, focusing chiefly on the interactions of
proteins considered to be drug targets with drug-like small
molecules
Guide to Pharmacology
• Drug targets (G-protein-coupled receptors, ion channels, and
nuclear hormone receptors) and their ligands
45. Lehigh Univ., Apr. 26, 2019
Annotations available in PubChem
DrugBank
Comprehensive information on FDA-approved and
investigational drugs
• drug indications,
• mechanism of action,
• target macromolecules,
• interactions with genes/proteins,
• ADMET, ……
Hazardous Substance Data Bank (HSDB)
Toxicological information on chemicals of interest
in environmental and human health
46. Lehigh Univ., Apr. 26, 2019
Annotations available in PubChem
Molecular Modeling Database (MMDB)
Protein-bound 3D structures (found in PDB)
Cambridge Structural Database (CSD)
3-D crystal structures
NLM’s Dailymed
Drug labeling information
47. Lehigh Univ., Apr. 26, 2019
Annotations available in PubChem
FDA
Orange Book
Unique ingredient identifiers,
Pharmacologic Classes
EPA
Substance Registry Services
Chemical data collected under the:
o Toxic Substances Control Act
o Clean Air Act
48. Lehigh Univ., Apr. 26, 2019
Availability of compounds for subsequent
experiments
• Virtual screening hits should be synthesizable or
purchasable.
• PubChem contains “real” molecules (not “virtual”
molecules)
• At least one data contributor claims to have each
compound and/or information about it.
• Some of these data contributors are chemical
vendors (e.g., Sigma Aldrich).
50. Lehigh Univ., Apr. 26, 2019
Two important aspects of PubChem records
(in the context of “compound availability”)
Non-live compounds:
Not searchable although they exist.
No associated substances due to:
o Mistakenly submitted substances
o Incorrect information
o No intention to share
Legacy designation:
The data contributor no longer keeps its records up to date.
o Discontinued funding, low business priority, …
51. Lehigh Univ., Apr. 26, 2019
Compound patentability for IP protection
• PubChem provides patent link information for
compounds, provided by:
o IBM
o SureChEMBL (formerly, SureChem)
o NextMove Software
o SCRIPDB
o BindingDB
52. Lehigh Univ., Apr. 26, 2019
Compound patentability for IP protection
>336 million chemical-patent links
>6 million patent documents
>16 million unique chemical structures
• Covering patent documents published by:
o U.S. Patent and Trademark Office (PTO)
o European Patent Office (EPO)
o World Intellectual Property Organization (WIPO)
Compounds with patent links are annotated with
WIPO International Patent Classification (IPC)
information
53. Lehigh Univ., Apr. 26, 2019
Programmatic access to PubChem for
automation of the virtual screening (VS) pipeline
• Entrez Utilities (E-Utils)
• Power User Gateway (PUG)
• PUG-SOAP
• PUG-REST
• PUG-View
• PubChemRDF REST
54. Lehigh Univ., Apr. 26, 2019
Conceptual framework of a PUG-REST request
User’s computer → PubChem servers:
① INPUT: identifiers (CIDs, SIDs, AIDs)
② OPERATION: what to do with the identifiers
PubChem servers → user’s computer:
③ OUTPUT: results in a desired format
All necessary information is encoded into a one-line URL.
55. Lehigh Univ., Apr. 26, 2019
URL construction for a PUG-REST request
http://pubchem.ncbi.nlm.nih.gov/rest/pug/<INPUT>/<OPERATION>/<OUTPUT>[?OPTIONS]
Prolog (http://pubchem.ncbi.nlm.nih.gov/rest/pug)
Common to all PUG REST requests.
<INPUT>
Specifies identifiers of interest:
• by identifiers
• by chemical name
• by chemical structure search
• by cross reference
• by listkey, ......
<OPERATION>
Specifies what to do with the input:
• get full records
• get molecular properties
• get synonyms or images
• get cross references
• many other operations
<OUTPUT>
Specifies the desired output format:
XML, JSON, JSONP, ASNB, ASNT, PNG, SDF, CSV, TXT
[?OPTIONS]
Options specific to some operations.
Example:
http://...... /compound/cid/2244,1983/record/XML?record_type=3d
Retrieves full records for CIDs 2244 and 1983 in XML, including the 3-D
structure description.
All necessary information is encoded into a one-line URL.
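Because a request is just a URL with four parts, it can be assembled with a few lines of code. The sketch below rebuilds the example request for CIDs 2244 and 1983; the helper names `pug_rest_url` and `cid_input` are hypothetical, and the `https` scheme is an assumption (the slide shows `http`). No request is actually sent.

```python
from urllib.parse import urlencode

# Prolog common to all PUG-REST requests (https is an assumption;
# the slide shows http).
PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pug_rest_url(input_spec, operation, output, **options):
    """Assemble <PROLOG>/<INPUT>/<OPERATION>/<OUTPUT>[?OPTIONS]."""
    url = "/".join([PROLOG, input_spec, operation, output])
    if options:
        url += "?" + urlencode(options)
    return url


def cid_input(cids):
    """Build the <INPUT> part for a list of compound identifiers (CIDs)."""
    return "compound/cid/" + ",".join(str(c) for c in cids)


url = pug_rest_url(cid_input([2244, 1983]), "record", "XML", record_type="3d")
print(url)
```

The resulting string can be passed to any HTTP client, for example `urllib.request.urlopen`, to retrieve the records.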
56. Lehigh Univ., Apr. 26, 2019
Request volume limitations
PUG-REST is NOT designed for very large volumes of
requests
(e.g. millions of requests)
Any script or application should not make more than five
requests per second to avoid overloading the PubChem
servers.
If you have a large data set to process, please contact us
for help on optimizing your task.
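One simple way for a script to stay within the five-requests-per-second limit is to enforce a minimum spacing of 0.2 s between consecutive requests. The following is a minimal sketch of that idea; the `Throttle` class is hypothetical and not part of any PubChem library, and the injectable `clock` and `sleep` arguments exist only to make it easy to test.

```python
import time


class Throttle:
    """Space out calls so consecutive requests start at least
    1 / max_per_second seconds apart."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.last = None  # start time of the previous request

    def wait(self):
        """Block until the next request may be sent."""
        now = self.clock()
        if self.last is not None:
            delay = self.min_interval - (now - self.last)
            if delay > 0:
                self.sleep(delay)
                now = self.clock()
        self.last = now
```

A script would create one `Throttle()` and call its `wait()` method immediately before each request. This only limits the request rate; it does not make PUG-REST suitable for millions of requests.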
57. Lehigh Univ., Apr. 26, 2019
PubChemRDF
RDF = Resource Description Framework
Encodes PubChem information using RDF.
Helps researchers work with PubChem data on local
computing resources using semantic web technologies.
Harnesses ontological frameworks to help facilitate
PubChem data sharing, analysis, and integration with
resources external to the National Center for
Biotechnology Information (NCBI) and across scientific domains.
58. Lehigh Univ., Apr. 26, 2019
PubChemRDF for data exchange and integration
(RDF: Resource Description Framework)
PubChemRDF from FTP can be loaded into:
• Triple stores (Apache Jena TDB, OpenLink Virtuoso, ……), queried through a SPARQL query interface
• RDF-aware graph databases (Neo4j, …), explored with graph traversal algorithms
Programmatic access to PubChemRDF data is also available
through a REST-ful interface.
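To give a feel for what querying RDF data involves, the sketch below stores a few (subject, predicate, object) triples in a plain list and answers a question by pattern matching and a two-step join, which is the basic operation a SPARQL query is built on. It is a toy illustration only: the `ex:` names and the data are hypothetical placeholders, not actual PubChemRDF vocabulary or PubChem records, and a real setup would use a triple store rather than a Python list.

```python
# Hypothetical triples: the "ex:" names and values are placeholders,
# not actual PubChemRDF vocabulary or PubChem data.
TRIPLES = [
    ("ex:compoundA", "ex:hasName", "compound A"),
    ("ex:compoundA", "ex:testedIn", "ex:assay1"),
    ("ex:compoundB", "ex:testedIn", "ex:assay1"),
    ("ex:assay1", "ex:hasTarget", "ex:proteinX"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def compounds_tested_against(triples, target):
    """Two-step join: target -> assays -> compounds tested in them."""
    assays = {s for s, _, _ in match(triples, p="ex:hasTarget", o=target)}
    return sorted(s for s, _, o in match(triples, p="ex:testedIn")
                  if o in assays)


print(compounds_tested_against(TRIPLES, "ex:proteinX"))
```

Because every fact has the same three-part shape, triples from a user's own data can be appended to the same collection and queried together, which is the integration scenario the slide describes.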
60. Lehigh Univ., Apr. 26, 2019
Summary
• PubChem is the largest source of publicly
available chemical information, collected from
more than 600 data sources.
• In addition to bioactivity data generated through
high-throughput screening, PubChem contains a
substantial amount of bioactivity information
extracted from scientific articles.
• Chemical vendor and patent information for
compounds in PubChem helps prioritize hit
compounds for further screening.
61. Lehigh Univ., Apr. 26, 2019
Summary
• PubChem supports programmatic access to its
data, which makes it possible to build an automated
virtual screening pipeline.
• PubChemRDF allows users to download
PubChem data to a local computing facility and
integrate them with their own data.
• PubChem data can be used for developing
computational prediction models for bioactivity or
toxicity of molecules.
63. Lehigh Univ., Apr. 26, 2019
Acknowledgements
The PubChem Team
Evan Bolton, Jie Chen, Tiejun Cheng, Asta Gindulyte,
Jia He, Siqian He, Qingliang Li, Ben Shoemaker,
Paul Thiessen, Bo Yu, Leonid Zaslavsky, Jian Zhang
Funding from the National Library of Medicine
PubChem data contributors and users
64. Lehigh Univ., Apr. 26, 2019
Thank you for your attention.
Questions?
Email: sunghwan.kim@nih.gov
kimsungh@ncbi.nlm.nih.gov