PubChem for Drug Discovery in the Age of Big Data and Artificial Intelligence
Sunghwan Kim, Ph.D., M.Sc.
1. What is PubChem?
▪ https://pubchem.ncbi.nlm.nih.gov
▪ Public chemical database at NIH.
▪ Contains information on various chemical entities:
• Small molecules
• siRNAs & miRNAs
• Carbohydrates
• Lipids
• Peptides
• Chemically modified macromolecules
• ……
PubChem is a Public Chemical Information Resource
PubChem is a Data Aggregator
PubChem Sources: https://pubchem.ncbi.nlm.nih.gov/sources
790+ data sources:
• Government agencies
• Academic institutions
• Publishers
• Pharma companies
• Chemical vendors
• Scientific databases
Users
o Biomedical Researchers
• Chemical biology
• Medicinal chemistry
• Drug design & discovery
• Cheminformatics
o Data scientists
o Patent agents/examiners
o Chemical safety officers
o Educators/librarians
o Students
[Figure: Monthly Usage Statistics (Interactive Users Only). Y-axis: unique monthly users (millions), 0 to 6; x-axis: time. Source: Google Analytics]
▪ 5 million unique interactive users per month at peak (Oct. 2021)
▪ Programmatic requests are not included.
▪ These statistics are therefore a lower bound on actual usage.
2. What does PubChem have?
PubChem Data Content
• Structures and properties
• Spectra
• Chemical health & safety
• Bioactivity
• Chemical vendors & synthesis
• Drugs
• Clinical trials
• Patents
• Scientific articles
3. Navigating PubChem
https://pubchem.ncbi.nlm.nih.gov
Gene/Protein Target Page
➢ Suppose that you want to:
o Retrieve ALL active compounds
against a given protein/gene target
(e.g., HMGCR=3-hydroxy-3-methylglutaryl-CoA reductase).
• To identify common chemical scaffolds responsible for bioactivity.
• To build a quantitative structure-activity relationship (QSAR) model.
→Gene/Protein Target page
• Provides a target-centric view of PubChem data.
• Organizes all data available in PubChem for a given gene/protein.
https://pubchem.ncbi.nlm.nih.gov
https://pubchem.ncbi.nlm.nih.gov/#query=HMGCR&tab=gene
https://pubchem.ncbi.nlm.nih.gov/gene/3156
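The same target-centric lookup can be scripted. The following is a minimal sketch, not PubChem's own client code: it only builds PUG-REST URLs for "assays tested against a gene" and "compounds active in an assay". The function names are hypothetical, and the URL patterns (`gene/geneid/<id>/aids` and `assay/aid/<aid>/cids?cids_type=active`) are assumptions that should be checked against the PUG-REST documentation before use.

```python
# Base address of the PUG-REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def gene_aids_url(gene_id: int) -> str:
    """URL that lists assay IDs (AIDs) tested against an NCBI Gene ID.

    Assumed pattern; verify against the PUG-REST documentation.
    """
    return f"{PUG_REST}/gene/geneid/{gene_id}/aids/JSON"


def assay_active_cids_url(aid: int) -> str:
    """URL that lists compound IDs (CIDs) reported active in one assay.

    Assumed pattern; verify against the PUG-REST documentation.
    """
    return f"{PUG_REST}/assay/aid/{aid}/cids/JSON?cids_type=active"


if __name__ == "__main__":
    # Gene ID 3156 is HMGCR, as in the Gene/Protein Target page example.
    print(gene_aids_url(3156))
    # AID 1159531 is the Tox21 assay used in the showcase of this deck.
    print(assay_active_cids_url(1159531))
```

Looping over the AIDs returned by the first URL and collecting the active CIDs from the second would give the compound set needed for scaffold analysis or QSAR modeling.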
Patent View Page
➢ Suppose that you want to:
o Retrieve ALL chemicals mentioned in a given patent document.
→Patent View page
• Provides a list of chemicals “mentioned” in the patent application/grant.
• No information on why they are mentioned.
(e.g., as subject matter or as prior art?)
• Other information, including:
- Title, abstract, date, inventor, …
- International patent classification (IPC) codes
https://pubchem.ncbi.nlm.nih.gov/#query=US2019183840
https://pubchem.ncbi.nlm.nih.gov/patent/US2019183840
4. Programmatic Access to PubChem
➢ PubChem users have very diverse
backgrounds/interests.
➢ PubChem’s web interfaces are optimized
to perform commonly requested tasks
interactively.
➢ Everything you can do with PubChem
through the web browser can be
automated through PubChem’s
programmatic interfaces.
➢ Programmatic access enables one to do
much more complicated tasks that cannot
be done through the web browser.
➢ Multiple programmatic access routes
➢ Two major programmatic access methods
o PUG-REST (primarily for computed properties).
https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest
o PUG-View (primarily for text information).
https://pubchemdocs.ncbi.nlm.nih.gov/pug-view
➢ Request volume limitation:
o No more than 5 requests per second
(See more at: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access$_RequestVolumeLimitations)
o Violators/abusers may be blocked for a certain period of time.
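A minimal sketch of a PUG-REST client that respects the 5-requests-per-second limit is given below. It is illustrative only: `property_url`, `Throttle`, and `fetch_json` are hypothetical helper names, the property names are examples, and spacing calls at least 0.2 s apart is one simple way to stay under the limit, not an official recommendation.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable, Optional

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cids: Iterable[int], properties: Iterable[str]) -> str:
    """Builds a PUG-REST URL requesting computed properties for some CIDs."""
    cid_part = ",".join(str(cid) for cid in cids)
    prop_part = ",".join(properties)
    return f"{PUG_REST}/compound/cid/{cid_part}/property/{prop_part}/JSON"


class Throttle:
    """Keeps successive calls at least 1/max_per_second seconds apart."""

    def __init__(
        self,
        max_per_second: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.last: Optional[float] = None  # time of the previous call

    def wait(self) -> None:
        """Sleeps just long enough to honor the minimum interval."""
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


def fetch_json(url: str, throttle: Throttle) -> dict:
    """Waits for the throttle, then downloads and parses one JSON response."""
    throttle.wait()
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds and prints the URL; no request is sent here.
    print(property_url([2244, 3672], ["MolecularFormula", "MolecularWeight"]))
```

To download data, create one `Throttle()` and pass it to every `fetch_json` call so that all requests share the same pacing.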
[Diagram: programmatic access routes: Entrez Utilities (E-Utils), Power User Gateway (PUG), PUG-SOAP, PUG-REST, PubChemRDF REST, PUG-View]
➢ Bulk Download
o PubChem FTP Site
ftp://ftp.ncbi.nlm.nih.gov/pubchem
o PubChemRDF (RDF = Resource Description Framework)
https://pubchemdocs.ncbi.nlm.nih.gov/rdf
5. Showcase: Bioactivity Prediction Model Building with PubChem Data
▪ Involved in regulation of gene expression in
various biological processes.
▪ Potential roles in:
• metabolic signaling pathways
• skin alopecia (spot baldness)
• dermal cysts
• cardiac development
• insulin sensitization
• ……
▪ Let’s build binary classifiers (i.e., active vs.
inactive) for chemical modulators of RXRA
Retinoid X Receptor α (RXRA)
PDB ID: 1FBY
Tox21 (AID 1159531)
• Quantitative HTS (qHTS) data for 10K compounds
• Predominantly inactive
• After preprocessing, split 90%/10% into:
- Training (4,916 compounds): 471 actives, 4,445 inactives
- Test (547 compounds): 53 actives, 494 inactives
ChEMBL (45 assays)
• Data extracted from journal articles
• Predominantly active
• After preprocessing:
- External 1 (222 compounds): 205 actives, 17 inactives
NCGC (2 assays)
• qHTS data
• Predominantly inactive
• Some overlap w/ Tox21
• After preprocessing:
- External 2 (489 compounds): 20 actives, 469 inactives
All data available in PubChem.
Data sets
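The 90%/10% split of the Tox21 data can be sketched as a per-class (stratified) random split. The slides do not say how the split was made, so this is an assumption: each class is shuffled separately and 10% of it, rounded up, goes to the test set. With the class totals implied by the slide (524 actives, 4,939 inactives) that rule happens to reproduce the training and test counts shown. The function name and the seed are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], test_percent: int = 10, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Returns (train_indices, test_indices), split separately per class.

    For each class, test_percent of its members (rounded up) go to the
    test set; the rest go to the training set.
    """
    rng = random.Random(seed)
    train: List[int] = []
    test: List[int] = []
    for cls in sorted(set(labels)):
        members = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(members)
        # Integer ceiling of len(members) * test_percent / 100.
        n_test = (len(members) * test_percent + 99) // 100
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


if __name__ == "__main__":
    # 1 = active, 0 = inactive; class totals implied by the slide.
    labels = [1] * 524 + [0] * 4939
    train_idx, test_idx = stratified_split(labels)
    print(len(train_idx), sum(labels[i] for i in train_idx))
    print(len(test_idx), sum(labels[i] for i in test_idx))
```

Splitting per class keeps the active-to-inactive ratio nearly identical in both sets, which matters when actives make up less than 10% of the data.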
❑ Molecular descriptors
• Generated using PaDEL [Yap CW (2011). J. Comput. Chem., 32 (7): 1466-1474]
Model Building
Abbreviation  Name                         Length
AP            AtomPairs 2D fingerprint     780
ESTAT         EState fingerprint           79
EXTFP*        CDK extended fingerprint     1,024
FP*           CDK fingerprint              1,024
GOFP*         CDK graph-only fingerprint   1,024
KR            Klekota-Roth fingerprint     4,860
MACCS         MACCS fingerprint            166
PUB           PubChem fingerprint          881
SUB           Substructure fingerprint     307
* Hashed fingerprints
❑ Machine-learning algorithms (implemented in scikit-learn)
Abbreviation  Name                    Hyperparameters optimized
NB            Naïve Bayes             (10^-10 ~ 1)
DT            Decision tree           max_depth_range (3 ~ 7); min_samples_split_range (3 ~ 7); min_samples_leaf_range (2 ~ 6)
kNN           k-nearest neighbors     weights (uniform, minkowski, jaccard); n_neighbors (1 ~ 25)
RF            Random forest           n_estimators (10 ~ 200)
SVM           Support vector machine  C (2^-10 ~ 2^10); (2^-10 ~ 2^10)
NN            Neural network          solver (lbfgs or adam); (10^-7 ~ 10^7)
▪ 10-fold cross-validation was used for hyperparameter optimization.
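A minimal sketch of this kind of hyperparameter search in scikit-learn follows, using a random forest as the example. It is not the authors' script: the data are synthetic binary features standing in for fingerprints, the grid is cut down to two values of `n_estimators` to keep it fast, and the fold shuffling and random seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for a fingerprint matrix (about 10% actives).
X, y = make_classification(
    n_samples=300,
    n_features=30,
    n_informative=8,
    weights=[0.9, 0.1],
    random_state=0,
)
X = (X > 0).astype(int)  # binarize so features resemble fingerprint bits

# 10-fold stratified cross-validation, scored by area under the ROC curve.
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50]},  # reduced grid for illustration
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`search.best_estimator_` is refit on the full training data with the selected setting and can then be applied to held-out test or external sets.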
Model Performance Evaluation
▪ Area under the Receiver operating characteristic curve (AUC)
→ Used for hyperparameter optimization.
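▪ Balanced accuracy: $\mathrm{BACC} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right) = \frac{1}{2}\left(\mathrm{SENS} + \mathrm{SPEC}\right)$
▪ Sensitivity: $\mathrm{SENS} = \frac{TP}{TP+FN}$
▪ Specificity: $\mathrm{SPEC} = \frac{TN}{TN+FP}$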
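The definitions above translate directly into code. The sketch below uses hypothetical function names and a made-up classifier that labels every test compound inactive, to show why balanced accuracy is more informative than plain accuracy on data that are predominantly inactive (53 actives and 494 inactives, as in the test set).

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true actives that are predicted active."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of true inactives that are predicted inactive."""
    return tn / (tn + fp)


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity and specificity."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))


if __name__ == "__main__":
    # Hypothetical classifier that predicts "inactive" for all 547 compounds.
    tp, fn, tn, fp = 0, 53, 494, 0
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    print(round(accuracy, 3))                 # plain accuracy looks high
    print(balanced_accuracy(tp, fn, tn, fp))  # BACC exposes the failure
```

Plain accuracy is about 0.903 here even though no active compound is found, while BACC is 0.5, the value expected from an uninformative classifier.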
❑ Performance of the models
▪ AUC scores of 0.7 were observed for models
developed using:
PubChem/MACCS/CDK-FP with
NN/SVM/RF/kNN
▪ Maximum AUC score (0.77):
PubChem fingerprint with RF
▪ Similar trend was observed for the
performance in terms of BACC scores
(not shown here).
[Figure: Area under ROC curve (AUC)]
[Figure: Area under ROC curve (AUC), inactive-to-active ratio = 1. Panels: NCGC, ChEMBL]
General applicability of the models
• PubChem is the largest source of publicly available chemical
information, collected from hundreds of data sources.
• In addition to bioactivity data generated through high-throughput
screenings, PubChem contains a substantial amount of bioactivity
information extracted from scientific articles.
• Chemical vendor and patent information for compounds in
PubChem helps prioritize hit compounds for further screening.
Summary
• PubChem supports programmatic access to its data, allowing for
building an automated virtual screening pipeline.
• PubChemRDF allows users to download PubChem data on a
local computing facility and integrate them with their own data.
• PubChem data can be used for developing computational
prediction models for bioactivity or toxicity of molecules.
Summary
Acknowledgements
▪ The PubChem Team: Evan Bolton, Jie Chen, Tiejun Chung, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Ben Shoemaker, Zhi Sun, Paul Thiessen, Bo Yu, Leonid Zaslavsky, Jian Zhang
▪ Funding: Intramural Research Program of the National Library of Medicine
Thank you for your attention.
Questions?
Sunghwan Kim (kimsungh@ncbi.nlm.nih.gov)
❑ References
▪ Getting the most out of PubChem for virtual screening
S. Kim, Expert Opin. Drug Discov. 2016, 11(9), 843-855.
▪ PubChem in 2021: new data content and improved web interfaces
S. Kim et al., Nucleic Acids Res. 2021, 49(D1), D1388-D1395.
▪ An update on PUG-REST: RESTful interface for programmatic access to PubChem
S. Kim et al., Nucleic Acids Res. 2018, 46(W1), W563-W570.
▪ PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem
S. Kim et al., Nucleic Acids Res. 2015, 43(W1), W605-W611.

 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
American Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxAmerican Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxabhishekdhamu51
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 

Kürzlich hochgeladen (20)

Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
American Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxAmerican Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptx
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 

PubChem for drug discovery in the age of big data and artificial intelligence

  • 1. PubChem for Drug Discovery in the Age of Big Data and Artificial Intelligence Sunghwan Kim, Ph.D., M.Sc.
  • 2. 1. What is PubChem?
  • 3. PubChem is a Public Chemical Information Resource ▪ https://pubchem.ncbi.nlm.nih.gov ▪ Public chemical database at NIH. ▪ Contains information on various chemical entities: • Small molecules • siRNAs & miRNAs • Carbohydrates • Lipids • Peptides • Chemically modified macromolecules • ……
  • 4. PubChem is a Data Aggregator. PubChem Sources: https://pubchem.ncbi.nlm.nih.gov/sources ▪ 790+ data sources: government agencies, academic institutions, publishers, pharma companies, chemical vendors, scientific databases ▪ Users: o Biomedical researchers (chemical biology, medicinal chemistry, drug design & discovery, cheminformatics) o Data scientists o Patent agents/examiners o Chemical safety officers o Educators/librarians o Students
  • 5. Monthly Usage Statistics (Interactive Users Only) [Chart: unique monthly users (millions) over time. Source: Google Analytics] ▪ 5 million unique interactive users per month at peak (Oct. 2021) ▪ Programmatic requests are not included. ▪ These statistics are lower bounds.
  • 6. 2. What does PubChem have?
  • 11. Structures and properties Spectra Chemical health & safety Bioactivity Chemical vendors & synthesis PubChem Data Content
  • 35. Gene/Protein Target Page ➢ Suppose that you want to: o Retrieve ALL active compounds against a given protein/gene target (e.g., HMGCR=3-hydroxy-3-methylglutaryl-CoA reductase). • To identify common chemical scaffolds responsible for bioactivity. • To build a quantitative structure-activity relationship (QSAR) model. →Gene/Protein Target page • Provides a target-centric view of PubChem data. • Organizes all data available in PubChem for a given gene/protein.
  • 43. Patent View Page ➢ Suppose that you want to: o Retrieve ALL chemicals mentioned in a given patent document. →Patent View page • Provides a list of chemicals “mentioned” in the patent application/grant. • No information on why they are mentioned. (e.g., as a subject matter or as a prior art?) • Other information, including: - Title, abstract, date, inventor, … - International patent classification (IPC) codes
  • 53. ➢ PubChem users have very diverse backgrounds/interests. ➢ PubChem’s web interfaces are optimized to perform commonly requested tasks interactively. ➢ Everything you can do with PubChem through the web browser can be automated through PubChem’s programmatic interfaces. ➢ Programmatic access enables one to do much more complicated tasks that cannot be done through the web browser.
  • 54. ➢ Multiple programmatic access routes: Entrez Utilities (E-Utils), Power User Gateway (PUG), PUG-SOAP, PUG-REST, PubChem RDF REST, PUG-View ➢ Two major programmatic access methods o PUG-REST (primarily for computed properties). https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest o PUG-View (primarily for text information). https://pubchemdocs.ncbi.nlm.nih.gov/pug-view ➢ Request volume limitation: o No more than 5 requests per second (See more at: https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access$_RequestVolumeLimitations) o Violators/abusers may be blocked for a certain period of time.
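As a rough illustration of the PUG-REST route and the five-requests-per-second limit on this slide, the sketch below assembles PUG-REST URLs and spaces out successive calls. The helper names (`pug_rest_url`, `Throttle`) are hypothetical and not part of any PubChem library; the property names and the `cids_type=active` option should be checked against the PUG-REST documentation; and no request is actually sent here.

```python
import time

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def pug_rest_url(domain, namespace, identifiers, operation, output="JSON", **options):
    """Assemble a PUG-REST URL of the form input/operation/output[?options]."""
    ids = ",".join(str(i) for i in identifiers)
    url = f"{BASE}/{domain}/{namespace}/{ids}/{operation}/{output}"
    if options:
        url += "?" + "&".join(f"{key}={value}" for key, value in options.items())
    return url

class Throttle:
    """Keep successive calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous call, if any

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now

# Computed properties for one compound (CID 2244 is used only as an example).
props_url = pug_rest_url("compound", "cid", [2244],
                         "property/MolecularFormula,MolecularWeight")
# Compounds tested active in the Tox21 assay used later in the deck (AID 1159531).
actives_url = pug_rest_url("assay", "aid", [1159531], "cids", cids_type="active")
```

In a real script, `Throttle().wait()` would be called immediately before each HTTP request so that a loop over many identifiers stays under the stated limit.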
  • 55. ➢ Bulk Download o PubChem FTP Site ftp://ftp.ncbi.nlm.nih.gov/pubchem o PubChem RDF (Resource Description Framework) https://pubchemdocs.ncbi.nlm.nih.gov/rdf
  • 56. 5. Showcase: Bioactivity Prediction Model Building with PubChem Data
  • 57. Retinoid X Receptor α (RXRA), PDB ID: 1FBY ▪ Involved in regulation of gene expression in various biological processes. ▪ Potential roles in: • metabolic signaling pathways • skin alopecia (spot baldness) • dermal cysts • cardiac development • insulin sensitization • …… ▪ Let’s build binary classifiers (i.e., active vs. inactive) for chemical modulators of RXRA.
  • 61. Data sets (all data available in PubChem)
      ▪ Tox21 (AID 1159531): quantitative HTS (qHTS) data for 10K compounds; predominantly inactive. After preprocessing, split 90%/10% into:
          o Training (4,916 compounds): 471 actives, 4,445 inactives
          o Test (547 compounds): 53 actives, 494 inactives
      ▪ ChEMBL (45 assays): data extracted from journal articles; predominantly active. After preprocessing:
          o External 1 (222 compounds): 205 actives, 17 inactives
      ▪ NCGC (2 assays): qHTS data; predominantly inactive; some overlap w/ Tox21. After preprocessing:
          o External 2 (489 compounds): 20 actives, 469 inactives
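The slide gives only the 90%/10% proportions, not how the split was drawn. The sketch below shows one possible class-stratified split over the Tox21 class totals implied by the slide (471 + 53 actives, 4,445 + 494 inactives). The function name `stratified_split`, the random seed, and the choice to round each class's hold-out size up are all assumptions made for illustration; with that rounding the split happens to reproduce the slide's counts.

```python
import random

def stratified_split(labels, test_every=10, seed=0):
    """Return (train, test) index lists, holding out about 1/test_every of each class."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = -(-len(idx) // test_every)  # ceiling division (assumed rounding rule)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

# 1 = active, 0 = inactive; totals taken from the training + test counts on the slide.
labels = [1] * (471 + 53) + [0] * (4445 + 494)
train_idx, test_idx = stratified_split(labels)
train_actives = sum(labels[i] for i in train_idx)
test_actives = sum(labels[i] for i in test_idx)
```

Stratifying matters here because the data are strongly imbalanced (roughly nine inactives per active in the training set), so an unstratified 10% sample could end up with noticeably fewer actives than expected.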
  • 64. Model Building ❑ Molecular descriptors • Generated using PaDEL [Yap CW (2011). J. Comput. Chem., 32 (7): 1466-1474]
      Abbreviation   Name                          Length
      AP             AtomPairs 2D fingerprint      780
      ESTAT          Estate fingerprint            79
      EXTFP*         CDK extended fingerprint      1,024
      FP*            CDK fingerprint               1,024
      GOFP*          CDK graph only fingerprint    1,024
      KR             Klekota-Roth fingerprint      4,860
      MACCS          MACCS fingerprint             166
      PUB            PubChem fingerprint           881
      SUB            Substructure fingerprint      307
      * Hashed fingerprints
  • 65. Model Building ❑ Machine-learning algorithms (implemented in scikit-learn)
      Abbreviation   Name                     Hyperparameters optimized
      NB             Naïve Bayes              α (10^-10 ~ 1)
      DT             Decision tree            max_depth_range (3 ~ 7); min_samples_split_range (3 ~ 7); min_samples_leaf_range (2 ~ 6)
      kNN            K-Nearest neighbors      weights (uniform, minkowski, jaccard); n_neighbors (1 ~ 25)
      RF             Random forest            n_estimators (10 ~ 200)
      SVM            Support vector machine   C (2^-10 ~ 2^10); γ (2^-10 ~ 2^10)
      NN             Neural network           solver (lbfgs or adam); α (10^-7 ~ 10^7)
      ▪ 10-fold cross-validation was used for hyperparameter optimization.
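To make the search procedure concrete, here is a minimal standard-library sketch of a power-of-two hyperparameter grid and a 10-fold index splitter. It is not the author's code: the deck used scikit-learn, the helper names (`power_grid`, `k_fold_indices`) are hypothetical, the exponent step of 1 is an assumption (the slide gives only the range), and naming the second SVM hyperparameter `gamma` is also an assumption.

```python
from itertools import product

def power_grid(base, lo, hi, step=1):
    """Return base**lo, ..., base**hi (inclusive), e.g. 2**-10 ... 2**10."""
    return [float(base) ** e for e in range(lo, hi + 1, step)]

def k_fold_indices(n_samples, k=10):
    """Yield (train, validation) index lists for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

# Candidate SVM settings: every (C, gamma) pair over the assumed exponent grid.
svm_grid = [{"C": c, "gamma": g}
            for c, g in product(power_grid(2, -10, 10), repeat=2)]

# Ten folds over the 4,916 training compounds.
folds = list(k_fold_indices(4916, k=10))
```

Each candidate in `svm_grid` would be fitted on the training part of every fold and scored on the validation part, and the setting with the best mean score kept. The folds here are contiguous for simplicity; in practice the data would normally be shuffled, and stratified by class, before splitting.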
  • 66. Model Performance Evaluation ▪ Area under the receiver operating characteristic curve (AUC) → Used for hyperparameter optimization. ▪ Balanced accuracy: $\mathrm{BACC} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right) = \frac{1}{2}\left(\mathrm{SENS} + \mathrm{SPEC}\right)$ ▪ Sensitivity: $\mathrm{SENS} = \frac{TP}{TP+FN}$ ▪ Specificity: $\mathrm{SPEC} = \frac{TN}{TN+FP}$
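The metrics on this slide can be written out directly. The sketch below is an illustrative standard-library version, not the scikit-learn implementation used in the deck; the function names are hypothetical. AUC is computed from its pairwise interpretation: the probability that a randomly chosen active receives a higher score than a randomly chosen inactive, with ties counted as one half.

```python
def sensitivity(tp, fn):
    """SENS = TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """SPEC = TN / (TN + FP)."""
    return tn / (tn + fp)

def balanced_accuracy(tp, fn, tn, fp):
    """BACC = (SENS + SPEC) / 2."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))

def roc_auc(active_scores, inactive_scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly (ties = 0.5)."""
    wins = 0.0
    for a in active_scores:
        for b in inactive_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(active_scores) * len(inactive_scores))

# A model that calls every one of the 547 test compounds inactive:
# TP = 0, FN = 53, TN = 494, FP = 0.
trivial_bacc = balanced_accuracy(tp=0, fn=53, tn=494, fp=0)
trivial_accuracy = 494 / 547
```

The trivial model is right for about 90% of the test compounds yet has a balanced accuracy of only 0.5, which is why BACC and AUC are more informative than plain accuracy on predominantly inactive data.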
  • 67. Model Performance Evaluation ❑ Performance of the models, measured by area under the ROC curve (AUC) ▪ AUC scores of 0.7 were observed for models developed using PubChem/MACCS/CDK-FP fingerprints with NN/SVM/RF/kNN. ▪ Maximum AUC score (0.77): PubChem fingerprint with RF. ▪ A similar trend was observed for BACC scores (not shown here).
  • 68. General applicability of the models [Figure: area under ROC curve (AUC) on the NCGC and ChEMBL external data sets; inactive-to-active ratio = 1]
  • 69. • PubChem is the largest source of publicly available chemical information, collected from hundreds of data sources. • In addition to bioactivity data generated through high-throughput screenings, PubChem contains a substantial amount of bioactivity information extracted from scientific articles. • Chemical vendor and patent information for compounds in PubChem helps prioritize hit compounds for further screening. Summary
  • 70. • PubChem supports programmatic access to its data, allowing for building an automated virtual screening pipeline. • PubChemRDF allows users to download PubChem data on a local computing facility and integrate them with their own data. • PubChem data can be used for developing computational prediction models for bioactivity or toxicity of molecules. Summary
  • 71. Acknowledgements ▪ The PubChem Team: Evan Bolton Jia He Thiessen Paul Zhi Sun Jie Chen Siqian He Bo Yu Tiejun Chung Qingliang Li Leonid Zaslavsky Asta Gindulyte Ben Shoemaker Jian Zhang ▪ Funding: Intramural Research Program of the National Library of Medicine
  • 72. Thank you for your attention. Questions? Sunghwan Kim (kimsungh@ncbi.nlm.nih.gov)
  • 73. ❑ References ▪ Getting the most out of PubChem for virtual screening S. Kim, Expert Opin. Drug Discov. 2016, 11(9), 843-855. ▪ PubChem in 2021: new data content and improved web interfaces S. Kim et al., Nucleic Acids Res. 2021, 49(D1):D1388–D1395. ▪ An update on PUG-REST: RESTful interface for programmatic access to PubChem S. Kim et al., Nucleic Acids Res. 2018, 46(W1):W563-W570. ▪ PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem S. Kim et al., Nucleic Acids Res. 2015, 43(W1):W605-W611.