Presented on October 20, 2020 at the 9th American Society for Cellular and Computational Toxicology (ASCCT) National Meeting.
==== Abstract ====
PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource at the U.S. National Institutes of Health. It collects chemical information from 750+ data sources and disseminates it to the public free of charge. Arguably, PubChem contains the largest amount of chemical information available in the public domain, with more than 265 million depositor-provided substance descriptions, 100 million unique chemical structures, and 270 million bioactivity outcomes from one million assays covering around twenty thousand unique protein target sequences.
Included in the many types of content in PubChem is toxicological information about chemicals, e.g., human and animal toxicity, ecotoxicity, exposure limits, exposure symptoms, and antidote & emergency treatment. Notably, a substantial amount of toxicological information from resources formerly offered by the TOXicology data NETwork (TOXNET) is now integrated into PubChem, e.g., the Hazardous Substances Data Bank (HSDB), Genetic Toxicology Data Bank (Gene-Tox), Chemical Carcinogenesis Research Information System (CCRIS), LactMed, and LiverTox. In addition, PubChem contains a large amount of bioactivity and toxicity screening data that can be used to build toxicity prediction models based on statistical and machine-learning approaches. This presentation provides an overview of PubChem’s toxicological information and describes how open data in PubChem can be used to develop prediction models for chemical toxicity.
PubChem as an Emerging Toxicological Information Resource
Sunghwan Kim (sunghwan.kim@nih.gov), Jian Zhang, Paul A. Thiessen, Asta Gindulyte, Pertti J. Hakkinen & Evan E. Bolton
Tools and Services
Classification browser
https://pubchem.ncbi.nlm.nih.gov/classification
Identifier exchange service
https://pubchemdocs.ncbi.nlm.nih.gov/identifier-exchange-service
Score matrix service
https://pubchemdocs.ncbi.nlm.nih.gov/score-matrix-service
Structure download
https://pubchem.ncbi.nlm.nih.gov/pc_fetch/pc_fetch.cgi
Assay download
https://pubchem.ncbi.nlm.nih.gov/assay/assaydownload.cgi
Bulk download via FTP
ftp://ftp.ncbi.nlm.nih.gov/pubchem
Programmatic access
PUG-REST
https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest
PUG-View
https://pubchemdocs.ncbi.nlm.nih.gov/pug-view
PubChem RDF
https://pubchem.ncbi.nlm.nih.gov/rdf
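As a rough illustration of programmatic access, the sketch below builds one PUG-REST request URL (computed properties of a compound) and one PUG-View request URL (annotations of a compound). The helper names are hypothetical, and the property names and the "Toxicity" heading are example values assumed here, not taken from the poster. The fetch helper needs network access and is not called in the sketch.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
PUG_VIEW = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view"


def pug_rest_property_url(cid, properties, fmt="JSON"):
    """Build a PUG-REST URL for selected computed properties of one compound."""
    return f"{PUG_REST}/compound/cid/{int(cid)}/property/{','.join(properties)}/{fmt}"


def pug_view_compound_url(cid, heading=None, fmt="JSON"):
    """Build a PUG-View URL for the annotations of one compound.

    `heading` optionally narrows the response to one annotation section;
    the value used in the example below is an assumed section name.
    """
    url = f"{PUG_VIEW}/data/compound/{int(cid)}/{fmt}"
    if heading:
        url += "?" + urlencode({"heading": heading}, quote_via=quote)
    return url


def fetch_json(url, timeout=30):
    """Download a URL and decode the JSON body (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# CID 241 is benzene, the compound used in the HSDB example on this poster.
rest_url = pug_rest_property_url(241, ["MolecularFormula", "MolecularWeight"])
view_url = pug_view_compound_url(241, heading="Toxicity")
```

Keeping URL construction separate from the download step makes it easy to inspect or log a request before sending it, which helps when staying within PubChem's usage limits.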
PubChem (https://pubchem.ncbi.nlm.nih.gov)
Public chemical database at the U.S. National
Institutes of Health.
Contains a large amount of chemical information,
collected from 750+ data sources.
Organizes its data into three major databases:
• Substance: archives depositor-submitted
descriptions of substances.
• Compound: contains unique chemical structures
extracted from the Substance database.
• BioAssay: stores biological assay descriptions and
test results.
Has additional data collections that provide
information on genes, proteins, pathways, scientific
articles, and patents related to chemicals.
• Gene (ex. EGFR)
https://pubchem.ncbi.nlm.nih.gov/gene/1956
• Protein (ex. HMG-CoA reductase)
https://pubchem.ncbi.nlm.nih.gov/protein/P04035
• Pathway (ex. TCA Cycle)
https://pubchem.ncbi.nlm.nih.gov/pathway/Reactome:R-HSA-1428517
• US Patent (ex. US7030151B2)
https://pubchem.ncbi.nlm.nih.gov/patent/US-7030151-B2
Acknowledgements
This work was supported by the Intramural Research
Program of the National Library of Medicine, National
Institutes of Health.
Predictive Toxicology Modelling Using PubChem
Bioactivity data in PubChem
High-throughput screening data
• Molecular Libraries Program
• Tox21
Literature-extracted bioactivity data
• Scientific articles
• Patent documents
PubChem’s bioactivity data have been used to
develop computational models for chemical toxicity
prediction with machine-learning algorithms such
as neural networks, support vector machines, and
random forests.
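A common first step before training such models is turning assay outcomes into binary labels. The sketch below is one possible way to do that, not the procedure used in any particular study. It assumes a CSV table with PUBCHEM_CID and PUBCHEM_ACTIVITY_OUTCOME columns holding text outcomes such as "Active" and "Inactive". The function name and the toy rows are made up for illustration and are not real assay results.

```python
import csv
import io

# Outcomes kept as training labels; anything else (e.g. inconclusive) is dropped.
LABELS = {"active": 1, "inactive": 0}


def outcomes_to_labels(csv_text,
                       id_column="PUBCHEM_CID",
                       outcome_column="PUBCHEM_ACTIVITY_OUTCOME"):
    """Map compound IDs to 1 (active) or 0 (inactive).

    Rows without a compound ID or with an unusable outcome are skipped,
    and compounds with conflicting outcomes are discarded entirely.
    """
    votes = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        cid = (row.get(id_column) or "").strip()
        outcome = (row.get(outcome_column) or "").strip().lower()
        if not cid or outcome not in LABELS:
            continue
        votes.setdefault(cid, set()).add(LABELS[outcome])
    return {int(cid): next(iter(seen))
            for cid, seen in votes.items() if len(seen) == 1}


# Toy table with hypothetical identifiers and outcomes.
toy_table = """PUBCHEM_SID,PUBCHEM_CID,PUBCHEM_ACTIVITY_OUTCOME
101,1,Active
102,1,Active
103,2,Inactive
104,3,Inconclusive
105,4,Active
106,4,Inactive
107,,Active
"""

labels = outcomes_to_labels(toy_table)
```

Discarding compounds with conflicting outcomes is a conservative choice. Other options, such as majority voting across replicate measurements, may suit some data sets better.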
Toxicological Information in PubChem
Human/non-human toxicity
Health effects (short- and long-term)
Populations at special risk
Exposure routes, symptoms & limits (REL, PEL)
Average daily intake, body burden & target organs
Ecotoxicity & environmental concentration/degradation
Many others
→ All of these data are collected from authoritative
sources, including U.S. NLM/NIH, EPA, CDC, USGS,
OSHA, etc.
Summary
PubChem contains a large amount of publicly
available chemical information.
It has a wide range of toxicological information,
collected from many authoritative sources.
Data from several ToxNet databases have been
integrated into PubChem (e.g., HSDB, ChemIDPlus,
CCRIS, Gene-Tox, LiverTox, and LactMed).
PubChem’s bioactivity data are used to develop
computational models for chemical toxicity
prediction.
Efforts to add additional toxicology-related data
sources are underway.
ToxNet Data in PubChem
(https://pubchemdocs.ncbi.nlm.nih.gov/toxnet)
TOXicology Data NETwork.
Consisted of toxicological databases at NLM.
Retired in December 2019 as part of a broader NLM
reorganization.
Most of its content has been integrated into other NLM
products and services.
Toxicological information on chemicals is integrated
into PubChem.
ChemIDPlus
HSDB (Hazardous Substances Data Bank)
→ Integrated as annotations of compounds.
Available through the Compound Summary page.
(ex.) HSDB data for benzene:
https://pubchem.ncbi.nlm.nih.gov/compound/241#source=Hazardous%20Substances%20Data%20Bank%20(HSDB)
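The HSDB link above follows a simple pattern: a Compound Summary URL plus a "source=" fragment containing the percent-encoded data source name. A minimal sketch of building such a link is shown below. The function name is hypothetical, and it is an assumption that other source names work the same way as the HSDB example.

```python
from urllib.parse import quote


def compound_source_url(cid, source_name):
    """Build a Compound Summary URL filtered to one data source.

    Spaces are encoded as %20 and parentheses are left as-is,
    matching the HSDB example for benzene on this poster.
    """
    fragment = "source=" + quote(source_name, safe="()")
    return f"https://pubchem.ncbi.nlm.nih.gov/compound/{int(cid)}#{fragment}"


hsdb_benzene = compound_source_url(241, "Hazardous Substances Data Bank (HSDB)")
```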
Gene-Tox (Genetic Toxicology Data Bank)
CCRIS (Chemical Carcinogenesis Research
Information System)
→ Integrated as bioassays and substances tested in
them.
(ex.) Mutagenicity study data from CCRIS
https://pubchem.ncbi.nlm.nih.gov/bioassay/1259407
(ex.) Aniline data in CCRIS
https://pubchem.ncbi.nlm.nih.gov/substance/363897994
LiverTox & LactMed
→ Data downloadable from PubChem Sources:
https://pubchem.ncbi.nlm.nih.gov/source/23224
https://pubchem.ncbi.nlm.nih.gov/source/15404