Kamel Mansouri, Chris Grulke, Ann Richard*, and Antony J. Williams
National Center for Computational Toxicology, US EPA, RTP, NC
The influence of data curation on
QSAR Modeling – examining issues
of quality versus quantity of data
March, 2016
ACS Spring Meeting, San Diego, CA
This work was reviewed by EPA and approved for presentation but does not necessarily reflect official Agency policy.
Sustainable
Chemistry
Human Health
Hazard
Ecosystems
Hazard
Environmental
Persistence
“design and use of chemicals that
minimize impacts to human health,
ecosystems and the environment”
 Screen chemicals for toxicity &
exposure potential
 Prioritize testing of thousands of
existing chemicals lacking data
 Evaluate how chemical
transformations impact hazard &
environmental persistence
DSSTox
Chemical
IDs
Chemistry Foundation to Support
Multiple NCCT & EPA Projects
EPA SRS
(TSCA, EcoTox)
EDSP21
ToxCast/Tox21
CPCAT
ToxRef
1:1 mapping
CAS:Name:structure
Quality scores
>700K substances
“Chemical Safety
& Sustainability”
(CSS) Public
Web-accessible
Dashboards
Knowledge-
informed features
& chemotypes
ToxCast/Tox21
High-throughput
screening (HTS)
projects
CAS look-up
List curation tools
Chemical search by
CAS-names
SMILES
Structures
Structure file
downloads
PK/ADME model
inputs & outputs
Structure “cleaning”
SAR-ready structure files
Structure-based predictions
Similarity searching
Fingerprints, feature sets
QSAR models;
phys-chem properties
Phys-chem calc &
measured properties
DSSTox Update
DSSTox_v1 DSSTox_v2
 Convert DSSTox tables to MySQL
 Develop curation interface &
cheminformatics workflow
 Expand chemical content
 Web-services & Dashboard access
 Manually curated 25K substance
records
 EPA-focus, environmental tox datasets
 Emphasis on accurate CAS-name-
structure annotations
 Public resource for high-quality
structure-data files (SDF)
Building DSSTox_v2
DSSTox_v1
~21K
~5K
1:1:1 CAS:name:structure
No conflicts
allowed
Environmental/Toxicity/Exposure Relevance
Manually Curated
EPA SRS
ChemID
PubChem
EPA ACToR
EPA-relevant lists
~735K ~150K
>1M
~54K
~310K ~625K ~5K ~16K ~33K ~101K
DSSTox QC Levels:
DSSTox_High
DSSTox_Low
Public_High
Public_Med
Public_Low
Public_Untrusted (not loaded)
DSSTox_v2 Totals
DSSTox_v2 QC levels (Curated: 1-2; Public: 3-6):
1. High
2. Low
3. High
4. Med
5. Low
6. Untrusted
7. Incomplete
QC Level Totals (12Jun2015)
DSSTox_High: 4535 (highest confidence)
DSSTox_Low: 16K
Public_High: 33K
Public_Medium: 101K
Public_Low: 584K
Public_Untrusted: ~310K pending
~150K pending
~150K substances in top 4 QC quality bins
(single source)
Building the Chemistry
Software Architecture
External Services
APIs and WEB SERVICES
DATA LAYER
Chemical-Biology Platforms
USER INTERFACES
UI COMPONENTS
Building Predictive Models
• Receptor-mediated
activities (e.g., ER, AR)
• Toxicity endpoint QSAR
models
• Metabolism prediction
• In vitro-in vivo toxicity
extrapolation models (IVIVE)
• Read-across toxicity
estimation
• Near and far-field exposure
models
NCCT Modeling Activities
• Analytical dust and
media screening for
environmental
contaminants
• Biotransformation,
transport, fate modeling
• ADME, PBPK models
• Chemical use category
modeling
• Green chemistry
EPA-NCCT Collaborations
Physical chemical properties
Water solubility
LogP
Partition coefficients
(air/water/soil)
MP, BP, Vapor Pressure
Building a physical chemical
property database
Physical
chemical
properties
Measured
(where available)
Predicted
(where not available)
 Collect data
(sort wheat from chaff)
 Determine how to best
mesh data together
(structure? CAS? Other?)
 Check data self-consistency
 Structure validation vs.
property value validation
(very different challenges)
 Build training/test sets of
chemicals with
experimental values
(collapse or resolve dups)
 Apply structure-cleaning
workflow to produce
“QSAR-ready” structures
 Calculate descriptors
 Apply modeling algorithm
(regression, ML, etc.)
Store measured & predicted values in relational
database, linked to data sources, model details, etc.
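The storage idea above (measured and predicted values kept side by side, each linked to its source) can be pictured with a minimal sketch using Python's built-in sqlite3 module. Every table and column name here is hypothetical and is not taken from the DSSTox schema. The deck says the DSSTox tables were converted to MySQL, so SQLite is only a stand-in for illustration.

```python
import sqlite3

# Hypothetical schema: names are illustrative only, not the DSSTox tables.
SCHEMA = """
CREATE TABLE substance (
    substance_id TEXT PRIMARY KEY,
    casrn        TEXT,
    name         TEXT
);
CREATE TABLE source (
    source_id   INTEGER PRIMARY KEY,
    description TEXT NOT NULL          -- e.g. a data file or a model name
);
CREATE TABLE property_value (
    value_id      INTEGER PRIMARY KEY,
    substance_id  TEXT NOT NULL REFERENCES substance(substance_id),
    property_name TEXT NOT NULL,       -- e.g. 'LogKow'
    value         REAL NOT NULL,
    kind          TEXT NOT NULL CHECK (kind IN ('measured', 'predicted')),
    source_id     INTEGER NOT NULL REFERENCES source(source_id)
);
"""


def build_demo_db():
    """Create an in-memory database with one measured and one predicted value."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO substance VALUES ('S1', NULL, 'example substance')")
    conn.executemany(
        "INSERT INTO source VALUES (?, ?)",
        [(1, "experimental data file"), (2, "QSAR model")],
    )
    conn.executemany(
        "INSERT INTO property_value "
        "(substance_id, property_name, value, kind, source_id) "
        "VALUES (?, ?, ?, ?, ?)",
        [("S1", "LogKow", 2.1, "measured", 1),
         ("S1", "LogKow", 2.3, "predicted", 2)],
    )
    return conn


def values_for(conn, substance_id, property_name):
    """Return (kind, value, source description) rows, measured before predicted."""
    rows = conn.execute(
        """SELECT pv.kind, pv.value, s.description
           FROM property_value pv
           JOIN source s ON s.source_id = pv.source_id
           WHERE pv.substance_id = ? AND pv.property_name = ?
           ORDER BY pv.kind""",
        (substance_id, property_name),
    )
    return rows.fetchall()
```

Keeping the source as its own table means a prediction can be traced back to a model entry in the same way a measurement is traced back to a data file.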
Curating structure-linked data
• Establish accurate linkage of measured data
to chemical structure
• Published experimental data → chemical IDs
{CAS, name, and/or structure}
– Fix ID errors
– Resolve ID conflicts
– Normalize, clean structures
– Resolve duplicate mappings
Lots of experience from building DSSTox!
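As a rough illustration of the "resolve duplicate mappings" step, the sketch below collapses repeated measurements for the same structure when they agree and sets aside the ones that disagree for manual review. The structure key, the median rule, and the 0.5 tolerance are all assumptions made for this example; the slides do not state how duplicates were actually resolved.

```python
from collections import defaultdict
from statistics import median


def collapse_duplicates(records, tolerance=0.5):
    """Group (structure_key, value) pairs and collapse agreeing duplicates.

    Returns (collapsed, conflicts). `collapsed` maps each key to the median
    of its values when they span no more than `tolerance`. `conflicts` maps
    keys whose values disagree to the sorted values, for manual review.
    """
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)

    collapsed, conflicts = {}, {}
    for key, values in grouped.items():
        if max(values) - min(values) <= tolerance:
            collapsed[key] = median(values)
        else:
            conflicts[key] = sorted(values)
    return collapsed, conflicts
```

For example, two melting points of 10.0 and 55.0 recorded for one structure would land in `conflicts` rather than being averaged into a value that nobody measured.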
EPI Suite™ Property Estimation
Software
• Developed by EPA’s Office of Pollution Prevention & Toxics
(w/ Syracuse Research Corp), published initial models over
20 yrs ago, released desktop app in 2000
• Estimates a wide range of physical and environmental
properties using QSAR approaches:
– KOWWIN™, AOPWIN™, HENRYWIN™, MPBPWIN™, BIOWIN™,
BioHCwin, KOCWIN™
• and incorporates these into dependent fate and toxicity
models:
– WSKOWWIN™, WATERNT™, BCFBAF™, HYDROWIN™, KOAWIN,
AEROWIN™, WVOLWIN™, STPWIN™, LEV3EPI™, ECOSAR™
• Used to fill data gaps in EPA Pre-Manufacture Notification
(PMN) chemical submissions, wide usage outside EPA
EPI Suite™ Property Estimation
Software
• Downloadable Windows application available at:
http://www.epa.gov/tsca-screening-tools/epi-suitetm-estimation-program-interface
• 14 PhysProp experimental property data sets used in training models
can be freely accessed from within app
EPI Suite KOWWIN:
2447 Training Compounds
• 2447 chemicals with
measured Log Kow values
used to train model
• >15K measured Log Kow
values available in
PhysProp file
Overall Predicted vs. Experimental
(>15k chemicals)
• Use EPI Suite model built
on 2447 measured
values to predict >15K
values in PhysProp file
Why derive new models from EPI
Suite PhysProp datasets?
• Significant advances in cheminformatics and
modeling approaches since <2000
• Enlarged training sets likely to improve models
and expand applicability domain
• But first…
Perhaps we should take a look at the >20 yr
old PhysProp datasets that we’ll be using to
rebuild the models!
Review of PhysProp datasets
• Exported SDF files from EPI Suite
application
• Basic manual review searching for errors:
– hypervalency, charge imbalance, undefined stereo
– deduplication
– mismatches between identifiers
CAS Numbers not matching structure
Names not matching structure
Collisions between identifiers
Incorrect valences
Differences in Values (MP)
Two experimental records:
Same structure
Different CAS, name,
experimental & predicted MP
Covalently bound salt structures
Collisions in Records
Same structure depictions (Molfiles) - different
CAS, different names, AND different SMILES
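Collisions of this kind can be found mechanically once each record carries some canonical structure key. The sketch below groups records by such a key and reports every structure that maps to more than one CAS Number, name, or SMILES. The field names and the idea of a precomputed `structure_key` (for instance a hash of a canonical structure) are assumptions for illustration, not the procedure used in the review.

```python
from collections import defaultdict


def find_identifier_collisions(records, key_field="structure_key",
                               id_fields=("casrn", "name", "smiles")):
    """Report structures that map to more than one value of any identifier.

    `records` is an iterable of dicts. The result maps each colliding
    structure key to {identifier field: sorted distinct values}.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for record in records:
        for field in id_fields:
            value = record.get(field)
            if value:  # skip missing or empty identifiers
                seen[record[key_field]][field].add(value)

    collisions = {}
    for key, fields in seen.items():
        clashing = {f: sorted(v) for f, v in fields.items() if len(v) > 1}
        if clashing:
            collisions[key] = clashing
    return collisions
```

A structure with a single CAS Number, name, and SMILES never appears in the result, so the output is a worklist of records needing review.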
Missing or erroneous
identifiers
• Many chemical
names are truncated
• Many chemicals don’t
have CAS Numbers
• Invalid CAS Checksum: 3646
• Invalid names: 555
• Invalid SMILES: 133
• Valence errors: 322 Molfile, 3782 SMILES
• Duplicates check:
– 31 MOLFILE; 626 SMILES; 531 NAMES
• SMILES vs. Molfiles (structure check):
– 1279 differ in stereochemistry
– 362 “covalent halogens”
– 191 differ as tautomers
– 436 are different compounds
Whoa!! Lots of problems…
15,809 chemicals in KOWWIN data file
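The CAS checksum test counted above is simple to reproduce. A CAS Registry Number ends in a check digit equal to the weighted sum of the preceding digits, taken right to left with weights 1, 2, 3, and so on, modulo 10. The sketch below is a minimal validator written for this explanation; it is not the code used for the review, and it checks format and checksum only, not whether the number is actually assigned.

```python
import re

# 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_casrn(casrn):
    """Return True if `casrn` is well formed and its check digit matches."""
    match = _CAS_PATTERN.match(casrn.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight 1 for the digit nearest the check digit, 2 for the next, etc.
    total = sum(int(d) * weight
                for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))
```

For water, 7732-18-5, the sum is 8·1 + 1·2 + 2·3 + 3·4 + 7·5 + 7·6 = 105, and 105 mod 10 = 5, which matches the final digit.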
Quality flags added to data:
1-4 Stars
4 levels of consistency possible:
 Molblock
 SMILES string
 chemical name (based on ACD/Labs dictionary)
 CAS Number (based on a DSSTox lookup)
FLAG: Definition
4 Stars ENHANCED: 4 levels of consistency with stereo information
4 Stars: 4 levels of consistency, stereo ignored
3 Stars Plus: 3 of 4 levels consistent; 4th is a tautomer
3 Stars ENHANCED: 3 levels of consistency with stereo information
3 Stars: 3 levels of consistency, stereo ignored
2 Stars Plus: 2 of 4 levels consistent; 3rd is a tautomer
1 Star: Whatever's left – too many errors to count!
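The flag definitions above can be read as a small decision rule. The sketch below encodes them in the order the table lists them. The three inputs are a hypothetical summary of a record's consistency checks, and the precedence between "Plus" and "ENHANCED" within the 3-star tier is an assumption, since the slide does not say which wins when both apply.

```python
def quality_flag(n_consistent, stereo_consistent=False,
                 odd_one_is_tautomer=False):
    """Map identifier agreement to a star flag, checked in the table's order.

    n_consistent: how many of the 4 representations (Molblock, SMILES,
        chemical name, CAS Number) resolve to the same structure.
    stereo_consistent: True if they also agree with stereo considered.
    odd_one_is_tautomer: True if the next representation differs from the
        consistent ones only as a tautomer.
    """
    if n_consistent == 4:
        return "4 Stars ENHANCED" if stereo_consistent else "4 Stars"
    if n_consistent == 3:
        if odd_one_is_tautomer:
            return "3 Stars Plus"
        return "3 Stars ENHANCED" if stereo_consistent else "3 Stars"
    if n_consistent == 2 and odd_one_is_tautomer:
        return "2 Stars Plus"
    return "1 Star"


def usable_for_modeling(flag):
    """Apply a '3 stars and better' cutoff to a flag string."""
    return flag.startswith(("3 Stars", "4 Stars"))
```

With this cutoff, a "2 Stars Plus" record is excluded from a training set while any 3-star or 4-star variant is kept.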
Auto-curation
KNIME structure-“cleaning” workflow
Indigo
 Combine community approaches to structure processing
 Develop a flexible workflow to be used by EPA and shared publicly
 Process DSSTox files to create “QSAR-ready” structures
Publicly available cheminformatics
toolkits in KNIME:
https://www.knime.org/knime
 Parse SDF, remove fragments
 Explicit hydrogen removed
 Dearomatization
 Removal of chirality info, isotopes and
pseudo-atoms
 Aromatization + add explicit hydrogen
atoms
 Standardize Nitro groups
 Other tautomerization/mesomerization
 Neutralize (when possible)
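To give a feel for one of these steps, "remove fragments", here is a deliberately crude sketch that works on SMILES text alone: it splits on the dot separator and keeps the component with the most atoms. This is not how the KNIME workflow does it. A real workflow would use a cheminformatics toolkit, and the regular expression below only recognizes organic-subset atoms and bracket atoms.

```python
import re

# One match per atom: two-letter halogens first, then organic-subset atoms
# (aliphatic and aromatic), then any bracket atom such as [Na+] or [O-].
_ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|\[[^\]]+\]")


def largest_fragment(smiles):
    """Return the dot-separated SMILES component with the most atoms.

    Ties go to the first component. Raises ValueError on an empty string.
    Components that share ring-closure digits across a dot are actually
    bonded, and this text-only sketch does not handle that case.
    """
    fragments = [f for f in smiles.split(".") if f]
    return max(fragments, key=lambda f: len(_ATOM.findall(f)))
```

Applied to sodium acetate written as `[Na+].CC(=O)[O-]`, this keeps the acetate component; neutralizing the remaining charge is a separate step in the list above.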
Building the Models
• QSAR ready forms for modeling – standardize
tautomers, remove stereochemistry, no salts
• Remove approx. 1800 chemicals due to poor
quality score
• Build using 3 STAR and BETTER chemicals
Rederived model for PhysProp
Log Kow (n=14049)
Weighted kNN model, 5-nearest
neighbors
Training: 11251 chemicals
Test set: 2798 chemicals
5-fold cross-validation:
R²: 0.87
RMSE: 0.67
Open source
PaDEL
descriptors
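For readers unfamiliar with the method, a weighted k-nearest-neighbour regression predicts a property as a distance-weighted average of the values of the k closest training chemicals in descriptor space. The sketch below is a minimal pure-Python version with k = 5 as on the slide. The Euclidean distance, the inverse-distance weights, and the R² definition (1 minus residual over total sum of squares) are assumptions, since the slide does not specify them, and the PaDEL descriptor calculation is not shown.

```python
import math


def knn_predict(train_x, train_y, query, k=5):
    """Distance-weighted k-nearest-neighbour prediction for one query vector."""
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    # An exact descriptor match returns the mean of the matching values,
    # which also avoids dividing by a zero distance.
    exact = [y for d, y in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

In a cross-validation loop, each held-out chemical would be passed to `knn_predict` against the remaining folds, and `rmse` and `r_squared` would then be computed over all held-out predictions.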
Remember the original EPI Suite
outlier scatter?
[Figure panels: EPI Suite Predicted vs. Experimental; EPI Suite vs. kNN Comparison of Cluster]
• The updated PhysProp model improves on the applicability domain (AD) of
the original EPI Suite LogKow model
* 280-compound cluster outside the AD of EPI Suite → no longer outliers in the new model
Updated PhysProp results
• Performed curation and cleaning of all 14 PhysProp measured
property datasets
• Rederived models using modern ML methods & open source
descriptors provide significantly improved prediction accuracy
over older EPI Suite models
• Predictions generated for entire (>700K) DSSTox database,
stored in new property database
• Measured & predicted properties to be surfaced in first release
of iCSS Chemistry Dashboard (April 2016)
 Global statistics are insensitive to local curation improvements in the
>15K training set, BUT local curation is exceedingly important when surfacing
measured properties in support of individual prediction values
iCSS Chemistry Dashboard
Releasing April 2016
• PHASE 1 Delivery – Web interface supporting CSS research
 access to DSSTox content: >700,000 chemicals
 access to experimental property data
• Initial set of models based on reanalysis of cleaned, curated
EPI Suite PHYSPROP datasets – logP, BP, MP, Wsol etc.
iCSS Chemistry Dashboard
Releasing in April 2016
Formerly DSSTox GSID, new unique
public substance ID, essential for
RDF/semantic web applications
Melting Point
iCSS Chemistry Dashboard
Releasing in April 2016
• Original raw source result view
• Users can submit comments
• Links to external resources: EPA, NIH, property predictors
iCSS Chemistry Dashboard
Releasing in April 2016
e.g., ChemAxon’s
“Chemicalize” web-service
e.g., PubChem’s
bioactivity summary
Linkage to External Predictor, e.g.
Chemicalize
Chemicalize.org (powered by ChemAxon)
Atomic
Charges
Molecule
pKa
Working to integrate other EPA
predictors
• ExpoCast
 near & far-field exposure models
• Environmental Fate Simulator (EFS)
 air/soil/water distribution, biotransformation
• CERAPP
 estrogen receptor activity QSAR model
• T.E.S.T. (in progress)
 phys-chem properties & toxicity endpoints
e.g., T.E.S.T.
http://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test
 Downloadable Windows app
 Both phys-chem and tox
endpoints predicted
 Multiple QSAR modeling
methods employed
 Multiple views of data
T.E.S.T. QSAR Model
Prediction Views
Model statistics
Query chemical
Prediction results
Building “NCCT Models”
• Collect “Open Data” for various endpoints of interest to
EPA, from e.g.
 PubChem
 eChemPortal
 Open Data Sources
• & Build QSAR prediction models
 Using modern machine-learning approaches
 Ensuring high quality data as inputs
 Populating database with measured & predicted values
 Making experimental data training sets available
 Allowing real-time predictions (in the future)
Quality review required at all levels of chemical
“Representations”
Structure
Generic
Substance
Test
Sample
Supplier, Lot/Batch,
physical description
Features
Properties
Descriptors, fingerprints, charges,
phys-chem properties, etc.
SMILES
InChI
Chemical Name
CASRN
Increasing data aggregation
In conclusion…
• iCSS Chemistry Dashboard will provide public access to
data & services focused on chemicals of interest to EPA
• Initially serve up results for measured & pre-predicted
physical chemical property data
• Chemical-Data consistency quality flags for DSSTox and
property data add value to public domain data, BUT …
chemical-data linkages require curation/validation!!
• All data and models will be available as OPEN DATA and
OPEN CODE
Acknowledgements
NCCT Contributors to the iCSS Chemistry Dashboard
Jeff Edwards
Jeremy Fitzpatrick
Jordan Foster
Jason Harris
Dave Lyons
Kamel Mansouri
Aria Smith
Jennifer Smith
Indira Thillainadarajah
Chris Grulke
Antony Williams

Weitere ähnliche Inhalte

Was ist angesagt?

International Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology ProblemsInternational Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology ProblemsKamel Mansouri
 
Automated workflows for data curation and standardization of chemical structu...
Automated workflows for data curation and standardization of chemical structu...Automated workflows for data curation and standardization of chemical structu...
Automated workflows for data curation and standardization of chemical structu...Kamel Mansouri
 
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...Kamel Mansouri
 
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Kamel Mansouri
 
Data drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistryData drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistryAnn-Marie Roche
 
Update on Phase 2 of the C4SL Project
Update on Phase 2 of the C4SL ProjectUpdate on Phase 2 of the C4SL Project
Update on Phase 2 of the C4SL ProjectIES / IAQM
 

Was ist angesagt? (20)

The influence of data curation on QSAR Modeling – examining issues of qualit...
 The influence of data curation on QSAR Modeling – examining issues of qualit... The influence of data curation on QSAR Modeling – examining issues of qualit...
The influence of data curation on QSAR Modeling – examining issues of qualit...
 
Structure Identification Using High Resolution Mass Spectrometry Data and the...
Structure Identification Using High Resolution Mass Spectrometry Data and the...Structure Identification Using High Resolution Mass Spectrometry Data and the...
Structure Identification Using High Resolution Mass Spectrometry Data and the...
 
The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
The EPA Online Prediction Physicochemical Prediction Platform to Support Envi...
 
Open PHACTS Chemistry Platform Update and Learnings
Open PHACTS Chemistry Platform Update and Learnings Open PHACTS Chemistry Platform Update and Learnings
Open PHACTS Chemistry Platform Update and Learnings
 
International Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology ProblemsInternational Computational Collaborations to Solve Toxicology Problems
International Computational Collaborations to Solve Toxicology Problems
 
Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...
 
Automated workflows for data curation and standardization of chemical structu...
Automated workflows for data curation and standardization of chemical structu...Automated workflows for data curation and standardization of chemical structu...
Automated workflows for data curation and standardization of chemical structu...
 
Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...Delivering The Benefits of Chemical-Biological Integration in Computational T...
Delivering The Benefits of Chemical-Biological Integration in Computational T...
 
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
Progress in Using Big Data in Chemical Toxicity Research at the National Cent...
 
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
CERAPP - Collaborative Estrogen Receptor Activity Prediction Project. Computa...
 
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...Virtual screening of chemicals for endocrine disrupting activity: Case studie...
Virtual screening of chemicals for endocrine disrupting activity: Case studie...
 
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
Exploiting enhanced non-testing approaches to meet the needs for sustainable ...
 
The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...
 
New Approach Methods - What is That?
New Approach Methods - What is That?New Approach Methods - What is That?
New Approach Methods - What is That?
 
Data drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistryData drivenapproach to medicinalchemistry
Data drivenapproach to medicinalchemistry
 
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
 
How to place your research questions or results into the context of the "Lega...
How to place your research questions or results into the context of the "Lega...How to place your research questions or results into the context of the "Lega...
How to place your research questions or results into the context of the "Lega...
 
Organic chemist
Organic chemistOrganic chemist
Organic chemist
 
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Env...
 
Update on Phase 2 of the C4SL Project
Update on Phase 2 of the C4SL ProjectUpdate on Phase 2 of the C4SL Project
Update on Phase 2 of the C4SL Project
 

Ähnlich wie The influence of data curation on QSAR Modeling – Presented at American Chemical Society, San Diego, CA, March 13 - 17, 2016.

Using open bioactivity data for developing machine-learning prediction models...
Using open bioactivity data for developing machine-learning prediction models...Using open bioactivity data for developing machine-learning prediction models...
Using open bioactivity data for developing machine-learning prediction models...Sunghwan Kim
 
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...Kamel Mansouri
 
Development and sharing of ADME/Tox and Drug Discovery Machine learning models
Development and sharing of ADME/Tox and Drug Discovery Machine learning modelsDevelopment and sharing of ADME/Tox and Drug Discovery Machine learning models
Development and sharing of ADME/Tox and Drug Discovery Machine learning modelsSean Ekins
 
Validation of homology modeling
Validation of homology modelingValidation of homology modeling
Validation of homology modelingAlichy Sowmya
 
Reaxys rmc unified platform_ webinar_
Reaxys rmc unified platform_ webinar_Reaxys rmc unified platform_ webinar_
Reaxys rmc unified platform_ webinar_Ann-Marie Roche
 
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsMining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsSean Ekins
 

Ähnlich wie The influence of data curation on QSAR Modeling – Presented at American Chemical Society, San Diego, CA, March 13 - 17, 2016. (20)

Prediction of pKa from chemical structure using free and open source tools
Prediction of pKa from chemical structure using free and open source toolsPrediction of pKa from chemical structure using free and open source tools
Prediction of pKa from chemical structure using free and open source tools
 
Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...
 
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
Introduction to Cheminformatics: Accessing data through the CompTox Chemicals...
 
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted AnalysisThe US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
 
Using open bioactivity data for developing machine-learning prediction models...
Using open bioactivity data for developing machine-learning prediction models...Using open bioactivity data for developing machine-learning prediction models...
Using open bioactivity data for developing machine-learning prediction models...
 
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
 
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
Accessing Environmental Chemistry Data via Data Dashboards and Applications t...
 
Applications of the US EPA’s CompTox chemicals dashboard to support structure...
Applications of the US EPA’s CompTox chemicals dashboard to support structure...Applications of the US EPA’s CompTox chemicals dashboard to support structure...
Applications of the US EPA’s CompTox chemicals dashboard to support structure...
 
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
The US-EPA CompTox Chemicals Dashboard – a key player in the domain of Open S...
 
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
 
Development and sharing of ADME/Tox and Drug Discovery Machine learning models
Development and sharing of ADME/Tox and Drug Discovery Machine learning modelsDevelopment and sharing of ADME/Tox and Drug Discovery Machine learning models
Development and sharing of ADME/Tox and Drug Discovery Machine learning models
 
Cheminformatics approaches to support chemical identification delivered via t...
Cheminformatics approaches to support chemical identification delivered via t...Cheminformatics approaches to support chemical identification delivered via t...
Cheminformatics approaches to support chemical identification delivered via t...
 
Structure identification approaches using the EPA CompTox Chemicals Dashboard...
Structure identification approaches using the EPA CompTox Chemicals Dashboard...Structure identification approaches using the EPA CompTox Chemicals Dashboard...
Structure identification approaches using the EPA CompTox Chemicals Dashboard...
 
Consensus ranking and fragmentation prediction for identification of unknowns...
Consensus ranking and fragmentation prediction for identification of unknowns...Consensus ranking and fragmentation prediction for identification of unknowns...
Consensus ranking and fragmentation prediction for identification of unknowns...
 
Delivering web-based access to data and algorithms to support computational t...
Delivering web-based access to data and algorithms to support computational t...Delivering web-based access to data and algorithms to support computational t...
Delivering web-based access to data and algorithms to support computational t...
 
Validation of homology modeling
Validation of homology modelingValidation of homology modeling
Validation of homology modeling
 
Chemical identification of unknowns in high resolution mass spectrometry usin...
Chemical identification of unknowns in high resolution mass spectrometry usin...Chemical identification of unknowns in high resolution mass spectrometry usin...
Chemical identification of unknowns in high resolution mass spectrometry usin...
 
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
 
Reaxys rmc unified platform_ webinar_
Reaxys rmc unified platform_ webinar_Reaxys rmc unified platform_ webinar_
Reaxys rmc unified platform_ webinar_
 
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning ModelsMining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
Mining 'Bigger' Datasets to Create, Validate and Share Machine Learning Models
 

Mehr von Kamel Mansouri

Chemical prioritization using in silico modeling. SOT 2018 (San Antonio, USA)
Chemical prioritization using in silico modeling. SOT 2018 (San Antonio, USA)Chemical prioritization using in silico modeling. SOT 2018 (San Antonio, USA)
Chemical prioritization using in silico modeling. SOT 2018 (San Antonio, USA)Kamel Mansouri
 
Scoring and ranking of metabolic trees to computationally prioritize chemical...
Scoring and ranking of metabolic trees to computationally prioritize chemical...Scoring and ranking of metabolic trees to computationally prioritize chemical...
Scoring and ranking of metabolic trees to computationally prioritize chemical...Kamel Mansouri
 
CoMPARA: Collaborative Modeling Project for Androgen Receptor Activity
CoMPARA: Collaborative Modeling Project for Androgen Receptor ActivityCoMPARA: Collaborative Modeling Project for Androgen Receptor Activity
CoMPARA: Collaborative Modeling Project for Androgen Receptor ActivityKamel Mansouri
 
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...
Consensus Models to Predict Endocrine Disruption for All Human-Exposure Chemi...Kamel Mansouri
 
QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemo...
QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemo...QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemo...
QSAR STUDY ON READY BIODEGRADABILITY OF CHEMICALS. Presented at the 3rd Chemo...Kamel Mansouri
 
In-silico study of ToxCast GPCR assays by quantitative structure-activity rel...
In-silico study of ToxCast GPCR assays by quantitative structure-activity rel...In-silico study of ToxCast GPCR assays by quantitative structure-activity rel...
In-silico study of ToxCast GPCR assays by quantitative structure-activity rel...Kamel Mansouri
 
In-silico structure activity relationship study of toxicity endpoints by QSAR...
The influence of data curation on QSAR Modeling – Presented at American Chemical Society, San Diego, CA, March 13 - 17, 2016.

  • 1. Kamel Mansouri, Chris Grulke, Ann Richard*, and Antony J. Williams. National Center for Computational Toxicology, US EPA, RTP, NC. The influence of data curation on QSAR Modeling – examining issues of quality versus quantity of data. March 2016, ACS Spring Meeting, San Diego, CA. This work was reviewed by EPA and approved for presentation but does not necessarily reflect official Agency policy.
  • 2. Sustainable Chemistry: Human Health Hazard, Ecosystems Hazard, Environmental Persistence. “Design and use of chemicals that minimize impacts to human health, ecosystems and the environment”
      • Screen chemicals for toxicity & exposure potential
      • Prioritize testing of thousands of existing chemicals lacking data
      • Evaluate how chemical transformations impact hazard & environmental persistence
  • 3. DSSTox Chemical IDs: Chemistry Foundation to Support Multiple NCCT & EPA Projects. EPA SRS (TSCA, EcoTox); EDSP21; ToxCast/Tox21; CPCAT; ToxRef; 1:1 mapping CAS:Name:structure; quality scores; >700K substances; “Chemical Safety & Sustainability” (CSS) public web-accessible dashboards; knowledge-informed features & chemotypes; ToxCast/Tox21 high-throughput screening (HTS) projects; CAS look-up; list curation tools; chemical search by CAS-names, SMILES, structures; structure file downloads; PK/ADME model inputs & outputs; structure “cleaning”; SAR-ready structure files; structure-based predictions; similarity searching; fingerprints, feature sets; QSAR models; phys-chem properties; phys-chem calculated & measured properties
  • 4. DSSTox Update
      DSSTox_v1:
      • Manually curated 25K substance records
      • EPA-focus, environmental tox datasets
      • Emphasis on accurate CAS-name-structure annotations
      • Public resource for high-quality structure-data files (SDF)
      DSSTox_v2:
      • Convert DSSTox tables to MySQL
      • Develop curation interface & cheminformatics workflow
      • Expand chemical content
      • Web-services & Dashboard access
  • 5. Building DSSTox_v2. DSSTox_v1: ~21K, ~5K; 1:1:1 CAS:name:structure, no conflicts allowed; environmental/toxicity/exposure relevance; manually curated. Sources: EPA SRS, ChemID, PubChem, EPA ACToR, EPA-relevant lists. Counts shown in diagram: ~735K, ~150K, >1M, ~54K, ~310K, ~625K, ~5K, ~16K, ~33K, ~101K. DSSTox QC Levels: Public_Med, Public_High, DSSTox_High, DSSTox_Low, Public_Low, Public_Untrusted (not loaded)
  • 6. DSSTox_v2 Totals DSSTox_v2 PublicCurated 5. Low 2. Low 6. Untrusted 1. High 3. High 4. Med 7. Incomplete QC Level Totals (12Jun2015) - highest confidence Public_Untrusted Public_Low Public_Medium Public_High DSSTox_Low DSSTox_High 4535 16K 33K 101K 584K ~ 310K pending ~ 150K pending ~ 150K substances in top 4 QC quality bins (single source)
  • 7. Building the Chemistry Software Architecture: External Services; APIs and Web Services; Data Layer; Chemical-Biology Platforms; User Interfaces; UI Components
  • 8. Building Predictive Models
      NCCT Modeling Activities:
      • Receptor-mediated activities (e.g., ER, AR)
      • Toxicity endpoint QSAR models
      • Metabolism prediction
      • In vitro-in vivo toxicity extrapolation models (IVIVE)
      • Read-across toxicity estimation
      • Near and far-field exposure models
      EPA-NCCT Collaborations:
      • Analytical dust and media screening for environmental contaminants
      • Biotransformation, transport, fate modeling
      • ADME, PBPK models
      • Chemical use category modeling
      • Green chemistry
      Physical chemical properties: water solubility; LogP; partition coefficients (air/water/soil); MP, BP, vapor pressure
  • 9. Building a physical chemical property database. Physical chemical properties: measured (where available), predicted (where not available)
      • Collect data (sort wheat from chaff)
      • Determine how to best mesh data together (structure? CAS? Other?)
      • Check data self-consistency
      • Structure validation vs. property value validation (very different challenges)
      • Build training/test sets of chemicals with experimental values (collapse or resolve dups)
      • Apply structure-cleaning workflow to produce “QSAR-ready” structures
      • Calculate descriptors
      • Apply modeling algorithm (regression, ML, etc.)
      Store measured & predicted values in relational database, linked to data sources, model details, etc.
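The "collapse or resolve dups" step on slide 9 can be illustrated with a minimal Python sketch. The slides do not say how duplicates were handled, so everything here is an assumption: the function name `collapse_duplicates`, the use of the median, and the `max_spread` threshold are all hypothetical choices made only for illustration.

```python
from collections import defaultdict
from statistics import median


def collapse_duplicates(records, max_spread=0.5):
    """Group measured values by structure key and collapse close duplicates.

    records: iterable of (structure_key, measured_value) pairs.
    max_spread: hypothetical tolerance; groups whose values differ by more
    than this are reported as conflicts instead of being averaged away.
    """
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)

    collapsed, conflicts = {}, {}
    for key, values in grouped.items():
        if max(values) - min(values) <= max_spread:
            # Values agree within tolerance: keep one representative value.
            collapsed[key] = median(values)
        else:
            # Values disagree: leave for manual resolution.
            conflicts[key] = values
    return collapsed, conflicts
```

Separating agreeing duplicates from conflicting ones keeps the automated step conservative: only records that disagree need a curator's attention.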
  • 10. Curating structure-linked data. Lots of experience from building DSSTox!
      • Establish accurate linkage of measured data to chemical structure
      • Published experimental data → chemical IDs {CAS, name, and/or structure}
          – Fix ID errors
          – Resolve ID conflicts
          – Normalize, clean structures
          – Resolve duplicate mappings
  • 11. EPI Suite™ Property Estimation Software
      • Developed by EPA’s Office of Pollution Prevention & Toxics (with Syracuse Research Corp); initial models published over 20 years ago; desktop app released in 2000
      • Estimates a wide range of physical and environmental properties using QSAR approaches:
          – KOWWIN™, AOPWIN™, HENRYWIN™, MPBPWIN™, BIOWIN™, BioHCwin, KOCWIN™
      • Incorporates these into dependent fate and toxicity models:
          – WSKOWWIN™, WATERNT™, BCFBAF™, HYDROWIN™, KOAWIN, AEROWIN™, WVOLWIN™, STPWIN™, LEV3EPI™, ECOSAR™
      • Used to fill data gaps in EPA Pre-Manufacture Notification (PMN) chemical submissions; widely used outside EPA
  • 12. EPI Suite™ Property Estimation Software
      • Downloadable Windows application available at: http://www.epa.gov/tsca-screening-tools/epi-suitetm-estimation-program-interface
      • 14 PhysProp experimental property data sets used in training models can be freely accessed from within the app
  • 13. EPI Suite KOWWIN: 2447 Training Compounds • 2447 chemicals with measured Log Kow values used to train model • >15K measured Log Kow values available in PhysProp file
  • 14. Overall Predicted vs. Experimental (>15k chemicals) • Use EPI Suite model built on 2447 measured values to predict >15K values in PhysProp file
  • 15. Why derive new models from EPI Suite PhysProp datasets? • Significant advances in cheminformatics and modeling approaches since <2000 • Enlarged training sets likely to improve models and expand applicability domain • But first… Perhaps we should take a look at the >20 yr old PhysProp datasets that we’ll be using to rebuild the models!
  • 16. Review of PhysProp datasets
      • Exported SDF files from EPI Suite application
      • Basic manual review searching for errors:
          – hypervalency, charge imbalance, undefined stereo
          – deduplication
          – mismatches between identifiers: CAS Numbers not matching structure; names not matching structure; collisions between identifiers
  • 18. Differences in Values (MP) Two experimental records: Same structure Different CAS, name, experimental & predicted MP
  • 19. Covalently bound salt structures
  • 20. Collisions in Records Same structure depictions (Molfiles) - different CAS, different names, AND different SMILES
  • 21. Missing or erroneous identifiers • Many chemical names are truncated • Many chemicals don’t have CAS Numbers
  • 22. 15,809 chemicals in KOWWIN data file. Whoa!! Lots of problems…
      • Invalid CAS checksum: 3646
      • Invalid names: 555
      • Invalid SMILES: 133
      • Valence errors: 322 Molfile, 3782 SMILES
      • Duplicates check:
          – 31 Molfile; 626 SMILES; 531 names
      • SMILES vs. Molfiles (structure check):
          – 1279 differ in stereochemistry
          – 362 “covalent halogens”
          – 191 differ as tautomers
          – 436 are different compounds
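Slide 22 counts records with an invalid CAS checksum. The slides do not show how the check was run; the sketch below simply applies the standard CAS Registry Number check-digit rule (each digit before the check digit is weighted by its position counted from the right, and the sum modulo 10 must equal the check digit). The function name `is_valid_cas` is a hypothetical choice.

```python
import re

# CAS Registry Numbers look like NNNNNNN-NN-N: 2 to 7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Return True if the string is well-formed and its check digit matches."""
    match = CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight digits 1, 2, 3, ... starting from the rightmost digit of the body.
    total = sum(int(d) * weight for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check_digit
```

For example, water (7732-18-5) passes because 8·1 + 1·2 + 2·3 + 3·4 + 7·5 + 7·6 = 105, and 105 mod 10 = 5. A passing checksum only shows the number is internally consistent; it does not show that the number is assigned to the structure in the record.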
  • 23. Quality flags added to data: 1-4 Stars (auto-curation)
      4 levels of consistency possible: Molblock; SMILES string; chemical name (based on ACD/Labs dictionary); CAS Number (based on a DSSTox lookup)
      FLAG               Definition
      4 Stars ENHANCED   4 levels of consistency, with stereo information
      4 Stars            4 levels of consistency, stereo ignored
      3 Stars Plus       3 of 4 levels consistent; 4th is a tautomer
      3 Stars ENHANCED   3 levels of consistency, with stereo information
      3 Stars            3 levels of consistency, stereo ignored
      2 Stars Plus       2 of 4 levels consistent; 3rd is a tautomer
      1 Star             Whatever's left – too many errors to count!
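The flag table on slide 23 can be read as a small decision rule. The sketch below is one possible reading, not the authors' auto-curation code: the function name, its three inputs, and the precedence between "Plus" and "ENHANCED" when both could apply are assumptions.

```python
def quality_flag(consistent_levels, stereo_consistent=False, tautomer_mismatch=False):
    """Map consistency checks to a star flag, following the slide's table.

    consistent_levels: how many of the 4 representations (Molblock, SMILES,
        name, CAS) agree with one another.
    stereo_consistent: True if they also agree on stereochemistry.
    tautomer_mismatch: True if the next representation differs only as a tautomer.
    """
    if consistent_levels == 4:
        return "4 Stars ENHANCED" if stereo_consistent else "4 Stars"
    if consistent_levels == 3:
        if tautomer_mismatch:
            # Assumed precedence: the tautomer case is checked first.
            return "3 Stars Plus"
        return "3 Stars ENHANCED" if stereo_consistent else "3 Stars"
    if consistent_levels == 2 and tautomer_mismatch:
        return "2 Stars Plus"
    # Everything else falls into the catch-all bin.
    return "1 Star"
```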
  • 24. KNIME structure-“cleaning” workflow (Indigo)
      • Combine community approaches to structure processing
      • Develop a flexible workflow to be used by EPA and shared publicly
      • Process DSSTox files to create “QSAR-ready” structures
      Publicly available cheminformatics toolkits in KNIME: https://www.knime.org/knime
      Workflow steps:
      • Parse SDF, remove fragments
      • Explicit hydrogen removed
      • Dearomatization
      • Removal of chirality info, isotopes and pseudo-atoms
      • Aromatization + add explicit hydrogen atoms
      • Standardize nitro groups
      • Other tautomerize/mesomerization
      • Neutralize (when possible)
  • 25. Building the Models • QSAR ready forms for modeling – standardize tautomers, remove stereochemistry, no salts • Remove approx. 1800 chemicals due to poor quality score • Build using 3 STAR and BETTER chemicals
  • 26. Rederived model for PhysProp Log Kow (n=14049). Weighted kNN model, 5 nearest neighbors. Training set: 11251 chemicals; test set: 2798 chemicals. 5-fold cross-validation: R² = 0.87, RMSE = 0.67. Open source PaDEL descriptors.
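Slide 26 reports a weighted kNN model with 5 nearest neighbors, evaluated by R² and RMSE. The slides do not give the weighting scheme or distance metric, so the sketch below assumes inverse-distance weights and Euclidean distance over descriptor vectors; all function names are hypothetical, and this is not the authors' implementation.

```python
import math


def weighted_knn_predict(train_X, train_y, query, k=5):
    """Predict a value as the inverse-distance-weighted mean of the k nearest neighbors."""
    # Keep the k closest (distance, value) pairs under Euclidean distance.
    neighbors = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    # If the query coincides with training points, average those directly
    # to avoid dividing by zero.
    exact = [y for d, y in neighbors if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

In a real workflow the descriptor vectors would come from a descriptor calculator such as PaDEL and would typically be scaled before distances are computed; that step is omitted here.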
  • 27. Remember the original EPI Suite outlier scatter? [Plots: EPI Suite predicted vs. experimental; EPI Suite vs. kNN comparison of cluster]
      • Applicability domain (AD) of original EPI Suite Log Kow model improved in updated PhysProp model
      * 280-compound cluster outside AD of EPI Suite → no longer outliers in new model
  • 28. Updated PhysProp results
      • Performed curation and cleaning of all 14 PhysProp measured property datasets
      • Rederived models using modern ML methods & open source descriptors provide significantly improved prediction accuracy over older EPI Suite models
      • Predictions generated for entire (>700K) DSSTox database, stored in new property database
      • Measured & predicted properties to be surfaced in first release of iCSS Chemistry Dashboard (April 2016)
      • Global statistics insensitive to local curation improvements in >15K training set, BUT exceedingly important when surfacing measured properties in support of individual prediction values
  • 29. iCSS Chemistry Dashboard Releasing April 2016
      • PHASE 1 Delivery – Web interface supporting CSS research
          – access to DSSTox content: >700,000 chemicals
          – access to experimental property data
      • Initial set of models based on reanalysis of cleaned, curated EPI Suite PHYSPROP datasets – logP, BP, MP, Wsol etc.
  • 30. iCSS Chemistry Dashboard Releasing in April 2016. Formerly DSSTox GSID, new unique public substance ID, essential for RDF/semantic web applications. Melting Point
  • 31. iCSS Chemistry Dashboard Releasing in April 2016 • Original raw source result view • Users can submit comments
  • 32. iCSS Chemistry Dashboard Releasing in April 2016
      • Links to external resources: EPA, NIH, property predictors, e.g., ChemAxon’s “Chemicalize” web-service and PubChem’s bioactivity summary
  • 33. Linkage to External Predictor, e.g. Chemicalize: Chemicalize.org (powered by ChemAxon); Atomic Charges; Molecule pKa
  • 34. Working to integrate other EPA predictors
      • ExpoCast → near & far-field exposure models
      • Environmental Fate Simulator (EFS) → air/soil/water distribution, biotransformation
      • CERAPP → estrogen receptor activity QSAR model
      • T.E.S.T (in progress) → phys-chem properties & toxicity endpoints
  • 35. e.g., T.E.S.T.: http://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test
      • Downloadable Windows app
      • Both phys-chem and tox endpoints predicted
      • Multiple QSAR modeling methods employed
      • Multiple views of data
  • 36. T.E.S.T. QSAR Model Prediction Views: model statistics; query chemical; prediction results
  • 37. Building “NCCT Models”
      • Collect “Open Data” for various endpoints of interest to EPA, from e.g.
          – PubChem
          – eChemPortal
          – Open Data Sources
      • & Build QSAR prediction models
          – Using modern machine-learning approaches
          – Ensuring high quality data as inputs
          – Populating database with measured & predicted values
          – Making experimental data training sets available
          – Allowing real-time predictions (in the future)
  • 38. Quality review required at all levels of chemical “representations” (increasing data aggregation): Structure; Generic Substance; Test Sample (supplier, lot/batch, physical description); Features, Properties (descriptors, fingerprints, charges, phys-chem properties, etc.). Identifiers: SMILES, InChI, chemical name, CASRN
  • 39. In conclusion… • iCSS Chemistry Dashboard will provide public access to data & services focused on chemicals of interest to EPA • Initially serve up results for measured & pre-predicted physical chemical property data • Chemical-Data consistency quality flags for DSSTox and property data add value to public domain data, BUT … chemical-data linkages require curation/validation!! • All data and models will be available as OPEN DATA and OPEN CODE
  • 40. Acknowledgements. NCCT Contributors to the iCSS Chemistry Dashboard: Jeff Edwards, Jeremy Fitzpatrick, Jordan Foster, Jason Harris, Dave Lyons, Kamel Mansouri, Aria Smith, Jennifer Smith, Indira Thillainadarajah, Chris Grulke, Antony Williams