SlideShare ist ein Scribd-Unternehmen logo
1 von 1
Innovative Research for a Sustainable Futurewww.epa.gov/research
ORCID: 0000-0002-2668-4821
Antony Williams l williams.antony@epa.gov l 919-541-1033
The EPA Online Database of Experimental and Predicted Data to Support Environmental Scientists
Antony Williams1,*, Christopher Grulke1, Kamel Mansouri2, Jennifer Smith1, Jordan Foster1, Michelle Krzyzanowski and Jeff Edwards1
1. U.S. Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology (NCCT), Research Triangle Park, NC
2. Oak Ridge Institute for Science and Education (ORISE) Participant, Research Triangle Park, NC
Problem Definition and Goals
Source Data and Example Errors
Automated Analysis Using KNIME
Accessing ~1 Million Predicted Properties Online
Future Work
As part of our efforts to develop a public platform to provide access to experimental data and predictive models
for physicochemical data we have delivered a web-based database, the EPA Chemistry Dashboard
(https://comptox.epa.gov). In order to develop the prediction models we used publicly available data and a
combination of manual and automated review processes to produce highly curated datasets. The automated
curation processes for the validation of the data used a KNIME workflow and included approaches to validate
different chemical structure representations (e.g. molfile and SMILES), identifiers (chemical names and registry
numbers), and methods to standardize the data into QSAR-consumable formats for modeling. Machine-learning
approaches were applied to create a series of models that have been used to generate predicted
physicochemical and environmental parameters for over 700,000 chemicals. The data have been made available
online via the EPA’s iCSS Chemistry Dashboard and includes detailed model reports regarding whether or not a
chemical is contained within a particular applicability domain, provides access to nearest neighbors within the
training set as well as detailed QMRF (QSAR Modeling Report Format) reports for each of the models.
The iCSS Chemistry Dashboard (ICD) is a public EPA-hosted web application providing access to over 700,000
chemicals from EPA’s DSSTox database5. It integrates experimental and predicted data and is a hub to other
NCCT apps and web-based resources. The 3 and 4 STAR curated experimental data are accessible via the ICD
application. All chemicals were also passed through the prediction models and detailed model reports and results
are freely available at http://comptox.epa.gov/dashboard.
The manual investigation of the data allowed us to develop a KNIME2 workflow for automated
processing. This workflow was derived from earlier work by Mansouri et. al.3 and is represented in the
figure below as a series of blocks representing, for example:
The KNIME workflow for automated processing of PHYSPROP data
The KNIME workflow was used to insert various levels of Quality Flags indicating consistency between
chemical structure formats and identifiers. The consistency flag definitions and distribution are
summarized below for the >15k chemicals.
Predictive models were developed using only the 3 and 4 star curated data.
• The PHYSPROP data were sourced online1 as SDF files. A total of 13 endpoints were represented, including
LogKow, water solubility, melting point, boiling point, and others. The largest dataset, LogKow, contained over
15,800 individual chemicals. Each data point included the Mol-Block, SMILES, CASRN, Name, LogKow value
and, where available, a reference. This dataset was chosen as representative for examining the quality of
data.
• A manual examination of the data revealed a number of issues: e.g., SMILES and Mol-Block did not agree;
CASRN did not match the correct structure; SMILES could not be converted; a single chemical structure
would be listed multiple times with different property values. Example errors across various PHYSPROP
datasets are shown below.
References
1. PHYSPROP Data: http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm
2. KNIME: https://www.knime.org/
3. Mansouri et al. CERAPP: Collaborative Estrogen Receptor Activity Prediction Project, Environ Health
Perspect; DOI:10.1289/ehp.1510267
4. PaDEL descriptors, http://padel.nus.edu.sg/software/padeldescriptor/
5. EPA Distributed Structure-Searchable Toxicity (DSSTox) Database, http://www.epa.gov/chemical-
research/distributed-structure-searchable-toxicity-dsstox-database
6. EPA Toxicity Estimation Software Tool (T.E.S.T.) software http://www.epa.gov/chemical-research/toxicity-
estimation-software-tool-test
Preparing data for QSAR Modeling • Release NCCT models as interactive online prediction tools in the near future via the ICD.
• Integrate the suite of EPA T.E.S.T6 physchem and toxicity prediction models to expand the collection of
available models.
• Source additional data to expand the training sets underpinning the prediction algorithms.
4 STAR ENHANCED: Name/CASRN/Mol/SMILES added Stereo: 550
4 STAR: 3 of 4 Name/CASRN/Mol/SMILES: 5967
3 STAR ENHANCED: 3 of 4 Name/CASRN/Mol/SMILES added Stereo: 177
3 STAR: 3 of 4 Name/CASRN/Mol/SMILES: 7910
2 STAR PLUS: 2 of 4 Name/CASRN/Mol/SMILES/Tautomer: 133
2 STAR: 2 of 4 Name/CASRN/Mol/SMILES: 1003
1 STAR: No two fields consistent 379
2627/P125
American Chemical Society Fall Meeting
Philadelphia, PA.
August 21-25, 2016
• Check names against dictionary (555 invalid)
• Assign Quality flags based on consistency
among data fields
Examples of hypervalency in LogKow dataset
Equivalent structures but with different CASRN, names and
values in MP dataset: -4 and + 70oC
Equivalent structures, different CASRN, names, and
values in MP dataset
This poster does not necessarily reflect EPA policy. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.
For the purposes of QSAR modeling, the 3 and 4 STAR datasets were
processed through a KNIME workflow. This processing removed
inorganics and mixtures, processed salts into neutral forms (except for
melting point data), normalized tautomers, and removed duplicates.
The resulting “QSAR-ready” file(s) were modeled using Genetic
Algorithm-Partial Least Squares with 5-fold cross validation and
utilizing 2D PaDEL4 molecular descriptors. Multiple modeling runs
(100) produced the best models using a minimum number of
descriptors. The models for all 13 endpoints are available as both
Windows and Linux executable binaries and as a C++ library that can
be called by a separate application.
Model Performance
The curation of the available data, utilization of large datasets for training and validation, and application of
innovative machine-learning approaches produced high performance, yet simple models with a minimum number
of descriptors. The logP model was built on a dataset of more than 14k chemicals using only 10 descriptors. With
a wider applicability domain this model outperformed EPI Suite’s predictions (left side plot) especially for the
over-estimated cluster extracted and plotted together with the new model’s predictions (right side plot).
Different structures in Mol-Block and SMILES
Statistics of the new model
5-fold cross-validation:
Q2: 0.87 RMSE: 0.67
Fitting 11247:
R2: 0.87 RMSE: 0.66
Test set prediction :
R2: 0.84 RMSE: 0.65
• Compare Mol-Block and SMILES (2268 different)
• Check for duplicates (657 structures, 531 names)
• Check CASRN Numbers (3646 invalid CASRN)
Abstract
Problem: There is limited access to freely available high quality experimental and predicted data associated with
chemical structures, and their associated substances, available online.
Goals: To deliver access to curated datasets of experimental and physicochemical data associated with
chemicals of interest to environmental scientists. Make the data available as downloadable Open data. To
develop predictive models from the curated data and use to predict properties for >700,000 chemicals and make
available online. To provide details regarding the performance of the models.
A summary of available chemical properties
for Bisphenol A
The model report for the melting point prediction
for Bisphenol A – including nearest neighbors.
EPI Suite’s predictions. EPI Suite’s (red) Vs NCCT model’s predictions (black)

Weitere ähnliche Inhalte

Was ist angesagt?

OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELS
OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELSOPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELS
OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELSKamel Mansouri
 
Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...Kamel Mansouri
 

Was ist angesagt? (20)

The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...The needs for chemistry standards, database tools and data curation at the ch...
The needs for chemistry standards, database tools and data curation at the ch...
 
An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...
 
Open PHACTS Chemistry Platform Update and Learnings
Open PHACTS Chemistry Platform Update and Learnings Open PHACTS Chemistry Platform Update and Learnings
Open PHACTS Chemistry Platform Update and Learnings
 
Chemical identification of unknowns in high resolution mass spectrometry usin...
Chemical identification of unknowns in high resolution mass spectrometry usin...Chemical identification of unknowns in high resolution mass spectrometry usin...
Chemical identification of unknowns in high resolution mass spectrometry usin...
 
OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELS
OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELSOPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELS
OPERA, AN OPEN SOURCE AND OPEN DATA SUITE OF QSAR MODELS
 
Structure identification by Mass Spectrometry Non-Targeted Analysis using the...
Structure identification by Mass Spectrometry Non-Targeted Analysis using the...Structure identification by Mass Spectrometry Non-Targeted Analysis using the...
Structure identification by Mass Spectrometry Non-Targeted Analysis using the...
 
New developments in delivering public access to data from the National Center...
New developments in delivering public access to data from the National Center...New developments in delivering public access to data from the National Center...
New developments in delivering public access to data from the National Center...
 
Accessing information for chemicals in hydraulic fracturing fluids using the ...
Accessing information for chemicals in hydraulic fracturing fluids using the ...Accessing information for chemicals in hydraulic fracturing fluids using the ...
Accessing information for chemicals in hydraulic fracturing fluids using the ...
 
US EPA CompTox Chemistry Dashboard as a source of data to fill data gaps for ...
US EPA CompTox Chemistry Dashboard as a source of data to fill data gaps for ...US EPA CompTox Chemistry Dashboard as a source of data to fill data gaps for ...
US EPA CompTox Chemistry Dashboard as a source of data to fill data gaps for ...
 
Non-targeted analysis supported by data and cheminformatics delivered via the...
Non-targeted analysis supported by data and cheminformatics delivered via the...Non-targeted analysis supported by data and cheminformatics delivered via the...
Non-targeted analysis supported by data and cheminformatics delivered via the...
 
Structure identification approaches using the EPA CompTox Chemicals Dashboard...
Structure identification approaches using the EPA CompTox Chemicals Dashboard...Structure identification approaches using the EPA CompTox Chemicals Dashboard...
Structure identification approaches using the EPA CompTox Chemicals Dashboard...
 
Adding complex expert knowledge into chemical database and transforming surfa...
Adding complex expert knowledge into chemical database and transforming surfa...Adding complex expert knowledge into chemical database and transforming surfa...
Adding complex expert knowledge into chemical database and transforming surfa...
 
Development of a Tool for Systematic Integration of Traditional and New Appro...
Development of a Tool for Systematic Integration of Traditional and New Appro...Development of a Tool for Systematic Integration of Traditional and New Appro...
Development of a Tool for Systematic Integration of Traditional and New Appro...
 
Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...Free online access to experimental and predicted chemical properties through ...
Free online access to experimental and predicted chemical properties through ...
 
Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases? Does bigger mean better in the world of chemistry databases?
Does bigger mean better in the world of chemistry databases?
 
US-EPA CompTox Chemicals Dashboard – integrating chemistry and biology data t...
US-EPA CompTox Chemicals Dashboard – integrating chemistry and biology data t...US-EPA CompTox Chemicals Dashboard – integrating chemistry and biology data t...
US-EPA CompTox Chemicals Dashboard – integrating chemistry and biology data t...
 
Web-based access to data for >600 disinfection by-products via the EPA CompTo...
Web-based access to data for >600 disinfection by-products via the EPA CompTo...Web-based access to data for >600 disinfection by-products via the EPA CompTo...
Web-based access to data for >600 disinfection by-products via the EPA CompTo...
 
Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...
Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...
Using the US EPA’s CompTox Chemistry Dashboard for structure identification a...
 
New Approach Methods - What is That?
New Approach Methods - What is That?New Approach Methods - What is That?
New Approach Methods - What is That?
 
US-EPA CompTox Chemicals Dashboard providing access to experimental and predi...
US-EPA CompTox Chemicals Dashboard providing access to experimental and predi...US-EPA CompTox Chemicals Dashboard providing access to experimental and predi...
US-EPA CompTox Chemicals Dashboard providing access to experimental and predi...
 

Andere mochten auch

Conférence débat sur les réseaux sociaux
Conférence débat sur les réseaux sociauxConférence débat sur les réseaux sociaux
Conférence débat sur les réseaux sociauxRégis Gautheron
 
Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...
Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...
Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...Yue Liao
 
From Data Availability to Information Accessibility: The WellWiki Project
From Data Availability to Information Accessibility: The WellWiki ProjectFrom Data Availability to Information Accessibility: The WellWiki Project
From Data Availability to Information Accessibility: The WellWiki ProjectJoel Gehman
 
SMS Berlin 2016 Cultural Perspectives on Strategic Management
SMS Berlin 2016 Cultural Perspectives on Strategic ManagementSMS Berlin 2016 Cultural Perspectives on Strategic Management
SMS Berlin 2016 Cultural Perspectives on Strategic ManagementJoel Gehman
 
Shaping Expectations: Defining and Refining the Role of Technical Services in...
Shaping Expectations: Defining and Refining the Role of Technical Services in...Shaping Expectations: Defining and Refining the Role of Technical Services in...
Shaping Expectations: Defining and Refining the Role of Technical Services in...NASIG
 
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...Andrew McEachran
 
Web Preservation, or Managing your Organisation’s Online Presence After the O...
Web Preservation, or Managing your Organisation’s Online Presence After the O...Web Preservation, or Managing your Organisation’s Online Presence After the O...
Web Preservation, or Managing your Organisation’s Online Presence After the O...lisbk
 
Going Concerns: A Perspective from the Nexus of Business, Culture and Instit...
Going Concerns:  A Perspective from the Nexus of Business, Culture and Instit...Going Concerns:  A Perspective from the Nexus of Business, Culture and Instit...
Going Concerns: A Perspective from the Nexus of Business, Culture and Instit...Joel Gehman
 

Andere mochten auch (19)

Conférence débat sur les réseaux sociaux
Conférence débat sur les réseaux sociauxConférence débat sur les réseaux sociaux
Conférence débat sur les réseaux sociaux
 
Social Networking Tools for Scientists and Building an Online Profile
Social Networking Tools for Scientists and Building an Online ProfileSocial Networking Tools for Scientists and Building an Online Profile
Social Networking Tools for Scientists and Building an Online Profile
 
Building an Online Profile: Social Networking and Amplification Tools for Sc...
Building an Online Profile:  Social Networking and Amplification Tools for Sc...Building an Online Profile:  Social Networking and Amplification Tools for Sc...
Building an Online Profile: Social Networking and Amplification Tools for Sc...
 
Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...
Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...
Using Ecological Momentary Assessment to Examine Post-food Consumption Affect...
 
Simple Springshare Mashups: Cross-Platform Strategies for Repurposing Digital...
Simple Springshare Mashups: Cross-Platform Strategies for Repurposing Digital...Simple Springshare Mashups: Cross-Platform Strategies for Repurposing Digital...
Simple Springshare Mashups: Cross-Platform Strategies for Repurposing Digital...
 
NSF Data Management Requirements 101
NSF Data Management Requirements 101NSF Data Management Requirements 101
NSF Data Management Requirements 101
 
How One Monkey on a Typewriter Made a Difference to Online Chemistry
How One Monkey on a Typewriter Made a Difference to Online ChemistryHow One Monkey on a Typewriter Made a Difference to Online Chemistry
How One Monkey on a Typewriter Made a Difference to Online Chemistry
 
From Data Availability to Information Accessibility: The WellWiki Project
From Data Availability to Information Accessibility: The WellWiki ProjectFrom Data Availability to Information Accessibility: The WellWiki Project
From Data Availability to Information Accessibility: The WellWiki Project
 
SMS Berlin 2016 Cultural Perspectives on Strategic Management
SMS Berlin 2016 Cultural Perspectives on Strategic ManagementSMS Berlin 2016 Cultural Perspectives on Strategic Management
SMS Berlin 2016 Cultural Perspectives on Strategic Management
 
Investigating Impact Metrics for Performance for the US-EPA National Center f...
Investigating Impact Metrics for Performance for the US-EPA National Center f...Investigating Impact Metrics for Performance for the US-EPA National Center f...
Investigating Impact Metrics for Performance for the US-EPA National Center f...
 
A Bird in the Hand: Leveraging ILL Requests to Improve Electronic Resource A...
A Bird in the Hand: Leveraging ILL Requests to Improve Electronic Resource A...A Bird in the Hand: Leveraging ILL Requests to Improve Electronic Resource A...
A Bird in the Hand: Leveraging ILL Requests to Improve Electronic Resource A...
 
Social Media Tools for Scientists and Building an Online Profile
Social Media Tools for Scientists and Building an Online ProfileSocial Media Tools for Scientists and Building an Online Profile
Social Media Tools for Scientists and Building an Online Profile
 
Shaping Expectations: Defining and Refining the Role of Technical Services in...
Shaping Expectations: Defining and Refining the Role of Technical Services in...Shaping Expectations: Defining and Refining the Role of Technical Services in...
Shaping Expectations: Defining and Refining the Role of Technical Services in...
 
Building an Online Profile Using Social Networking and Amplification Tools fo...
Building an Online Profile Using Social Networking and Amplification Tools fo...Building an Online Profile Using Social Networking and Amplification Tools fo...
Building an Online Profile Using Social Networking and Amplification Tools fo...
 
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...
The EPA iCSS Chemistry Dashboard to Support Compound Identification Using Hig...
 
Web Preservation, or Managing your Organisation’s Online Presence After the O...
Web Preservation, or Managing your Organisation’s Online Presence After the O...Web Preservation, or Managing your Organisation’s Online Presence After the O...
Web Preservation, or Managing your Organisation’s Online Presence After the O...
 
Going Concerns: A Perspective from the Nexus of Business, Culture and Instit...
Going Concerns:  A Perspective from the Nexus of Business, Culture and Instit...Going Concerns:  A Perspective from the Nexus of Business, Culture and Instit...
Going Concerns: A Perspective from the Nexus of Business, Culture and Instit...
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 

Ähnlich wie The EPA Online Prediction Physicochemical Prediction Platform to Support Environmental Scientists

The influence of data curation on QSAR Modeling – Presented at American Chemi...
The influence of data curation on QSAR Modeling – Presented at American Chemi...The influence of data curation on QSAR Modeling – Presented at American Chemi...
The influence of data curation on QSAR Modeling – Presented at American Chemi...Kamel Mansouri
 
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...Kamel Mansouri
 
An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...Kamel Mansouri
 
Developing tools for high resolution mass spectrometry-based screening via th...
Developing tools for high resolution mass spectrometry-based screening via th...Developing tools for high resolution mass spectrometry-based screening via th...
Developing tools for high resolution mass spectrometry-based screening via th...Andrew McEachran
 
Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...Valery Tkachenko
 
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...Kamel Mansouri
 

Ähnlich wie The EPA Online Prediction Physicochemical Prediction Platform to Support Environmental Scientists (20)

The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Tox...
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Tox...The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Tox...
The EPA Comptox Chemistry Dashboard: A Web-Based Data Integration Hub for Tox...
 
The influence of data curation on QSAR Modeling – Presented at American Chemi...
The influence of data curation on QSAR Modeling – Presented at American Chemi...The influence of data curation on QSAR Modeling – Presented at American Chemi...
The influence of data curation on QSAR Modeling – Presented at American Chemi...
 
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
The importance of data curation on QSAR Modeling: PHYSPROP open data as a cas...
 
Accessing information for Per- & Polyfluoroalkyl Substances using the US EPA ...
Accessing information for Per- & Polyfluoroalkyl Substances using the US EPA ...Accessing information for Per- & Polyfluoroalkyl Substances using the US EPA ...
Accessing information for Per- & Polyfluoroalkyl Substances using the US EPA ...
 
Accessing information for Per- & Polyfluoroalkyl Substances using the US EPA ...
Accessing information for Per- & Polyfluoroalkyl Substances using the US EPA ...Accessing information for Per- & Polyfluoroalkyl Substances using the US EPA ...
Accessing information for Per- & Polyfluoroalkyl Substances using the US EPA ...
 
Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...Web-based access to experimental and predicted data for environmental fate, t...
Web-based access to experimental and predicted data for environmental fate, t...
 
An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...An examination of data quality on QSAR Modeling in regards to the environment...
An examination of data quality on QSAR Modeling in regards to the environment...
 
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
 
Developing tools for high resolution mass spectrometry-based screening via th...
Developing tools for high resolution mass spectrometry-based screening via th...Developing tools for high resolution mass spectrometry-based screening via th...
Developing tools for high resolution mass spectrometry-based screening via th...
 
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted AnalysisThe US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
The US-EPA CompTox Chemicals Dashboard to support Non-Targeted Analysis
 
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
The EPA Comptox Chemicals Dashboard as a Data Integration Hub for Environment...
 
Accessing Environmental Chemistry Data via Data Dashboards
Accessing Environmental Chemistry Data via Data Dashboards Accessing Environmental Chemistry Data via Data Dashboards
Accessing Environmental Chemistry Data via Data Dashboards
 
Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...Using publicly available resources to build a comprehensive knowledgebase of ...
Using publicly available resources to build a comprehensive knowledgebase of ...
 
CompTox Chemicals Dashboard: Data and tools to support chemical and environme...
CompTox Chemicals Dashboard: Data and tools to support chemical and environme...CompTox Chemicals Dashboard: Data and tools to support chemical and environme...
CompTox Chemicals Dashboard: Data and tools to support chemical and environme...
 
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
EDSP Prioritization: Collaborative Estrogen Receptor Activity Prediction Proj...
 
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
TRIANGLE AREA MASS SPECTOMETRY MEETING: Structure Identification Approaches U...
 
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
Applications of the US EPA’s CompTox Chemistry Dashboard to support structure...
 
Medical science
Medical scienceMedical science
Medical science
 
Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...Incorporating new technologies and High Throughput Screening in the design an...
Incorporating new technologies and High Throughput Screening in the design an...
 
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
The EPA CompTox Chemistry Dashboard and Underpinning Software Architecture
 

Kürzlich hochgeladen

Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...ssuser79fe74
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Monika Rani
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptxAlMamun560346
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLkantirani197
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.Nitya salvi
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 

Kürzlich hochgeladen (20)

Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 

The EPA Online Prediction Physicochemical Prediction Platform to Support Environmental Scientists

  • 1. Innovative Research for a Sustainable Futurewww.epa.gov/research ORCID: 0000-0002-2668-4821 Antony Williams l williams.antony@epa.gov l 919-541-1033 The EPA Online Database of Experimental and Predicted Data to Support Environmental Scientists Antony Williams1,*, Christopher Grulke1, Kamel Mansouri2, Jennifer Smith1, Jordan Foster1, Michelle Krzyzanowski and Jeff Edwards1 1. U.S. Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology (NCCT), Research Triangle Park, NC 2. Oak Ridge Institute for Science and Education (ORISE) Participant, Research Triangle Park, NC Problem Definition and Goals Source Data and Example Errors Automated Analysis Using KNIME Accessing ~1 Million Predicted Properties Online Future Work As part of our efforts to develop a public platform to provide access to experimental data and predictive models for physicochemical data we have delivered a web-based database, the EPA Chemistry Dashboard (https://comptox.epa.gov). In order to develop the prediction models we used publicly available data and a combination of manual and automated review processes to produce highly curated datasets. The automated curation processes for the validation of the data used a KNIME workflow and included approaches to validate different chemical structure representations (e.g. molfile and SMILES), identifiers (chemical names and registry numbers), and methods to standardize the data into QSAR-consumable formats for modeling. Machine-learning approaches were applied to create a series of models that have been used to generate predicted physicochemical and environmental parameters for over 700,000 chemicals. The data have been made available online via the EPA’s iCSS Chemistry Dashboard and includes detailed model reports regarding whether or not a chemical is contained within a particular applicability domain, provides access to nearest neighbors within the training set as well as detailed QMRF (QSAR Modeling Report Format) reports for each of the models. The iCSS Chemistry Dashboard (ICD) is a public EPA-hosted web application providing access to over 700,000 chemicals from EPA’s DSSTox database5. It integrates experimental and predicted data and is a hub to other NCCT apps and web-based resources. The 3 and 4 STAR curated experimental data are accessible via the ICD application. All chemicals were also passed through the prediction models and detailed model reports and results are freely available at http://comptox.epa.gov/dashboard. The manual investigation of the data allowed us to develop a KNIME2 workflow for automated processing. This workflow was derived from earlier work by Mansouri et. al.3 and is represented in the figure below as a series of blocks representing, for example: The KNIME workflow for automated processing of PHYSPROP data The KNIME workflow was used to insert various levels of Quality Flags indicating consistency between chemical structure formats and identifiers. The consistency flag definitions and distribution are summarized below for the >15k chemicals. Predictive models were developed using only the 3 and 4 star curated data. • The PHYSPROP data were sourced online1 as SDF files. A total of 13 endpoints were represented, including LogKow, water solubility, melting point, boiling point, and others. The largest dataset, LogKow, contained over 15,800 individual chemicals. Each data point included the Mol-Block, SMILES, CASRN, Name, LogKow value and, where available, a reference. This dataset was chosen as representative for examining the quality of data. • A manual examination of the data revealed a number of issues: e.g., SMILES and Mol-Block did not agree; CASRN did not match the correct structure; SMILES could not be converted; a single chemical structure would be listed multiple times with different property values. Example errors across various PHYSPROP datasets are shown below. References 1. PHYSPROP Data: http://esc.syrres.com/interkow/EpiSuiteData_ISIS_SDF.htm 2. KNIME: https://www.knime.org/ 3. Mansouri et al. CERAPP: Collaborative Estrogen Receptor Activity Prediction Project, Environ Health Perspect; DOI:10.1289/ehp.1510267 4. PaDEL descriptors, http://padel.nus.edu.sg/software/padeldescriptor/ 5. EPA Distributed Structure-Searchable Toxicity (DSSTox) Database, http://www.epa.gov/chemical- research/distributed-structure-searchable-toxicity-dsstox-database 6. EPA Toxicity Estimation Software Tool (T.E.S.T.) software http://www.epa.gov/chemical-research/toxicity- estimation-software-tool-test Preparing data for QSAR Modeling • Release NCCT models as interactive online prediction tools in the near future via the ICD. • Integrate the suite of EPA T.E.S.T6 physchem and toxicity prediction models to expand the collection of available models. • Source additional data to expand the training sets underpinning the prediction algorithms. 4 STAR ENHANCED: Name/CASRN/Mol/SMILES added Stereo: 550 4 STAR: 3 of 4 Name/CASRN/Mol/SMILES: 5967 3 STAR ENHANCED: 3 of 4 Name/CASRN/Mol/SMILES added Stereo: 177 3 STAR: 3 of 4 Name/CASRN/Mol/SMILES: 7910 2 STAR PLUS: 2 of 4 Name/CASRN/Mol/SMILES/Tautomer: 133 2 STAR: 2 of 4 Name/CASRN/Mol/SMILES: 1003 1 STAR: No two fields consistent 379 2627/P125 American Chemical Society Fall Meeting Philadelphia, PA. August 21-25, 2016 • Check names against dictionary (555 invalid) • Assign Quality flags based on consistency among data fields Examples of hypervalency in LogKow dataset Equivalent structures but with different CASRN, names and values in MP dataset: -4 and + 70oC Equivalent structures, different CASRN, names, and values in MP dataset This poster does not necessarily reflect EPA policy. Mention of trade names or commercial products does not constitute endorsement or recommendation for use. For the purposes of QSAR modeling, the 3 and 4 STAR datasets were processed through a KNIME workflow. This processing removed inorganics and mixtures, processed salts into neutral forms (except for melting point data), normalized tautomers, and removed duplicates. The resulting “QSAR-ready” file(s) were modeled using Genetic Algorithm-Partial Least Squares with 5-fold cross validation and utilizing 2D PaDEL4 molecular descriptors. Multiple modeling runs (100) produced the best models using a minimum number of descriptors. The models for all 13 endpoints are available as both Windows and Linux executable binaries and as a C++ library that can be called by a separate application. Model Performance The curation of the available data, utilization of large datasets for training and validation, and application of innovative machine-learning approaches produced high performance, yet simple models with a minimum number of descriptors. The logP model was built on a dataset of more than 14k chemicals using only 10 descriptors. With a wider applicability domain this model outperformed EPI Suite’s predictions (left side plot) especially for the over-estimated cluster extracted and plotted together with the new model’s predictions (right side plot). Different structures in Mol-Block and SMILES Statistics of the new model 5-fold cross-validation: Q2: 0.87 RMSE: 0.67 Fitting 11247: R2: 0.87 RMSE: 0.66 Test set prediction : R2: 0.84 RMSE: 0.65 • Compare Mol-Block and SMILES (2268 different) • Check for duplicates (657 structures, 531 names) • Check CASRN Numbers (3646 invalid CASRN) Abstract Problem: There is limited access to freely available high quality experimental and predicted data associated with chemical structures, and their associated substances, available online. Goals: To deliver access to curated datasets of experimental and physicochemical data associated with chemicals of interest to environmental scientists. Make the data available as downloadable Open data. To develop predictive models from the curated data and use to predict properties for >700,000 chemicals and make available online. To provide details regarding the performance of the models. A summary of available chemical properties for Bisphenol A The model report for the melting point prediction for Bisphenol A – including nearest neighbors. EPI Suite’s predictions. EPI Suite’s (red) Vs NCCT model’s predictions (black)