Data enhancing the Royal
Society of Chemistry
publication archive
Antony Williams, Colin Batchelor,
Peter Corbett, Ken Karapetyan and
Valery Tkachenko
ACS Dallas
March 2014
Data Enhancing the RSC
Archive
• Publications summarise
data acquisition, analysis
and conclusions.
• Much detail in the data
• Improved navigation
includes data access
• Reanalysis of data is
limited in PDFs
Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-
thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5
ml ) and benzene ( 50 ml ) were charged into a glass reaction
vessel equipped with a mechanical stirrer , thermometer and
reflux condenser .
The reaction mixture was heated at reflux with stirring , for a
period of about one-half hour .
After this time the benzene and unreacted thionyl chloride were
stripped from the reaction mixture under reduced pressure to
yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-
trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
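As a rough illustration of the kind of information that can be pulled out of a paragraph like this one, the sketch below finds reagent/amount pairs written as `name ( number unit )`. It is a toy regular expression written for this example only, not the pipeline used on the archive; the function name `extract_amounts` and the list of units are assumptions.

```python
import re

# Hypothetical pattern: a name made of letters, spaces and hyphens,
# followed by a parenthesised amount such as "( 5 ml )".
AMOUNT = re.compile(
    r"([A-Za-z][A-Za-z \-]*?)\s*\(\s*(\d+(?:\.\d+)?)\s*(ml|mg|g|mmol)\s*\)"
)


def extract_amounts(text):
    """Return (name, value, unit) tuples for each 'name ( value unit )' span."""
    results = []
    for name, value, unit in AMOUNT.findall(text):
        # Drop a leading conjunction picked up from list-like sentences.
        name = re.sub(r"^(?:and|or)\s+", "", name.strip())
        results.append((name, float(value), unit))
    return results


sentence = (
    "thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged "
    "into a glass reaction vessel"
)
amounts = extract_amounts(sentence)
print(amounts)
```

A pattern this simple will miss systematic names containing digits, brackets or Greek letters, which is one reason dictionary- and parser-based approaches are needed in practice.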
How is DERA going? TEXT
• We have text-mined all 21st-century articles: >100k articles from 2000-2013
• Most are marked up in XML, which is more structured and easier to handle. The markup is mostly published on the HTML versions of the articles
• Required multiple iterations based on
dictionaries, markup, OSCAR extraction
• New visualization approaches in development
Chemical Validation and
Standardization
The RSC Data Repository
[Architecture diagram. Inputs: web UI for unified depositions; DropBox, Google Drive, SkyDrive, etc.; LabTrove and other templated data; documents; API, FTP, etc. Components: Deposition Gateway; staging databases; Compounds, Spectra, Reactions, Materials, Textmining and further modules. Databases: Compounds, Reactions, Spectra, Materials, Articles / CSSP. Flow labels: raw data, validated data.]
• All databases are sliced by data sources/data collections and have a simple security model where each data slice/source is private, public or embargoed
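The security model described for the repository can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the repository's actual schema: the class names, the `owner` field, the embargo end date and the `can_read` rule are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass
class DataSlice:
    source: str                            # data source / collection name
    visibility: Visibility
    owner: str                             # assumption: each slice has one owner
    embargo_until: Optional[date] = None   # assumption: embargoes carry an end date


def can_read(data_slice, user, today):
    """Hypothetical access rule: owners always read, public is open to all,
    embargoed slices open to everyone once the embargo date has passed."""
    if data_slice.visibility is Visibility.PUBLIC:
        return True
    if user == data_slice.owner:
        return True
    if (data_slice.visibility is Visibility.EMBARGOED
            and data_slice.embargo_until is not None):
        return today >= data_slice.embargo_until
    return False


embargoed = DataSlice("example-collection", Visibility.EMBARGOED,
                      owner="alice", embargo_until=date(2015, 1, 1))
print(can_read(embargoed, "bob", date(2014, 3, 1)))
print(can_read(embargoed, "bob", date(2015, 6, 1)))
```

Attaching visibility to the slice rather than to individual records keeps the check to a single lookup per data source.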
Text-Mining
ChemSpider Reactions
Reactions
• We will put reactions from our databases into
the Reactions Repository
• We will use “Reaction Validation” procedures
to clean up Daniel Lowe’s USPTO patent set
of over a million extracted reactions
• We will move ChemSpider SyntheticPages
content to the Reactions Repository
• We will use the RXNO Ontology to classify
the reactions
Reaction Deposition/Validation
ESI – Text Spectra
Lots of “Textual Spectra”
1H NMR (CDCl3, 400 MHz):
δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35
(t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8
Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane),
30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19,
120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42,
123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
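To show why such "textual spectra" are machine-readable in principle, here is a minimal sketch that parses the 1H NMR string above into a peak list. The regular expression and the `Peak` type are assumptions made for this example; real ESI text varies far more than this pattern allows for.

```python
import re
from typing import List, NamedTuple, Optional


class Peak(NamedTuple):
    shift_ppm: float
    range_end_ppm: Optional[float]  # set only for ranges such as 7.18–7.94
    multiplicity: str
    protons: int


# A shift (or shift range), then "(multiplicity, nH" as in "4.24 (d, 1H".
PEAK = re.compile(r"(\d+\.\d+)(?:[–-](\d+\.\d+))?\s*\(\s*([a-z]+)\s*,\s*(\d+)H")


def parse_1h_nmr(text: str) -> List[Peak]:
    """Extract peaks from a textual 1H NMR description."""
    peaks = []
    for shift, end, mult, n_h in PEAK.findall(text):
        peaks.append(Peak(float(shift), float(end) if end else None, mult, int(n_h)))
    return peaks


spectrum = (
    "δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), "
    "4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), "
    "4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), "
    "7.18–7.94 (m, 11H, ArH)"
)
peaks = parse_1h_nmr(spectrum)
total_protons = sum(p.protons for p in peaks)
print(len(peaks), total_protons)
```

The summed proton count can then be compared with the hydrogen count of the structure generated from the compound name, as a first consistency check.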
How is DERA going? Text Spectra
• Overall progress is good
• Improved algorithms for extraction of spectra
• Extraction of the compound name associated with each spectrum, now with name-to-structure conversion
• MestreLabs have provided us with a batch conversion tool
• Work in progress: manual and automated validation and, in theory, auto-assignment
Visualization of Spectra
• For spectra associated with compounds we
would like to view “interactive spectra”
Javascript viewer with JMol
Figure Spectra into “Real
Spectra”?
• We are turning text into structures
• We are turning text into spectra
• And we are turning figures into spectra
Turn “Figures” Into Data
[Images: two examples, each showing a published FIGURE alongside the EXTRACTED DATA]
How is DERA going? Figures
• Validation tests performed with William
Brouwer. Good enough to proceed with
larger test set
• Ready to run process across larger collection
• Focus on 21st-century articles only for now
Early Test Experiments
• Input: 74 supplementary data documents / 3444 pages
• Output: p2t extracted content in 1069 page instances
  − 578 molecules
    • ~10% false positives, e.g. the Bruker logo classified as a chemical object
    • ~20% false negatives, e.g. some symbols missing from a structure
  − 1151 spectra
    • >80% of peaks extracted to within 1-2 decimal places (ppm)
Validating Spectra
• How will we check data consistency?
• How do we know the structure and the
spectra match? Comparing image to
spectrum is NOT enough!!!
• Predict spectra, use spectral verification, use
algorithmic checking.
• Flag “dodgy data” and use crowdsourcing for
data checking
• MULTIPLE prediction technologies now
available – VERIFICATION is tougher
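One simple form of the algorithmic checking mentioned above is to compare extracted peak positions with predicted shifts and flag records where too few agree. The sketch below is an assumption-laden illustration: the tolerance, the threshold, the function names and the "predicted" values are all invented for the example, and no specific prediction tool is implied.

```python
def match_fraction(observed, predicted, tolerance_ppm=0.3):
    """Fraction of predicted shifts that have an observed peak within tolerance."""
    if not predicted:
        return 0.0
    matched = sum(
        1 for p in predicted
        if any(abs(p - o) <= tolerance_ppm for o in observed)
    )
    return matched / len(predicted)


def is_dodgy(observed, predicted, min_fraction=0.8, tolerance_ppm=0.3):
    """Flag a structure/spectrum pair for manual or crowdsourced checking."""
    return match_fraction(observed, predicted, tolerance_ppm) < min_fraction


# Aliphatic 13C shifts taken from the textual spectrum shown earlier.
observed = [14.12, 30.11, 30.77, 66.12, 68.49]
# Made-up "predicted" values, for illustration only.
predicted_consistent = [14.0, 30.0, 31.0, 66.0, 68.5]
predicted_inconsistent = [21.0, 55.0, 77.0, 128.0, 170.0]

print(is_dodgy(observed, predicted_consistent))
print(is_dodgy(observed, predicted_inconsistent))
```

A check like this only says that a spectrum is plausible for a structure; it does not verify the assignment, which is why verification is described as the tougher problem.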
What are we extracting?
• Compounds from compound names
• Reactions from the text
• Spectral extraction – from figures and text
• Extraction of data from “tables” – not only
CSV files but literal tables in the publication –
specifically data from MedChemComm as
proof of concept
Building out the technology
• We are presently Open-Sourcing a chemical
registration system developed for OpenPHACTS
• We will then Open Source the Chemical
Validation and Standardization Platform
• We are working with Bob Hanson and Bob
Lancashire on Jmol/JSpecView Open Source
• We will deliver a set of Open Source widgets for
structure handling/visualization
Javascript viewer NMR, MS, IR
Grand Target
• Fingers crossed to get 21st-century spectra converted
• Spectra associated with compounds will go
into ChemSpider
• Spectra converted from Figures but without
compound association will be captured with
Figures into the Data Repository
• Focus on IR, Raman, UV-Vis & 1D NMR
DERA is FINE for an archive
The WRONG WAY otherwise!
• We should NOT be mining data out of future
publications
• Structures should be submitted “correctly”
• Spectra should be digital spectral formats,
not images
• ESI should be RICH and interactive
• Data should be open, available, with metadata and provenance
We can solve for Authors here
Will it be used though???
Advanced ESI
Conclusions
• Great progress in mining the archive, and 21st-century articles are being enhanced iteratively on the publishing platform
• Spectral Data is the next focus – directly
connected to our work on the data repository
• Reaction extraction, processing and
validation from articles is progressing more
slowly
• Results are content, software components and Open Source contributions
Acknowledgments
• Bill Brouwer – Plot2Txt Development
• Carlos Cobas and Santi Dominguez
• Bob Hanson and Bob Lancashire for
Jmol/JSpecView Javascript version
• Leah McEwan and Will Dichtel
• ACD/Labs – Provider of spectroscopy tools
Thank you
Email: williamsa@rsc.org
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams