Data Enhancing the RSC Archive
Colin Batchelor, Ken Karapetyan, Alexey Pshenichov, Dave Sharpe, Jon Steele, Valery Tkachenko and Antony Williams
ACS New Orleans, April 2013
Overview
•   The big picture
•   Where we’ve been
•   Statistics as well as semantics
•   New directions in experimental data
•   Where we’re going
The big picture
We have journal articles going back to 1841 and the
aim is to extract:
• Every small molecule we can (graphics and text)
• Reactions
• Spectra
• Data in tables
and classify every paper in a way that makes sense
to the reader.
Background
• RSC Publishing moved to an all-XML workflow
  at the turn of the millennium.
• We digitized the backfile (to 1841) in 2005.
• We launched Project Prospect in 2007.
• We acquired ChemSpider in 2009.
RSC Advances

New high-volume journal covering all of chemistry
  launched in 2011.

Need a sensible way of navigating all this.

http://www.rsc.org/advances
http://www.rsc.org/RSCAdvancesSubjects
Strategy

• Use topic modelling: latent Dirichlet allocation (LDA)
  and Gibbs sampling to determine a set of “true” topics
  (Thomas L. Griffiths and Mark Steyvers, “Finding scientific topics”, Proc. Natl. Acad. Sci. USA, 2004, 101, 5228–5235)
• Publishing expertise gives us 12 broad subjects that
  will be intuitive to users
• Merge the first set (the LDA topics) to form the second (the 12 broad subjects)
• Tweak
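
The topic-modelling step can be sketched as follows. This is a minimal collapsed Gibbs sampler for LDA in the spirit of Griffiths and Steyvers, not the code used for RSC Advances. The function name `lda_gibbs`, the hyperparameter values and the toy tokenisation are all assumptions made for illustration.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents.

    docs: list of lists of word strings.
    alpha, beta: symmetric Dirichlet hyperparameters (assumed values).
    Returns (doc_topic, topic_word) count tables.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(n_topics)]  # times word w is in topic k
    topic_total = [0] * n_topics                       # tokens in topic k
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = j | all other assignments), up to a constant factor.
                weights = [
                    (topic_word[j][w] + beta) / (topic_total[j] + vocab_size * beta)
                    * (doc_topic[d][j] + alpha)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word
```

The most frequent words in each `topic_word[k]` are what a human would inspect, for example as a word cloud, before naming the topic or rejecting it.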
Classify that classification
Generated 128 topics from the articles published in 2009
 and 2010 (> 20,000 papers).

Generated Wordle images (www.wordle.net) of
 the topics for internal staff.
Classify that classification: results
7 topics (75, 57, 65, 67, 82, 113, 123) were
  rejected for being nonsense.
1 topic (127) was rejected for being too general.
120 topics were classified under the 12 headings
  and given names.

Examples…
Examples
1: “kinetics” → Physical
2: “coordination complexes” → Inorganic
3: “general materials” → Materials
4: “misc. organic” → Organic
5: “bacteria” → Biological + Food and health
6: “theoretical” → Physical
7: “cells” → Bio
8: “water and solution chemistry” → Physical
9: “gels” → Materials
10: “inorganic material properties” → Physical + Inorganic + Materials
11: “general organic” → Organic
12: “coordination chemistry” → Inorganic
13: “photochemistry” → Inorganic + Materials + Energy
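
One way to turn a mapping like the one above into subject labels for a paper is sketched below. The topic-to-subject entries and the rejected topic numbers are copied from the slides; the function `subjects_for_paper`, the per-paper topic weights and the `threshold` cut-off are hypothetical, since the slides do not say how papers are assigned.

```python
# Topic-to-subject assignments copied from the examples above (a subset).
TOPIC_SUBJECTS = {
    1: {"Physical"},
    2: {"Inorganic"},
    5: {"Biological", "Food and health"},
    10: {"Physical", "Inorganic", "Materials"},
    13: {"Inorganic", "Materials", "Energy"},
}

# Topics rejected as nonsense or as too general.
REJECTED = {57, 65, 67, 75, 82, 113, 123, 127}

def subjects_for_paper(topic_weights, threshold=0.2):
    """Return the sorted subject headings for one paper.

    topic_weights: dict of topic number -> share of the paper's words
    assigned to that topic. threshold is an assumed cut-off below which
    a topic is ignored.
    """
    subjects = set()
    for topic, weight in topic_weights.items():
        if topic in REJECTED or weight < threshold:
            continue
        subjects |= TOPIC_SUBJECTS.get(topic, set())
    return sorted(subjects)
```

For instance, `subjects_for_paper({1: 0.5, 13: 0.3, 127: 0.2})` ignores the rejected topic 127 and returns the headings for topics 1 and 13.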
“Very useful!”
“Superb!”
“… will make it easier for readers to identify papers which might be interesting to them.”
What now?
Shortly rolling out the subject classification to
other general journals:
• Chemical Communications
• Chemical Science
• Journal of Materials Chemistry A, B and C
• New Journal of Chemistry
Beyond Prospect: further steps in text-mining
• Migration to Oscar 4 (https://bitbucket.org/wwmm/oscar4/wiki/Home)
• Multiple name-to-structure engines: OPSIN, ACD/Labs, Lexichem
• ACD/Labs Dictionary
• Better disambiguation
• Parallelization with Hadoop
• Structure validation and standardization (see later)
• Reaction extraction from text (see later)
On an experimental run with names from Organic and Biomolecular Chemistry

Is any structure returned at all by a given name-to-structure (n2s) engine?

Lexichem = a (2798)
ACD = b (3049)
OPSIN = c (3309)
Structure disagreements

Out of 2588 names where at least one of the engines differed or didn’t return a result:

A = ACD (1538 in total)
B = Lexichem (1301 in total)
C = OPSIN (2097 in total)
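
The kind of bookkeeping behind these counts can be sketched as below. This is an illustration only: `compare_engines` is a hypothetical helper that works on results already collected from each engine, and it does not call OPSIN, ACD/Labs or Lexichem. It assumes each engine's output has been reduced to a comparable identifier such as a standard InChI.

```python
from collections import Counter

def compare_engines(results):
    """Summarise name-to-structure output from several engines.

    results: dict of chemical name -> dict of engine name -> structure
    identifier, or None where the engine returned nothing.
    Returns (coverage, disputed): how many names each engine resolved,
    and the names where an engine failed or the engines disagreed.
    """
    coverage = Counter()
    disputed = []
    for name, by_engine in results.items():
        for engine, structure in by_engine.items():
            if structure is not None:
                coverage[engine] += 1
        values = list(by_engine.values())
        # Disputed if any engine returned nothing or the answers differ.
        if None in values or len(set(values)) > 1:
            disputed.append(name)
    return coverage, disputed
```

The disputed names are the ones worth passing to validation or human curation; names on which all engines agree can be accepted with more confidence.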
Iterations
With the Hadoop cluster, we can mine
thousands of articles a night.

We’re initially iterating over the material back to
2000, for which we have native XML. Then it’s a
case of going back and testing out the OCRed
material.
http://cv.beta.rsc-us.org/
This is the beta site for
• Extracting chemical structures from ChemDraw files
• Most importantly: structure validation and standardization

We will be using this for all of the extracted structures.
Reaction extraction from text



We have had some preliminary experience of this through the
ChemicalTagger work of Daniel Lowe (NextMove, formerly Cambridge).

To go to ChemSpider Reactions:
       http://csr.dev.rsc-us.org/
Experimental data
We’ve already seen the possibilities for
extracting data from organic experimental
sections, but what about other sorts of data?

Given chemical structures and extracted data
we may be able to start building models and
making them available.
New directions in experimental data (1)
We are working with William Brouwer (Penn
State) to extract data from graphs.

Obviously this is faute de mieux and we’d rather
have the original data, but we’re giving a flavour
of what might be possible.
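
One small step in extracting data from a graph is converting traced pixel positions back into data values. A minimal sketch follows, assuming linear axes, a curve that has already been traced as pixel coordinates, and two tick marks of known value per axis. The function names are hypothetical and this is not a description of the method used in the collaboration.

```python
def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Linear map from a pixel coordinate to a data value along one axis,
    calibrated from two tick marks whose data values are known."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale

def digitize(curve_pixels, x_map, y_map):
    """Convert traced (column, row) pixel positions to (x, y) data points."""
    return [(x_map(col), y_map(row)) for col, row in curve_pixels]
```

Image rows usually increase downwards, so the y-axis calibration typically has a negative scale; the two-point form handles that without special casing.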
Recent Work
Digitized Spectrum
Comparison of Spectra
And now on ChemSpider…
New directions in experimental data (2)
Dye solar cell data is every bit as systematic as
organic experimental sections.
Human curation of results
Previously: built into partly-manual annotation
workflow.

Currently: macro-scale, iterative.

Coming: Challenger
DERA
• DERA will unveil from our archive
  – Chemicals
  – Reactions
  – Figures
  – Spectra/Analytical Data
  – Property Data

  – And yes… it will need curation and filtering!
