SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Data Enhancing the
         RSC Archive
   Colin Batchelor, Ken Karapetyan, Alexey
Pshenichov, Dave Sharpe, Jon Steele, Valery
           Tkachenko and Antony Williams
              ACS New Orleans April 2013
Overview
•   The big picture
•   Where we’ve been
•   Statistics as well as semantics
•   New directions in experimental data
•   Where we’re going
The big picture
We have journal articles going back to 1841 and the
aim is to extract:
•Every small molecule we can (graphics and text)
•Reactions
•Spectra
•Data in tables
and classify every paper in a way that makes sense
to the reader.
Background
• RSC Publishing moved to an all-XML workflow
  at the turn of the millennium.
• We digitized the backfile (to 1841) in 2005.
• We launched Project Prospect in 2007.
• We acquired ChemSpider in 2009.
RSC Advances

New high-volume journal covering all of chemistry
  launched in 2011.

Need a sensible way of navigating all this.

http://www.rsc.org/advances
http://www.rsc.org/RSCAdvancesSubjects
Strategy

• Use topic modelling: latent Dirichlet allocation (LDA)
  and Gibbs sampling to determine a set of “true” topics
Thomas L. Griffiths and Mark Steyvers, “Finding scientific topics”, Proc. Natl. Acad. Sci. USA, 2004, 101, 5228–5235.




• Publishing expertise gives us 12 broad subjects that
  will be intuitive to users
• Merge first set to form second
• Tweak
Classify that classification
Generated 128 topics based on 2009 and 2010’s
 articles (> 20000 papers).

Generated Wordle images (www.wordle.net) of
 the topics for internal staff.
Classify that classification: results
7 topics (75, 57, 65, 67, 82, 113, 123) were
  rejected for being nonsense.
1 topic (127) was rejected for being too general.
120 topics were classified under the 12 headings
  and given names.

Examples…
Examples
1: “kinetics” → Physical
2: “coordination complexes” → Inorganic
3: “general materials” → Materials
4: “misc. organic” → Organic
5: “bacteria” → Biological + Food and health
6: “theoretical” → Physical
7: “cells” → Bio
8: “water and solution chemistry” → Physical
9: “gels” → Materials
10: “inorganic material properties” → Physical + Inorganic + Materials
11: “general organic” → Organic
12: “coordination chemistry” → Inorganic
13: “photochemistry” → Inorganic + Materials + Energy
“Very useful!”
 “Superb!”
“… will make it
easier for
readers to
identify papers
which might be
interesting to
them.”
What now?
Shortly rolling out the subject classification to
other general journals:
•Chemical Communications
•Chemical Science
•Journal of Materials Chemistry A, B and C
•New Journal of Chemistry
Beyond Prospect: further steps in
           text-mining
Migration to Oscar 4
https://bitbucket.org/wwmm/oscar4/wiki/Home
Multiple name to structure engines
      OPSIN, ACD/Labs, Lexichem
ACD/Labs Dictionary
Better disambiguation
Parallelization with Hadoop
Structure validation and standardization (see later)
Reaction extraction from text (see later)
On an experimental
run with names from
Organic and
Biomolecular Chemistry

Is any structure
returned at all by a
given n2s engine?

Lexichem = a (2798)
ACD = b (3049)
OPSIN = c (3309)
Structure
disagreements

Out of 2588 names
where at least one of
the engines differed
or didn’t return a
result:

A = ACD
(1538 in total)
B = Lexichem
(1301 in total)
C = OPSIN
(2097 in total)
Iterations
With the Hadoop cluster, we can mine
thousands of articles a night.

We’re initially iterating over the material back to
2000, for which we have native XML. Then it’s a
case of going back and testing out the OCRed
material.
http://cv.beta.rsc-us.org/
This is the beta site for
•Extracting chemical structures from ChemDraw
files
•Most importantly: structure validation and
standardization

We will be using this for all of the extracted
structures.
Reaction extraction from text



We have had some preliminary experience of this with Daniel
Lowe (NextMove, formerly Cambridge)’s ChemicalTagger
work.

To go to ChemSpider Reactions:
       http://csr.dev.rsc-us.org/
Experimental data
We’ve already seen the possibilities for
extracting data from organic experimental
sections, but what about other sorts of data?

Given chemical structures and extracted data
we may be able to start building models and
making them available.
New directions in experimental
             data (1)
We are working with William Brouwer (Penn
State) to extract data from graphs.

Obviously this is faute de mieux and we’d rather
have the original data, but we’re giving a flavour
of what might be possible.
Recent Work
Digitized Spectrum
Comparison of Spectra
And now on ChemSpider…
New directions in experimental
             data (2)
Dye solar cell data is every bit as systematic as
organic experimental sections.
Human curation of results
Previously: built into partly-manual annotation
workflow.

Currently: macro-scale, iterative.

Coming: Challenger
DERA
• DERA will unveil from our archive
  – Chemicals
  – Reactions
  – Figures
  – Spectra/Analytical Data
  – Property Data

  – And yes….it will need curation and filtering!

Weitere ähnliche Inhalte

Was ist angesagt?

Software tools to facilitate materials science research
Software tools to facilitate materials science researchSoftware tools to facilitate materials science research
Software tools to facilitate materials science researchAnubhav Jain
 
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Anubhav Jain
 
Accelerating materials design through natural language processing
Accelerating materials design through natural language processingAccelerating materials design through natural language processing
Accelerating materials design through natural language processingAnubhav Jain
 
Bio solr building a better search for bioinformatics
Bio solr   building a better search for bioinformaticsBio solr   building a better search for bioinformatics
Bio solr building a better search for bioinformaticsCharlie Hull
 
UKCORR members day 2019: If you’ve got it, flaunt it: Repository improvements...
UKCORR members day 2019: If you’ve got it, flaunt it: Repository improvements...UKCORR members day 2019: If you’ve got it, flaunt it: Repository improvements...
UKCORR members day 2019: If you’ve got it, flaunt it: Repository improvements...ukcorr
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
Open Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsOpen Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsAnubhav Jain
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
 
Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignAnubhav Jain
 
Reaxys structure searching
Reaxys structure searchingReaxys structure searching
Reaxys structure searchingAnn-Marie Roche
 
Machine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsAnubhav Jain
 
Continuous modeling - automating model building on high-performance e-Infrast...
Continuous modeling - automating model building on high-performance e-Infrast...Continuous modeling - automating model building on high-performance e-Infrast...
Continuous modeling - automating model building on high-performance e-Infrast...Ola Spjuth
 

Was ist angesagt? (20)

Software tools to facilitate materials science research
Software tools to facilitate materials science researchSoftware tools to facilitate materials science research
Software tools to facilitate materials science research
 
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
 
Accelerating materials design through natural language processing
Accelerating materials design through natural language processingAccelerating materials design through natural language processing
Accelerating materials design through natural language processing
 
ICME Workshop Jul 2014 - The Materials Project
ICME Workshop Jul 2014 - The Materials ProjectICME Workshop Jul 2014 - The Materials Project
ICME Workshop Jul 2014 - The Materials Project
 
Bio solr building a better search for bioinformatics
Bio solr   building a better search for bioinformaticsBio solr   building a better search for bioinformatics
Bio solr building a better search for bioinformatics
 
UKCORR members day 2019: If you’ve got it, flaunt it: Repository improvements...
UKCORR members day 2019: If you’ve got it, flaunt it: Repository improvements...UKCORR members day 2019: If you’ve got it, flaunt it: Repository improvements...
UKCORR members day 2019: If you’ve got it, flaunt it: Repository improvements...
 
Open science 2014
Open science 2014Open science 2014
Open science 2014
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
Peer Review and Science2.0
Peer Review and Science2.0Peer Review and Science2.0
Peer Review and Science2.0
 
Open Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsOpen Source Tools for Materials Informatics
Open Source Tools for Materials Informatics
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
Dealing with the complex challenge of managing diverse analytical chemistry d...
Dealing with the complex challenge of managing diverse analytical chemistry d...Dealing with the complex challenge of managing diverse analytical chemistry d...
Dealing with the complex challenge of managing diverse analytical chemistry d...
 
UPennONS
UPennONSUPennONS
UPennONS
 
Is 20TB really Big Data?
Is 20TB really Big Data?Is 20TB really Big Data?
Is 20TB really Big Data?
 
Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials Design
 
Reaxys structure searching
Reaxys structure searchingReaxys structure searching
Reaxys structure searching
 
Machine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methods
 
Websci17 final
Websci17 finalWebsci17 final
Websci17 final
 
Continuous modeling - automating model building on high-performance e-Infrast...
Continuous modeling - automating model building on high-performance e-Infrast...Continuous modeling - automating model building on high-performance e-Infrast...
Continuous modeling - automating model building on high-performance e-Infrast...
 
SWAT4LS 2014 SLIDE by Yamamoto
SWAT4LS 2014 SLIDE by YamamotoSWAT4LS 2014 SLIDE by Yamamoto
SWAT4LS 2014 SLIDE by Yamamoto
 

Andere mochten auch

Nuevos soportes 6c y d
Nuevos soportes 6c y dNuevos soportes 6c y d
Nuevos soportes 6c y dfpdcaro
 
A product-focused introduction to Machine Learning
A product-focused introduction to Machine LearningA product-focused introduction to Machine Learning
A product-focused introduction to Machine LearningSatpreet Singh
 
10 Insightful Quotes On Designing A Better Customer Experience
10 Insightful Quotes On Designing A Better Customer Experience10 Insightful Quotes On Designing A Better Customer Experience
10 Insightful Quotes On Designing A Better Customer ExperienceYuan Wang
 
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika AldabaLightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldabaux singapore
 
Learn BEM: CSS Naming Convention
Learn BEM: CSS Naming ConventionLearn BEM: CSS Naming Convention
Learn BEM: CSS Naming ConventionIn a Rocket
 
SEO: Getting Personal
SEO: Getting PersonalSEO: Getting Personal
SEO: Getting PersonalKirsty Hulse
 
The Outcome Economy
The Outcome EconomyThe Outcome Economy
The Outcome EconomyHelge Tennø
 

Andere mochten auch (8)

Nuevos soportes 6c y d
Nuevos soportes 6c y dNuevos soportes 6c y d
Nuevos soportes 6c y d
 
A product-focused introduction to Machine Learning
A product-focused introduction to Machine LearningA product-focused introduction to Machine Learning
A product-focused introduction to Machine Learning
 
Sifət (1) powerpoint
Sifət (1) powerpointSifət (1) powerpoint
Sifət (1) powerpoint
 
10 Insightful Quotes On Designing A Better Customer Experience
10 Insightful Quotes On Designing A Better Customer Experience10 Insightful Quotes On Designing A Better Customer Experience
10 Insightful Quotes On Designing A Better Customer Experience
 
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika AldabaLightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
Lightning Talk #9: How UX and Data Storytelling Can Shape Policy by Mika Aldaba
 
Learn BEM: CSS Naming Convention
Learn BEM: CSS Naming ConventionLearn BEM: CSS Naming Convention
Learn BEM: CSS Naming Convention
 
SEO: Getting Personal
SEO: Getting PersonalSEO: Getting Personal
SEO: Getting Personal
 
The Outcome Economy
The Outcome EconomyThe Outcome Economy
The Outcome Economy
 

Ähnlich wie Enhancing the RSC Archive with Data and Semantics

Digitally enabling the RSC archive
Digitally enabling the RSC archiveDigitally enabling the RSC archive
Digitally enabling the RSC archiveKen Karapetyan
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...Anubhav Jain
 
Materials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningMaterials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningAnubhav Jain
 
Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Anubhav Jain
 
Predicting Molecular Properties
Predicting Molecular PropertiesPredicting Molecular Properties
Predicting Molecular PropertiesYassin Youssfi
 
Discovering advanced materials for energy applications (with high-throughput ...
Discovering advanced materials for energy applications (with high-throughput ...Discovering advanced materials for energy applications (with high-throughput ...
Discovering advanced materials for energy applications (with high-throughput ...Anubhav Jain
 
The Materials Project: Applications to energy storage and functional materia...
The Materials Project: Applications to energy storage and functional materia...The Materials Project: Applications to energy storage and functional materia...
The Materials Project: Applications to energy storage and functional materia...Anubhav Jain
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applicationsaimsnist
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsAnubhav Jain
 
The eCrystals Federation
The eCrystals FederationThe eCrystals Federation
The eCrystals FederationManjulaPatel
 
Materials Project computation and database infrastructure
Materials Project computation and database infrastructureMaterials Project computation and database infrastructure
Materials Project computation and database infrastructureAnubhav Jain
 
Mining Big datasets to create and validate machine learning models
Mining Big datasets to create and validate machine learning modelsMining Big datasets to create and validate machine learning models
Mining Big datasets to create and validate machine learning modelsSean Ekins
 
Presentation of ECOSTBio Action CM1305 at APC Keflavik (Iceland)
Presentation of ECOSTBio Action CM1305 at APC Keflavik (Iceland)Presentation of ECOSTBio Action CM1305 at APC Keflavik (Iceland)
Presentation of ECOSTBio Action CM1305 at APC Keflavik (Iceland)Marcel Swart
 
Acs denver dirks potenzone 30 aug2011
Acs denver dirks potenzone 30 aug2011Acs denver dirks potenzone 30 aug2011
Acs denver dirks potenzone 30 aug2011Rudy Potenzone
 
Combining density functional theory calculations, supercomputing, and data-dr...
Combining density functional theory calculations, supercomputing, and data-dr...Combining density functional theory calculations, supercomputing, and data-dr...
Combining density functional theory calculations, supercomputing, and data-dr...Anubhav Jain
 
Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Anubhav Jain
 

Ähnlich wie Enhancing the RSC Archive with Data and Semantics (20)

Digitally enabling the RSC archive
Digitally enabling the RSC archiveDigitally enabling the RSC archive
Digitally enabling the RSC archive
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...
 
Materials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningMaterials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learning
 
Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...
 
Predicting Molecular Properties
Predicting Molecular PropertiesPredicting Molecular Properties
Predicting Molecular Properties
 
Discovering advanced materials for energy applications (with high-throughput ...
Discovering advanced materials for energy applications (with high-throughput ...Discovering advanced materials for energy applications (with high-throughput ...
Discovering advanced materials for energy applications (with high-throughput ...
 
ChemSpider reactions – delivering a free community resource of chemical synth...
ChemSpider reactions – delivering a free community resource of chemical synth...ChemSpider reactions – delivering a free community resource of chemical synth...
ChemSpider reactions – delivering a free community resource of chemical synth...
 
OpenSciNY Open Notebook Science
OpenSciNY Open Notebook ScienceOpenSciNY Open Notebook Science
OpenSciNY Open Notebook Science
 
The Materials Project: Applications to energy storage and functional materia...
The Materials Project: Applications to energy storage and functional materia...The Materials Project: Applications to energy storage and functional materia...
The Materials Project: Applications to energy storage and functional materia...
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
 
The eCrystals Federation
The eCrystals FederationThe eCrystals Federation
The eCrystals Federation
 
Materials Project computation and database infrastructure
Materials Project computation and database infrastructureMaterials Project computation and database infrastructure
Materials Project computation and database infrastructure
 
Mining Big datasets to create and validate machine learning models
Mining Big datasets to create and validate machine learning modelsMining Big datasets to create and validate machine learning models
Mining Big datasets to create and validate machine learning models
 
Serving the medicinal chemistry community with Royal Society of Chemistry che...
Serving the medicinal chemistry community with Royal Society of Chemistry che...Serving the medicinal chemistry community with Royal Society of Chemistry che...
Serving the medicinal chemistry community with Royal Society of Chemistry che...
 
Presentation of ECOSTBio Action CM1305 at APC Keflavik (Iceland)
Presentation of ECOSTBio Action CM1305 at APC Keflavik (Iceland)Presentation of ECOSTBio Action CM1305 at APC Keflavik (Iceland)
Presentation of ECOSTBio Action CM1305 at APC Keflavik (Iceland)
 
Acs denver dirks potenzone 30 aug2011
Acs denver dirks potenzone 30 aug2011Acs denver dirks potenzone 30 aug2011
Acs denver dirks potenzone 30 aug2011
 
Combining density functional theory calculations, supercomputing, and data-dr...
Combining density functional theory calculations, supercomputing, and data-dr...Combining density functional theory calculations, supercomputing, and data-dr...
Combining density functional theory calculations, supercomputing, and data-dr...
 
Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...
 

Kürzlich hochgeladen

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Kürzlich hochgeladen (20)

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Enhancing the RSC Archive with Data and Semantics

  • 1. Data Enhancing the RSC Archive Colin Batchelor, Ken Karapetyan, Alexey Pshenichov, Dave Sharpe, Jon Steele, Valery Tkachenko and Antony Williams ACS New Orleans April 2013
  • 2. Overview • The big picture • Where we’ve been • Statistics as well as semantics • New directions in experimental data • Where we’re going
  • 3. The big picture We have journal articles going back to 1841 and the aim is to extract: •Every small molecule we can (graphics and text) •Reactions •Spectra •Data in tables and classify every paper in a way that makes sense to the reader.
  • 4. Background • RSC Publishing moved to an all-XML workflow at the turn of the millennium. • We digitized the backfile (to 1841) in 2005. • We launched Project Prospect in 2007. • We acquired ChemSpider in 2009.
  • 5. RSC Advances New high-volume journal covering all of chemistry launched in 2011. Need a sensible way of navigating all this. http://www.rsc.org/advances http://www.rsc.org/RSCAdvancesSubjects
  • 6. Strategy • Use topic modelling: latent Dirichlet allocation (LDA) and Gibbs sampling to determine a set of “true” topics Thomas L. Griffiths and Mark Steyvers, “Finding scientific topics”, Proc. Natl. Acad. Sci. USA, 2004, 101, 5228–5235. • Publishing expertise gives us 12 broad subjects that will be intuitive to users • Merge first set to form second • Tweak
  • 7. Classify that classification Generated 128 topics based on 2009 and 2010’s articles (> 20000 papers). Generated Wordle images (www.wordle.net) of the topics for internal staff.
  • 8.
  • 9. Classify that classification: results 7 topics (75, 57, 65, 67, 82, 113, 123) were rejected for being nonsense. 1 topic (127) was rejected for being too general. 120 topics were classified under the 12 headings and given names. Examples…
  • 10. Examples 1: “kinetics” → Physical 2: “coordination complexes” → Inorganic 3: “general materials” → Materials 4: “misc. organic” → Organic 5: “bacteria” → Biological + Food and health 6: “theoretical” → Physical 7: “cells” → Bio 8: “water and solution chemistry” → Physical 9: “gels” → Materials 10: “inorganic material properties” → Physical + Inorganic + Materials 11: “general organic” → Organic 12: “coordination chemistry” → Inorganic 13: “photochemistry” → Inorganic + Materials + Energy
  • 11. “Very useful!” “Superb!” “… will make it easier for readers to identify papers which might be interesting to them.”
  • 12. What now? Shortly rolling out the subject classification to other general journals: •Chemical Communications •Chemical Science •Journal of Materials Chemistry A, B and C •New Journal of Chemistry
  • 13. Beyond Prospect: further steps in text-mining Migration to Oscar 4 https://bitbucket.org/wwmm/oscar4/wiki/Home Multiple name to structure engines OPSIN, ACD/Labs, Lexichem ACD/Labs Dictionary Better disambiguation Parallelization with Hadoop Structure validation and standardization (see later) Reaction extraction from text (see later)
  • 14. On an experimental run with names from Organic and Biomolecular Chemistry Is any structure returned at all by a given n2s engine? Lexichem = a (2798) ACD = b (3049) OPSIN = c (3309)
  • 15. Structure disagreements Out of 2588 names where at least one of the engines differed or didn’t return a result: A = ACD (1538 in total) B = Lexichem (1301 in total) C = OPSIN (2097 in total)
  • 16. Iterations With the Hadoop cluster, we can mine thousands of articles a night. We’re initially iterating over the material back to 2000, for which we have native XML. Then it’s a case of going back and testing out the OCRed material.
  • 17. http://cv.beta.rsc-us.org/ This is the beta site for •Extracting chemical structures from ChemDraw files •Most importantly: structure validation and standardization We will be using this for all of the extracted structures.
  • 18.
  • 19.
  • 20. Reaction extraction from text We have had some preliminary experience of this with Daniel Lowe (NextMove, formerly Cambridge)’s ChemicalTagger work. To go to ChemSpider Reactions: http://csr.dev.rsc-us.org/
  • 21. Experimental data We’ve already seen the possibilities for extracting data from organic experimental sections, but what about other sorts of data? Given chemical structures and extracted data we may be able to start building models and making them available.
  • 22. New directions in experimental data (1) We are working with William Brouwer (Penn State) to extract data from graphs. Obviously this is faute de mieux and we’d rather have the original data, but we’re giving a flavour of what might be possible.
  • 26. And now on ChemSpider…
  • 27.
  • 28. New directions in experimental data (2) Dye solar cell data is every bit as systematic as organic experimental sections.
  • 29. Human curation of results Previously: built into partly-manual annotation workflow. Currently: macro-scale, iterative. Coming: Challenger
  • 30. DERA • DERA will unveil from our archive – Chemicals – Reactions – Figures – Spectra/Analytical Data – Property Data – And yes….it will need curation and filtering!