The Royal Society of Chemistry has an archive of published journals and books stretching back to 1841. In the past decade we have digitized this archive and semantically enriched our frontfile data with chemical structures linked to ChemSpider, our free online chemical compound database. In this talk we will survey our recent efforts to extract chemical structures, experimental data and bibliographic data from both our backfile and frontfile. We will also discuss our future work to extract chemical reactions for hosting in our ChemSpider Reactions database, the potential of optical structure recognition technologies for converting structure images to structures, and the use of similar techniques to convert experimental spectral data into interactive data formats. A key aspect of this project is the delivery of a crowdsourcing platform for the interactive annotation and validation of the extracted data.
1. Data Enhancing the RSC Archive
Colin Batchelor, Ken Karapetyan, Alexey Pshenichnov, Dave Sharpe, Jon Steele, Valery Tkachenko and Antony Williams
ACS New Orleans April 2013
2. Overview
• The big picture
• Where we’ve been
• Statistics as well as semantics
• New directions in experimental data
• Where we’re going
3. The big picture
We have journal articles going back to 1841 and the
aim is to extract:
• Every small molecule we can (graphics and text)
• Reactions
• Spectra
• Data in tables
and classify every paper in a way that makes sense
to the reader.
4. Background
• RSC Publishing moved to an all-XML workflow
at the turn of the millennium.
• We digitized the backfile (to 1841) in 2005.
• We launched Project Prospect in 2007.
• We acquired ChemSpider in 2009.
5. RSC Advances
New high-volume journal covering all of chemistry
launched in 2011.
Need a sensible way of navigating all this.
http://www.rsc.org/advances
http://www.rsc.org/RSCAdvancesSubjects
6. Strategy
• Use topic modelling: latent Dirichlet allocation (LDA)
and Gibbs sampling to determine a set of “true” topics
Thomas L. Griffiths and Mark Steyvers, “Finding scientific topics”, Proc. Natl. Acad. Sci. USA, 2004, 101, 5228–5235.
• Publishing expertise gives us 12 broad subjects that
will be intuitive to users
• Merge first set to form second
• Tweak
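The topic-modelling step can be illustrated with a minimal sketch of collapsed Gibbs sampling for LDA, in the spirit of the Griffiths and Steyvers paper cited above. This is not the production code: the function names, the hyperparameters (alpha, beta), the iteration count and the toy input format (a list of token lists) are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of token lists.

    alpha, beta, n_iter and seed are illustrative defaults (assumptions).
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]                 # n(d, k)
    topic_word = [defaultdict(int) for _ in range(n_topics)]   # n(k, w)
    topic_total = [0] * n_topics                               # n(k)
    assign = []                                                # topic of each token

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = j | all other assignments), up to a constant factor.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

def top_words(topic_word, n=3):
    """Most frequent words per topic, ignoring words with a zero count."""
    return [
        [w for w in sorted(tw, key=lambda w: (-tw[w], w)) if tw[w] > 0][:n]
        for tw in topic_word
    ]
```

Lists of top words per topic, such as those returned by the hypothetical `top_words` helper, are the kind of output that a human can then inspect, name and merge into broader subjects.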
7. Classify that classification
Generated 128 topics from the articles published in 2009 and 2010
(more than 20,000 papers).
Generated Wordle images (www.wordle.net) of
the topics for internal staff.
9. Classify that classification: results
7 topics (75, 57, 65, 67, 82, 113, 123) were
rejected for being nonsense.
1 topic (127) was rejected for being too general.
120 topics were classified under the 12 headings
and given names.
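The bookkeeping on this slide can be sketched as follows. The rejected topic numbers come from the slide; the zero-based topic numbering, the subject names and the topic-to-subject mapping in the usage note are hypothetical, since the 12 headings are not listed here.

```python
N_TOPICS = 128
REJECTED_NONSENSE = {57, 65, 67, 75, 82, 113, 123}
REJECTED_TOO_GENERAL = {127}

def kept_topics():
    """Topic numbers that survive the manual review (assumes numbering from 0)."""
    rejected = REJECTED_NONSENSE | REJECTED_TOO_GENERAL
    return [t for t in range(N_TOPICS) if t not in rejected]

def subject_weights(topic_weights, topic_to_subject):
    """Sum a paper's per-topic weights into normalized per-subject weights.

    Topics missing from topic_to_subject (rejected or unmapped) are skipped.
    """
    totals = {}
    for topic, weight in topic_weights.items():
        subject = topic_to_subject.get(topic)
        if subject is None:
            continue
        totals[subject] = totals.get(subject, 0.0) + weight
    norm = sum(totals.values())
    return {s: w / norm for s, w in totals.items()} if norm else {}
```

For example, with a hypothetical mapping `{3: "Catalysis", 4: "Catalysis", 9: "Materials"}`, a paper weighted 0.2, 0.2 and 0.4 on topics 3, 4 and 9 (plus 0.2 on the rejected topic 127) would be split evenly between the two subjects.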
Examples…
11. “Very useful!”
“Superb!”
“… will make it easier for readers to identify papers which might be interesting to them.”
12. What now?
We will shortly be rolling out the subject classification to
other general journals:
• Chemical Communications
• Chemical Science
• Journal of Materials Chemistry A, B and C
• New Journal of Chemistry
13. Beyond Prospect: further steps in text-mining
Migration to Oscar 4
https://bitbucket.org/wwmm/oscar4/wiki/Home
Multiple name to structure engines
OPSIN, ACD/Labs, Lexichem
ACD/Labs Dictionary
Better disambiguation
Parallelization with Hadoop
Structure validation and standardization (see later)
Reaction extraction from text (see later)
14. On an experimental run with names from Organic and Biomolecular Chemistry
Is any structure returned at all by a given name-to-structure (n2s) engine?
Lexichem = a (2798)
ACD = b (3049)
OPSIN = c (3309)
15. Structure disagreements
Out of 2588 names where at least one of the engines differed or didn’t return a result:
A = ACD (1538 in total)
B = Lexichem (1301 in total)
C = OPSIN (2097 in total)
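One way to triage such disagreements is a simple vote across the engines. The sketch below is an assumption about how this might be done, not a description of the pipeline used here: it calls no real engine API, and it assumes each engine's output has already been converted to a comparable canonical identifier (for instance a standard InChI string), with None meaning that no structure was returned.

```python
from collections import Counter

def compare_engines(results):
    """Classify agreement between name-to-structure engines for one name.

    results maps an engine name to a canonical structure identifier,
    or to None if that engine returned nothing (hypothetical input format).
    """
    returned = {e: s for e, s in results.items() if s is not None}
    if not returned:
        return {"status": "no_result", "structure": None, "support": []}
    counts = Counter(returned.values())
    structure, votes = counts.most_common(1)[0]
    if len(returned) == 1:
        status = "single"        # only one engine produced a structure
    elif len(counts) == 1 and len(returned) == len(results):
        status = "unanimous"     # every engine returned the same structure
    elif votes >= 2:
        status = "majority"      # at least two engines agree
    else:
        status = "conflict"      # every engine returned something different
        structure = None
    support = sorted(e for e, s in returned.items() if s == structure)
    return {"status": status, "structure": structure, "support": support}
```

Under this scheme every status other than "unanimous" corresponds to a name where at least one engine differed or did not return a result, which is the population counted on this slide.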
16. Iterations
With the Hadoop cluster, we can mine
thousands of articles a night.
We’re initially iterating over the material back to
2000, for which we have native XML. Then it’s a
case of going back and testing out the OCRed
material.
17. http://cv.beta.rsc-us.org/
This is the beta site for
• Extracting chemical structures from ChemDraw files
• Most importantly: structure validation and standardization
We will be using this for all of the extracted
structures.
20. Reaction extraction from text
We have had some preliminary experience of this through the
ChemicalTagger work of Daniel Lowe (NextMove, formerly Cambridge).
To go to ChemSpider Reactions:
http://csr.dev.rsc-us.org/
21. Experimental data
We’ve already seen the possibilities for
extracting data from organic experimental
sections, but what about other sorts of data?
Given chemical structures and extracted data
we may be able to start building models and
making them available.
22. New directions in experimental data (1)
We are working with William Brouwer (Penn
State) to extract data from graphs.
Obviously this is faute de mieux and we’d rather
have the original data, but we’re giving a flavour
of what might be possible.
28. New directions in experimental data (2)
Dye solar cell data is every bit as systematic as
organic experimental sections.
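To give a flavour of what "systematic" means here, the sketch below pulls the usual photovoltaic parameters out of a sentence with regular expressions. The patterns, field names and example sentence are hypothetical and cover only one phrasing; real articles vary far more. The consistency check assumes the standard illumination of 100 mW cm-2, under which the efficiency in percent is numerically Jsc (mA cm-2) × Voc (V) × FF.

```python
import re

NUMBER = r"(\d+(?:\.\d+)?)"

# Hypothetical patterns for one common way of reporting cell parameters.
PATTERNS = {
    "jsc_mA_cm2": re.compile(r"\bJsc\s*(?:=|of)\s*" + NUMBER + r"\s*mA", re.IGNORECASE),
    "voc_V": re.compile(r"\bVoc\s*(?:=|of)\s*" + NUMBER + r"\s*V\b", re.IGNORECASE),
    "fill_factor": re.compile(r"\bFF\s*(?:=|of)\s*" + NUMBER),
    "efficiency_pct": re.compile(r"\befficiency\s*(?:=|of)\s*" + NUMBER + r"\s*%", re.IGNORECASE),
}

def extract_cell_data(text):
    """Return whichever of the four parameters can be found in text."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[name] = float(match.group(1))
    return found

def implied_efficiency_pct(data, incident_mW_cm2=100.0):
    """Efficiency implied by Jsc, Voc and FF (assumes 100 mW cm-2 by default)."""
    power_out = data["jsc_mA_cm2"] * data["voc_V"] * data["fill_factor"]
    return 100.0 * power_out / incident_mW_cm2

# Invented example sentence, not taken from any article.
SENTENCE = ("The cell gave Jsc = 15.2 mA cm-2, Voc = 0.72 V, "
            "FF = 0.68 and an efficiency of 7.4%.")
```

Comparing the reported efficiency with the implied value is one cheap filter for flagging extractions that need human curation.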
29. Human curation of results
Previously: built into partly-manual annotation
workflow.
Currently: macro-scale, iterative.
Coming: Challenger
30. DERA
• DERA (Data Enhancing the RSC Archive) will unveil from our archive
– Chemicals
– Reactions
– Figures
– Spectra/Analytical Data
– Property Data
– And yes… it will need curation and filtering!