Enhancing the RSC Archive with Data and Semantics

Data Enhancing the RSC Archive
Colin Batchelor, Ken Karapetyan, Alexey
Pshenichov, Dave Sharpe, Jon Steele, Valery
Tkachenko and Antony Williams
ACS New Orleans April 2013

Overview
• The big picture
• Where we’ve been
• Statistics as well as semantics
• New directions in experimental data
• Where we’re going

The big picture
We have journal articles going back to 1841 and the
aim is to extract:
• Every small molecule we can (graphics and text)
• Reactions
• Spectra
• Data in tables
and classify every paper in a way that makes sense
to the reader.

Background
• RSC Publishing moved to an all-XML workflow
at the turn of the millennium.
• We digitized the backfile (to 1841) in 2005.
• We launched Project Prospect in 2007.
• We acquired ChemSpider in 2009.

RSC Advances

New high-volume journal covering all of chemistry
launched in 2011.

Need a sensible way of navigating all this.

http://www.rsc.org/advances
http://www.rsc.org/RSCAdvancesSubjects

Strategy

• Use topic modelling: latent Dirichlet allocation (LDA)
and Gibbs sampling to determine a set of “true” topics
Thomas L. Griffiths and Mark Steyvers, “Finding scientific topics”, Proc. Natl. Acad. Sci. USA, 2004, 101, 5228–5235.

• Publishing expertise gives us 12 broad subjects that
will be intuitive to users
• Merge first set to form second
• Tweak
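The first step of the strategy can be illustrated with a minimal collapsed Gibbs sampler for LDA, in the spirit of Griffiths and Steyvers. This is a toy sketch and not the code used for the RSC corpus: the function name lda_gibbs, the hyperparameter values and the tiny example documents are all assumptions made for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents.

    docs is a list of token lists. Returns per-document topic counts and
    per-topic word counts. All defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # n(d, k)
    topic_word = [Counter() for _ in range(n_topics)]   # n(k, w)
    topic_total = [0] * n_topics                        # n(k)
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Made-up toy corpus, purely for illustration.
docs = [
    ["ligand", "metal", "complex"],
    ["cell", "protein", "cell"],
    ["metal", "ligand"],
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2, n_iter=20)
```

The most frequent words in each topic_word counter are what a human would inspect (for example as a word cloud) before naming the topic or rejecting it.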

Classify that classification
Generated 128 topics from the articles published in 2009
and 2010 (more than 20,000 papers).

Generated Wordle images (www.wordle.net) of
the topics for internal staff.

Classify that classification: results
7 topics (75, 57, 65, 67, 82, 113, 123) were
rejected for being nonsense.
1 topic (127) was rejected for being too general.
120 topics were classified under the 12 headings
and given names.

Examples…

Examples
1: “kinetics” → Physical
2: “coordination complexes” → Inorganic
3: “general materials” → Materials
4: “misc. organic” → Organic
5: “bacteria” → Biological + Food and health
6: “theoretical” → Physical
7: “cells” → Biological
8: “water and solution chemistry” → Physical
9: “gels” → Materials
10: “inorganic material properties” → Physical + Inorganic + Materials
11: “general organic” → Organic
12: “coordination chemistry” → Inorganic
13: “photochemistry” → Inorganic + Materials + Energy
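One way a topic-to-subject table like this could be used to label a paper is sketched below. The topic numbers, names and headings are taken from the examples above; the scoring rule (each topic contributes its full weight to every heading it maps to), the 0.3 threshold and the function names are assumptions, since the slides do not say how papers are finally assigned.

```python
# A few rows of the topic table above: topic number -> (name, headings).
TOPIC_SUBJECTS = {
    1: ("kinetics", ["Physical"]),
    2: ("coordination complexes", ["Inorganic"]),
    5: ("bacteria", ["Biological", "Food and health"]),
    10: ("inorganic material properties", ["Physical", "Inorganic", "Materials"]),
    13: ("photochemistry", ["Inorganic", "Materials", "Energy"]),
}


def subject_scores(topic_weights):
    """Sum a paper's topic weights into scores per subject heading."""
    scores = {}
    for topic, weight in topic_weights.items():
        if topic not in TOPIC_SUBJECTS:
            continue  # rejected topics (e.g. 75) carry no subject
        _, subjects = TOPIC_SUBJECTS[topic]
        for subject in subjects:
            scores[subject] = scores.get(subject, 0.0) + weight
    return scores


def assign_subjects(topic_weights, threshold=0.3):
    """Return the headings whose score reaches the (assumed) threshold."""
    scores = subject_scores(topic_weights)
    return sorted(s for s, score in scores.items() if score >= threshold)


# Hypothetical topic mixture for one paper.
paper = {2: 0.5, 13: 0.25, 75: 0.25}
print(assign_subjects(paper))
```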

“Very useful!”
“Superb!”
“… will make it easier for readers to identify papers which might be interesting to them.”

What now?
Shortly rolling out the subject classification to
other general journals:
• Chemical Communications
• Chemical Science
• Journal of Materials Chemistry A, B and C
• New Journal of Chemistry

Beyond Prospect: further steps in text-mining
• Migration to Oscar 4
https://bitbucket.org/wwmm/oscar4/wiki/Home
• Multiple name-to-structure engines: OPSIN, ACD/Labs, Lexichem
• ACD/Labs Dictionary
• Better disambiguation
• Parallelization with Hadoop
• Structure validation and standardization (see later)
• Reaction extraction from text (see later)

On an experimental run with names from Organic and Biomolecular Chemistry:
is any structure returned at all by a given name-to-structure (n2s) engine?

Lexichem = a (2798)
ACD = b (3049)
OPSIN = c (3309)

Structure disagreements

Out of 2588 names where at least one of the engines differed or didn’t return a result:

A = ACD (1538 in total)
B = Lexichem (1301 in total)
C = OPSIN (2097 in total)
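The bookkeeping behind counts like these can be sketched as follows. This does not call any of the real engines: the input is assumed to be a table of already-computed results, with each structure reduced to a comparable string (for example a standardized identifier), and the names and values below are placeholders.

```python
from collections import Counter


def compare_engines(results):
    """Tally engine coverage and find contested names.

    results maps each chemical name to {engine: structure or None}.
    A name is contested if any engine returned nothing or if the
    engines returned different structures.
    """
    returned = Counter()
    contested = []
    for name, by_engine in results.items():
        for engine, structure in by_engine.items():
            if structure is not None:
                returned[engine] += 1
        structures = set(by_engine.values())
        if None in structures or len(structures) > 1:
            contested.append(name)
    return returned, contested


# Placeholder data: "S1" etc. stand in for standardized structure strings.
results = {
    "name-1": {"OPSIN": "S1", "ACD": "S1", "Lexichem": "S1"},
    "name-2": {"OPSIN": "S2", "ACD": "S3", "Lexichem": None},
    "name-3": {"OPSIN": "S4", "ACD": None, "Lexichem": None},
}
returned, contested = compare_engines(results)
print(returned)
print(contested)
```

Comparing strings only makes sense after the structures have been validated and standardized, which is one reason that step matters for the extracted structures.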

Iterations
With the Hadoop cluster, we can mine
thousands of articles a night.

We’re initially iterating over the material back to
2000, for which we have native XML. Then it’s a
case of going back and testing out the OCRed
material.

http://cv.beta.rsc-us.org/
This is the beta site for
• Extracting chemical structures from ChemDraw files
• Most importantly: structure validation and standardization

We will be using this for all of the extracted
structures.

Reaction extraction from text

We have had some preliminary experience of this with the
ChemicalTagger work of Daniel Lowe (NextMove, formerly
Cambridge).

To go to ChemSpider Reactions:
http://csr.dev.rsc-us.org/

Experimental data
We’ve already seen the possibilities for
extracting data from organic experimental
sections, but what about other sorts of data?

Given chemical structures and extracted data
we may be able to start building models and
making them available.

New directions in experimental data (1)
We are working with William Brouwer (Penn
State) to extract data from graphs.

Obviously this is faute de mieux and we’d rather
have the original data, but we’re giving a flavour
of what might be possible.

New directions in experimental data (2)
Dye solar cell data is every bit as systematic as
organic experimental sections.
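To give a flavour of why systematic reporting helps, here is a small pattern-matching sketch for the usual dye solar cell figures of merit (short-circuit current density, open-circuit voltage, fill factor and efficiency). The example sentence is invented, the patterns cover only this one phrasing, and the function name extract_cell_data is hypothetical; real articles vary far more.

```python
import re

NUM = r"(\d+(?:\.\d+)?)"

# Assumed patterns for one common way of reporting cell parameters.
PATTERNS = {
    "Jsc (mA cm-2)": re.compile(r"Jsc\s*=\s*" + NUM + r"\s*mA"),
    "Voc (V)": re.compile(r"Voc\s*=\s*" + NUM + r"\s*V"),
    "FF": re.compile(r"FF\s*=\s*" + NUM),
    "efficiency (%)": re.compile(r"efficiency\s+of\s+" + NUM + r"\s*%"),
}


def extract_cell_data(text):
    """Return the cell parameters found in text as floats."""
    found = {}
    for label, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[label] = float(match.group(1))
    return found


# Invented sentence, for illustration only.
sentence = (
    "The cell gave Jsc = 15.2 mA cm-2, Voc = 0.71 V, "
    "FF = 0.68 and an efficiency of 7.3%."
)
print(extract_cell_data(sentence))
```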

Human curation of results
Previously: built into partly-manual annotation
workflow.

Currently: macro-scale, iterative.

Coming: Challenger

DERA
• DERA will unveil, from our archive:
– Chemicals
– Reactions
– Figures
– Spectra/Analytical Data
– Property Data

– And yes… it will need curation and filtering!
