OPSIN (Open Parser for Systematic IUPAC nomenclature) has developed into a mature solution for chemical name to structure conversion. Together with other Open Source utilities such as OSCAR4, ChemSpot, and ChemicalTagger, we now have the tools to address many of the problems in chemical text mining. This ecosystem of tools has facilitated the extraction of over a million reactions from the US patent literature, and these reactions are now freely available to all under CC-Zero. I will describe advances in OPSIN, explain how reactions can be extracted from text, and present some interesting analyses made possible by the public availability of this dataset.
From Open text mining solutions to Open Data resources
1. Jean-Claude Bradley Memorial Symposium, University of Cambridge, 14th July 2014
Daniel Lowe
NextMove Software
Cambridge, UK
The idea
Accessible text (e.g. US patents) → Reaction Extraction System → Open Reaction Data resource
What is chemical name to structure?
(2S)-2-Aminobutan-1-ol
(2S)-: stereochemistry
2-: locant
amino: substituent
but: alk
an: unsaturation
-1-: locant
ol: suffix
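The decomposition of (2S)-2-aminobutan-1-ol into labelled parts can be illustrated with a toy tokenizer. This is only a sketch: the class name `NameTokenSketch`, its six-entry lexicon and the greedy first-match strategy are all assumptions made for illustration, and none of it reflects how OPSIN is actually implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameTokenSketch {
    // Hypothetical, deliberately tiny lexicon; entries are tried in insertion order.
    private static final Map<String, Pattern> LEXICON = new LinkedHashMap<>();
    static {
        int ci = Pattern.CASE_INSENSITIVE;
        LEXICON.put("stereochemistry", Pattern.compile("\\(\\d+[RS](?:,\\d+[RS])*\\)-", ci));
        LEXICON.put("locant", Pattern.compile("-?\\d+(?:,\\d+)*-", ci));
        LEXICON.put("substituent", Pattern.compile("amino|chloro|methyl", ci));
        LEXICON.put("alk", Pattern.compile("meth|eth|prop|but", ci));
        LEXICON.put("unsaturation", Pattern.compile("an|en|yn", ci));
        LEXICON.put("suffix", Pattern.compile("ol|one|al", ci));
    }

    // Splits a name into "label=text" tokens, taking the first lexicon entry
    // that matches at the current position.
    static List<String> tokenize(String name) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < name.length()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> entry : LEXICON.entrySet()) {
                Matcher m = entry.getValue().matcher(name).region(pos, name.length());
                if (m.lookingAt()) {
                    tokens.add(entry.getKey() + "=" + m.group());
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException("Cannot tokenize at position " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("(2S)-2-aminobutan-1-ol")) {
            System.out.println(token);
        }
    }
}
```

A greedy matcher like this fails quickly on real names (for example, it cannot tell the "an" of an alkane from the start of another word part), which is one reason a production parser needs a much larger lexicon and a proper grammar.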
Supported chain nomenclature
Alkanes
dodectetractkiliane
Heteroatom hydrides
pentaphosphane
Heterogeneous heteroatom hydrides
disilazane
Trivial acids
butyric acid
Supported ring nomenclature
Monocyclic spiro
dispiro[4.2.4.2]tetradecane
Hantzsch-Widman
1,3,5-triazine
Fused ring
furo[3,2-b]thieno[2,3-e]pyridine
Ring assembly
2,2':6',2''-terpyridyl
Von Baeyer
tricyclo[2.2.1.1²,⁵]octane
Polycyclic spiro
spiro[piperidine-4,9'-xanthene]
Structural assembly nomenclature
Conjunctive nomenclature
benzeneethanol
Substitutive nomenclature
2,4,6-trinitrotoluene
Additive nomenclature
methylsulfonyl
Multiplicative nomenclature
4,4'-methylenedioxydibenzoic acid
Functional class nomenclature
ethyl alcohol
Structural modifications
Heteroatom replacement
1-thia-4-aza-2,6-disilacyclohexane
Unsaturation
hexa-1,3-dien-5-yne
Hydro, dehydro, indicated hydrogen and added hydrogen
2,7-dihydro-1H-azepine
Functional replacement
1-chloro-2,4-diimidotricarbonic acid
Suffixes including infixed suffixes
methanedithioic acid
Lambda convention
2λ⁶-trisulfane
Bridges and stereochemistry
Bridges
4a,8a-propanoquinoline
E/Z stereochemistry
(Z)-2-chloro-but-2-ene
Relative cis/trans stereochemistry
trans-2,6-dimethyl-2,6-dihydronaphthalene
R/S stereochemistry
(1R,3S)-3-amino-3-methylcyclohexanol
Miscellaneous nomenclature
Groups with indeterminately positioned structural features
1,3-xylene
Charge and oxidation numbers
methylmercury(1+) or methylmercury(II)
Subtractive nomenclature
2-deoxy-ᴅ-ribose
“per-nomenclature”
perhydroanthracene
perchlorobenzene
Polymer nomenclature
Structure-based polymer nomenclature
poly[(benzo[1,2-d:4,5-d']bis[1,3]thiazole-2,6-diyl)-1,4-phenyleneoxy-1,3-phenylene(1,3,5,7-tetraoxo-1,2,3,5,6,7-hexahydrobenzo[1,2-c:4,5-c']dipyrrole-2,6-diyl)-1,3-phenyleneoxy-1,4-phenylene]
Domain specific nomenclature
Steroid nomenclature
17β-Hydroxy-8α,9β,10α-androst-4-en-3-one
Amino acid
ʟ-leucinamide
Oligopeptide
ʟ-arginyl-O-phosphono-ʟ-seryl-ʟ-alanyl-ʟ-proline
Cyclic peptide
cyclo(ᴅ-alanyl-ʟ-phenylalanyl)
Nucleotide nomenclature
guanylyl(3'-5')uridine 3'-monophosphate
Carbohydrates
ʟ-ribo-ᴅ-manno-nonose
2,7-anhydro-ᴅ-glycero-β-ᴅ-galacto-oct-2-ulopyranosonic acid
β-ᴅ-Fructofuranosyl α-ᴅ-glucopyranoside
β-ᴅ-glucopyranosyl-(1→3)-β-ᴅ-glucopyranosyl-(1→3)-ᴅ-glucopyranose
Usage
Java API
NameToStructure nts = NameToStructure.getInstance();
String chemicalName = "acetonitrile";
String smiles = nts.parseToSmiles(chemicalName);
Batch conversion on the command line
java -jar opsin-1.6.0-jar-with-dependencies.jar -osmi input.txt output.smi
RESTful web service (opsin.ch.cam.ac.uk)
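For the RESTful web service, a request URL can be built from a chemical name. The sketch below assumes the URL pattern `https://opsin.ch.cam.ac.uk/opsin/<name>.<format>` with `smi` as the SMILES format; this pattern is not stated on the slide, so check it against the service's own documentation. The class `OpsinUrlSketch` is hypothetical and makes no network call.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class OpsinUrlSketch {
    // Assumed base path of the web service; verify before relying on it.
    static final String BASE = "https://opsin.ch.cam.ac.uk/opsin/";

    // Percent-encodes the name so that spaces, brackets and primes
    // survive as a single URL path segment.
    static String buildUrl(String chemicalName, String format) {
        try {
            // URLEncoder targets form data and writes spaces as '+';
            // a literal '+' in the name is already encoded as %2B,
            // so the replacement below only affects spaces.
            String encoded = URLEncoder.encode(chemicalName, "UTF-8").replace("+", "%20");
            return BASE + encoded + "." + format;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("acetonitrile", "smi"));
        System.out.println(buildUrl("ethyl acetate", "smi"));
    }
}
```

The resulting URL can then be fetched with any HTTP client; the response body would be the structure in the requested format.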
Who is using OPSIN?
Commercial software
Cinfony (interface to Python)
Many text mining efforts
Workflows
Web services
Steps involved
• Identifying experimental sections
• Identifying chemical entities
• Chemical name to structure conversion (including anaphora resolution)
• Associating chemical entities with quantities
• Assigning chemical roles
• Atom-atom mapping
Example
Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate
To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg,
2.1 mmol, 1 eq) in DCM (35 ml) was added
pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N
(540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred
at room temperature until all of the starting material was
consumed. The solvent was evaporated in vacuo and the
residue redissolved in ethyl acetate (10 ml), washed with
water (10 ml), saturated sodium hydrogen carbonate (10
ml), dried over sodium sulphate, filtered and evaporated to
yield the title compound as a white solid (690 mg, 1.8
mmol, 85%).
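The step of associating chemical entities with quantities can be sketched on this paragraph. The example assumes the entity names have already been identified and simply reads the parenthesised amounts that directly follow each one; the class `QuantitySketch` and its regular expressions are illustrative assumptions, not the extraction system's actual logic.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantitySketch {
    // A number followed by one of a few units seen in the example text.
    private static final Pattern QUANTITY =
        Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*(mmol|mg|ml|eq|%)");

    // For each known entity name, parses the bracketed group right after it.
    static Map<String, Map<String, Double>> quantitiesFor(String text, List<String> entities) {
        Map<String, Map<String, Double>> result = new LinkedHashMap<>();
        for (String entity : entities) {
            Pattern afterEntity =
                Pattern.compile(Pattern.quote(entity) + "\\s*\\(([^()]*)\\)");
            Matcher group = afterEntity.matcher(text);
            if (!group.find()) {
                continue; // entity has no directly attached quantities
            }
            Map<String, Double> amounts = new LinkedHashMap<>();
            Matcher q = QUANTITY.matcher(group.group(1));
            while (q.find()) {
                amounts.put(q.group(2), Double.parseDouble(q.group(1)));
            }
            result.put(entity, amounts);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) "
            + "in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) "
            + "and Et3N (540 mg, 5.4 mmol, 2.5 eq)";
        List<String> entities = Arrays.asList(
            "methyl 4-(chlorosulfonyl)benzoate", "DCM", "pentafluorophenol", "Et3N");
        System.out.println(quantitiesFor(text, entities));
    }
}
```

Quantities that are not adjacent to a name, such as the yield of "the title compound", need the anaphora resolution and role assignment steps and are out of scope for this sketch.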
CML output
<reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-..
<dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([..
<productList>
<product role="product">
<molecule id="m0">
<name dictRef="nameDict:unknown">title compound</name>
</molecule>
<amount units="unit:mmol">1.8</amount>
<amount units="unit:mg">690</amount>
<amount units="unit:percentYield">85.0</amount>
<identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
<identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H..
<dl:entityType>definiteReference</dl:entityType>
<dl:state>solid</dl:state>
</product>
</productList>
<reactantList>
<reactant role="reactant" count="1">
<molecule id="m1">
<name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name>
</molecule>
<amount units="unit:mmol">2.1</amount>
<amount units="unit:mg">606</amount>
<amount units="unit:eq">1.0</amount>
<identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>
• Reaction SMILES
• SMILES and InChIs for every structure-resolvable reagent/product
• Quantities, including yield, are extracted
• Each entity is classified as an exact compound, definite reference, chemical class or fragment
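Reading amounts and identifiers back out of such CML can be sketched with the JDK's built-in DOM parser. The `SAMPLE` string below is a cut-down, well-formed fragment modelled on the product entry shown above; it omits the `dl:` elements because their namespace is truncated in the excerpt. The class `CmlReactionSketch` and its helper names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class CmlReactionSketch {
    // Reduced fragment based on the product entry in the slide.
    static final String SAMPLE =
        "<reaction xmlns=\"http://www.xml-cml.org/schema\">"
        + "<productList><product role=\"product\">"
        + "<molecule id=\"m0\"><name dictRef=\"nameDict:unknown\">title compound</name></molecule>"
        + "<amount units=\"unit:mmol\">1.8</amount>"
        + "<amount units=\"unit:mg\">690</amount>"
        + "<amount units=\"unit:percentYield\">85.0</amount>"
        + "<identifier dictRef=\"cml:smiles\" "
        + "value=\"FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F\"/>"
        + "</product></productList></reaction>";

    // Parses the CML text and returns its first <product> element.
    static Element firstProduct(String cml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(cml.getBytes(StandardCharsets.UTF_8)));
            return (Element) doc.getElementsByTagName("product").item(0);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse CML", e);
        }
    }

    // Returns the amount with the given units attribute, or null if absent.
    static Double amount(Element component, String units) {
        NodeList amounts = component.getElementsByTagName("amount");
        for (int i = 0; i < amounts.getLength(); i++) {
            Element a = (Element) amounts.item(i);
            if (units.equals(a.getAttribute("units"))) {
                return Double.parseDouble(a.getTextContent().trim());
            }
        }
        return null;
    }

    // Returns the identifier value for the given dictRef, or null if absent.
    static String identifier(Element component, String dictRef) {
        NodeList ids = component.getElementsByTagName("identifier");
        for (int i = 0; i < ids.getLength(); i++) {
            Element id = (Element) ids.item(i);
            if (dictRef.equals(id.getAttribute("dictRef"))) {
                return id.getAttribute("value");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Element product = firstProduct(SAMPLE);
        System.out.println(amount(product, "unit:percentYield"));
        System.out.println(identifier(product, "cml:smiles"));
    }
}
```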
Current status
• ~1 million reactions from US patent applications (2001-2013)
• ~1 million reactions from US patent grants (1976-2013)
• Over a million constitutionally distinct reactions at a minimum
https://bitbucket.org/dan2097/patent-reaction-extraction/downloads
Are solvents getting greener?
1976                      2013
Water (21%)               Tetrahydrofuran (15%)
Ethanol (11%)             Dichloromethane (14%)
Benzene (8%)              Water (13%)
Methanol (7%)             Dimethylformamide (10%)
Tetrahydrofuran (5%)      Methanol (8%)
Dichloromethane (4%)      Ethyl acetate (7%)
Dimethylformamide (4%)    Ethanol (5%)
Acetic acid (4%)          1,4-Dioxane (4%)
Chloroform (3%)           Toluene (3%)
Acetone (3%)              Acetonitrile (3%)
Total for top 10:
71%                       82%
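The two columns can be compared programmatically. The sketch below re-enters the rounded percentages from the slide, sums each column, and lists the solvents that dropped out of the top 10 between 1976 and 2013. Summing the rounded 1976 values gives 70 rather than the stated 71%, presumably because the slide's total was computed from unrounded shares; that explanation is an assumption. The class `SolventTrendSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SolventTrendSketch {
    // Builds an insertion-ordered solvent -> percentage map from parallel arrays.
    static Map<String, Integer> ranking(String[] names, int[] percentages) {
        Map<String, Integer> map = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            map.put(names[i], percentages[i]);
        }
        return map;
    }

    // Rounded values as printed on the slide.
    static final Map<String, Integer> Y1976 = ranking(
        new String[] {"Water", "Ethanol", "Benzene", "Methanol", "Tetrahydrofuran",
            "Dichloromethane", "Dimethylformamide", "Acetic acid", "Chloroform", "Acetone"},
        new int[] {21, 11, 8, 7, 5, 4, 4, 4, 3, 3});

    static final Map<String, Integer> Y2013 = ranking(
        new String[] {"Tetrahydrofuran", "Dichloromethane", "Water", "Dimethylformamide",
            "Methanol", "Ethyl acetate", "Ethanol", "1,4-Dioxane", "Toluene", "Acetonitrile"},
        new int[] {15, 14, 13, 10, 8, 7, 5, 4, 3, 3});

    // Sum of the rounded percentages in one column.
    static int total(Map<String, Integer> ranking) {
        int sum = 0;
        for (int value : ranking.values()) {
            sum += value;
        }
        return sum;
    }

    // Solvents present in the earlier top 10 but absent from the later one.
    static List<String> droppedOut(Map<String, Integer> earlier, Map<String, Integer> later) {
        List<String> dropped = new ArrayList<>();
        for (String solvent : earlier.keySet()) {
            if (!later.containsKey(solvent)) {
                dropped.add(solvent);
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        System.out.println("1976 total: " + total(Y1976));
        System.out.println("2013 total: " + total(Y2013));
        System.out.println("Left the top 10: " + droppedOut(Y1976, Y2013));
        System.out.println("Entered the top 10: " + droppedOut(Y2013, Y1976));
    }
}
```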
Conclusions
Open Source tools facilitate reuse and remixing of code
Open Data allows reuse in an infinite number of potential applications and analyses
Acknowledgements
• Albina Asadulina
• Peter Corbett
• Robert Glen
• David Jessop
• Lezan Hawizy
• Peter Murray-Rust
• Roger Sayle
Thank you for your time!
http://nextmovesoftware.com
http://nextmovesoftware.com/blog
daniel@nextmovesoftware.com