Jean-Claude Bradley Memorial Symposium, University of Cambridge, 14th July 2014
From Open text mining solutions to Open Data resources
Daniel Lowe
NextMove Software
Cambridge, UK

The idea
Accessible text (e.g. US patents) → Reaction Extraction System → Open Reaction Data resource

Building on existing projects

What is chemical name to structure?
(2S)-2-Aminobutan-1-ol
(2S)- : stereochemistry
2- : locant
Amino : substituent
but : alk
an : unsaturation
-1- : locant
ol : suffix

Supported chain nomenclature
Alkanes: dodectetractkiliane
Heteroatom hydrides: pentaphosphane
Heterogeneous heteroatom hydrides: disilazane
Trivial acids: butyric acid

Supported ring nomenclature
Monocyclic spiro: dispiro[4.2.4.2]tetradecane
Hantzsch-Widman: 1,3,5-triazine
Fused ring: furo[3,2-b]thieno[2,3-e]pyridine
Ring assembly: 2,2':6',2''-terpyridyl
Von Baeyer: tricyclo[2.2.1.1²,⁵]octane
Polycyclic spiro: spiro[piperidine-4,9'-xanthene]

Structural assembly nomenclature
Conjunctive nomenclature: benzeneethanol
Substitutive nomenclature: 2,4,6-trinitrotoluene
Additive nomenclature: methylsulfonyl
Multiplicative nomenclature: 4,4'-methylenedioxydibenzoic acid
Functional class nomenclature: ethyl alcohol

Structural modifications
Heteroatom replacement: 1-thia-4-aza-2,6-disilacyclohexane
Unsaturation: hexa-1,3-dien-5-yne
Hydro, dehydro, indicated hydrogen and added hydrogen: 2,7-dihydro-1H-azepine
Functional replacement: 1-chloro-2,4-diimidotricarbonic acid
Suffixes including infixed suffixes: methanedithioic acid
Lambda convention: 2λ⁶-trisulfane

Bridges and stereochemistry
Bridges: 4a,8a-propanoquinoline
E/Z stereochemistry: (Z)-2-chloro-but-2-ene
Relative cis/trans stereochemistry: trans-2,6-dimethyl-2,6-dihydronaphthalene
R/S stereochemistry: (1R,3S)-3-amino-3-methylcyclohexanol

Miscellaneous nomenclature
Groups with indeterminately positioned structural features: 1,3-xylene
Charge and oxidation numbers: methylmercury(1+) or methylmercury(II)
Subtractive nomenclature: 2-deoxy-ᴅ-ribose
"per-nomenclature": perhydroanthracene, perchlorobenzene

Polymer nomenclature
Structure-based polymer nomenclature:
poly[(benzo[1,2-d:4,5-d']bis[1,3]thiazole-2,6-diyl)-1,4-phenyleneoxy-1,3-phenylene(1,3,5,7-tetraoxo-1,2,3,5,6,7-hexahydrobenzo[1,2-c:4,5-c']dipyrrole-2,6-diyl)-1,3-phenyleneoxy-1,4-phenylene]

Domain specific nomenclature
Steroid nomenclature: 17β-Hydroxy-8α,9β,10α-androst-4-en-3-one
Amino acid: ʟ-leucinamide
Oligopeptide: ʟ-arginyl-O-phosphono-ʟ-seryl-ʟ-alanyl-ʟ-proline
Cyclic peptide: cyclo(ᴅ-alanyl-ʟ-phenylalanyl)
Nucleotide nomenclature: guanylyl(3'-5')uridine 3'-monophosphate

Carbohydrates
ʟ-ribo-ᴅ-manno-nonose
2,7-anhydro-D-glycero-β-D-galacto-oct-2-ulopyranosonic acid
β-ᴅ-Fructofuranosyl α-ᴅ-glucopyranoside
β-ᴅ-glucopyranosyl-(1→3)-β-ᴅ-glucopyranosyl-(1→3)-ᴅ-glucopyranose

Usage
Batch conversion on the command line:
java -jar opsin-1.6.0-jar-with-dependencies.jar -osmi input.txt output.smi
RESTful web service (opsin.ch.cam.ac.uk)
Java API:
NameToStructure nts = NameToStructure.getInstance();
String chemicalName = "acetonitrile";
String smiles = nts.parseToSmiles(chemicalName);
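The web service route listed on this slide can be sketched as follows. The slide gives only the host name, so the `/opsin/<name>.<format>` path layout, the `smi` format extension, and the class and method names here are assumptions for illustration, not something stated in the talk. The sketch only builds the request URL; it does not contact the server.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class OpsinUrlSketch {
    // Assumed base path of the web service; only the host name appears on the slide.
    static final String BASE = "https://opsin.ch.cam.ac.uk/opsin/";

    /** Builds a request URL for a chemical name and an output format such as "smi". */
    static String buildUrl(String chemicalName, String format) {
        try {
            // URLEncoder targets form encoding, so spaces come out as '+';
            // convert them to %20 because the name is used as a path segment.
            String encoded = URLEncoder.encode(chemicalName, "UTF-8").replace("+", "%20");
            return BASE + encoded + "." + format;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("acetonitrile", "smi"));
        System.out.println(buildUrl("ethyl acetate", "smi"));
    }
}
```

Encoding matters here because systematic names routinely contain spaces, brackets, primes and commas, none of which can be placed into a URL path unescaped.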

Who is using OPSIN?
• Commercial software
• Cinfony (interface to Python)
• Many text mining efforts
• Workflows
• Web services

Steps involved
• Identifying experimental sections
• Identifying chemical entities
• Chemical name to structure conversion (including anaphora resolution)
• Associating chemical entities with quantities
• Assigning chemical roles
• Atom-atom mapping

Example
Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate
To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg,
2.1 mmol, 1 eq) in DCM (35 ml) was added
pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N
(540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred
at room temperature until all of the starting material was
consumed. The solvent was evaporated in vacuo and the
residue redissolved in ethyl acetate (10 ml), washed with
water (10 ml), saturated sodium hydrogen carbonate (10
ml), dried over sodium sulphate, filtered and evaporated to
yield the title compound as a white solid (690 mg, 1.8
mmol, 85%).
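One of the listed steps, associating chemical entities with quantities, can be illustrated on phrases from this procedure. This is not the extraction system's actual method, which the slides do not describe. It is a minimal regular-expression sketch, and the class name `QuantitySketch`, the `Quantity` type and the short unit list are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantitySketch {
    /** Hypothetical value type: a number and the unit that followed it. */
    static final class Quantity {
        final double value;
        final String unit;
        Quantity(double value, String unit) { this.value = value; this.unit = unit; }
        @Override public String toString() { return value + " " + unit; }
    }

    // A number (optionally with a decimal part), optional whitespace, then a known unit.
    // The unit list is an assumption covering only the units seen in the example text.
    private static final Pattern QUANTITY =
            Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*(mmol|mg|ml|eq|%)");

    /** Returns every number-plus-unit pair found in the text, in order of appearance. */
    static List<Quantity> extract(String text) {
        List<Quantity> found = new ArrayList<>();
        Matcher m = QUANTITY.matcher(text);
        while (m.find()) {
            found.add(new Quantity(Double.parseDouble(m.group(1)), m.group(2)));
        }
        return found;
    }

    public static void main(String[] args) {
        String phrase = "methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq)";
        for (Quantity q : extract(phrase)) {
            System.out.println(q);
        }
    }
}
```

Digits inside names, such as the locant in "4-(chlorosulfonyl)" or the subscript in "Et3N", are skipped because no unit follows them. A real system would also have to decide which entity each quantity belongs to, which this sketch leaves out.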

Graphical Output

CML output
<reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-..
<dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([..
<productList>
<product role="product">
<molecule id="m0">
<name dictRef="nameDict:unknown">title compound</name>
</molecule>
<amount units="unit:mmol">1.8</amount>
<amount units="unit:mg">690</amount>
<amount units="unit:percentYield">85.0</amount>
<identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
<identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H..
<dl:entityType>definiteReference</dl:entityType>
<dl:state>solid</dl:state>
</product>
</productList>
<reactantList>
<reactant role="reactant" count="1">
<molecule id="m1">
<name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name>
</molecule>
<amount units="unit:mmol">2.1</amount>
<amount units="unit:mg">606</amount>
<amount units="unit:eq">1.0</amount>
<identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>
• Quantities, including yield, are extracted
• Each entity is classified as an exact compound, definite reference, chemical class or fragment
• Reaction SMILES
• SMILES and InChIs for every structure-resolvable reagent/product
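A consumer of this output might read the amounts and identifiers back out as sketched below. The sketch uses the JDK's DOM parser on a simplified, namespace-free fragment modelled on the product element shown above. The `SAMPLE` string and the helper names are hypothetical, and the real files use namespaced elements such as `dl:reactionSmiles`, which would need namespace-aware parsing.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class CmlProductSketch {
    // Simplified fragment for illustration; values are copied from the slide's product element.
    static final String SAMPLE =
            "<product role=\"product\">"
          + "<amount units=\"unit:mmol\">1.8</amount>"
          + "<amount units=\"unit:mg\">690</amount>"
          + "<amount units=\"unit:percentYield\">85.0</amount>"
          + "<identifier dictRef=\"cml:smiles\""
          + " value=\"FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F\"/>"
          + "</product>";

    /** Parses an XML string into a DOM document, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Returns the text of the first amount element with the given units, or null. */
    static String amount(Document doc, String units) {
        NodeList amounts = doc.getElementsByTagName("amount");
        for (int i = 0; i < amounts.getLength(); i++) {
            Element e = (Element) amounts.item(i);
            if (units.equals(e.getAttribute("units"))) {
                return e.getTextContent();
            }
        }
        return null;
    }

    /** Returns the value of the first identifier element with the given dictRef, or null. */
    static String identifier(Document doc, String dictRef) {
        NodeList ids = doc.getElementsByTagName("identifier");
        for (int i = 0; i < ids.getLength(); i++) {
            Element e = (Element) ids.item(i);
            if (dictRef.equals(e.getAttribute("dictRef"))) {
                return e.getAttribute("value");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println("yield %: " + amount(doc, "unit:percentYield"));
        System.out.println("SMILES:  " + identifier(doc, "cml:smiles"));
    }
}
```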

Current status
• ~1 million reactions from US patent applications (2001-2013)
• ~1 million reactions from US patent grants (1976-2013)
• Over a million constitutionally distinct reactions, at minimum

Current status
Downloads: https://bitbucket.org/dan2097/patent-reaction-extraction/downloads

Identify Synthetic Routes
[Chart: occurrences (axis 0 to 700000) against number of steps, for intermediates and terminal products]
Number of steps  Intermediates  Terminal Products
1                197702         385149
2                103114         149445
3                56611          81837
4                31403          47579
5                17268          27670
6                9230           16619
7                5057           9320
8                2701           5263
9                1256           2511
10               639            1330
11               301            678
12               136            373
13               58             111
14               15             63
15               5              8
16               2              6
17               (none given)   5

Trends in Reaction Types
[Chart: Suzuki couplings as a percentage of reactions in a year, 1976-2013; y-axis 0.0% to 8.0%]

Trends In Solvent Use
[Chart: percentage of reactions in that year, 1976-2013; y-axis 0.0% to 20.0%. Solvents plotted: Tetrahydrofuran, Dichloromethane, Water, Dimethylformamide, Methanol, Ethyl acetate, Ethanol, 1,4-Dioxane, Toluene, Acetonitrile, Acetic acid, Chloroform, Acetone, Benzene]

Are solvents getting greener?
1976                     2013
Water (21%)              Tetrahydrofuran (15%)
Ethanol (11%)            Dichloromethane (14%)
Benzene (8%)             Water (13%)
Methanol (7%)            Dimethylformamide (10%)
Tetrahydrofuran (5%)     Methanol (8%)
Dichloromethane (4%)     Ethyl acetate (7%)
Dimethylformamide (4%)   Ethanol (5%)
Acetic acid (4%)         1,4-Dioxane (4%)
Chloroform (3%)          Toluene (3%)
Acetone (3%)             Acetonitrile (3%)
Total for top 10: 71%    Total for top 10: 82%

Conclusions
Open Source tools facilitate reuse and remixing of code
Open Data allows reuse in an infinite number of potential applications and analyses

Acknowledgements
• Albina Asadulina
• Peter Corbett
• Robert Glen
• David Jessop
• Lezan Hawizy
• Peter Murray-Rust
• Roger Sayle

Thank you for your time!
http://nextmovesoftware.com
http://nextmovesoftware.com/blog
daniel@nextmovesoftware.com
