Automated Extraction of Reactions from the
            Patent Literature




                        Daniel Lowe
     Unilever Centre for Molecular Science Informatics
                 University of Cambridge




                                                         1
Chemistry patent applications
• 100,000s of applications each year
[Figure: chemistry patent applications per year, 2000 to 2009; y-axis from 0 to 400,000. Source: World Intellectual Property Indicators, 2011 edition]

                                                                                                                                               2
3
The idea
XML patents -> Reaction Extraction System -> Extracted Reactions

                      4
Steps involved
•   Identifying experimental sections
•   Identifying chemical entities
•   Chemical name to structure conversion
•   Associating chemical entities with quantities
•   Assigning chemical roles
•   Atom-atom mapping


                                                    5
Building on existing projects




                                6
Archetypal experimental section
[Figure: an example experimental section with the following parts labelled]
• Section heading
• Section target compound
• Step identifier
• Step target compound
• Paragraph number
• Synthesis
• Workup
• Characterisation




                                               7
Jessop, D. M.; Adams, S. E.; Murray-Rust, P.
Mining Chemical Information from Open
Patents. Journal of Cheminformatics 2011, 3, 40.




                                        8
ChemicalTagger
• Tags words of text

• Parses tags to identify phrases

• Generates an XML parse tree
   – http://chemicaltagger.ch.cam.ac.uk/
   – Hawizy, L.; Jessop, D. M.; Adams, N.; Murray-Rust, P. ChemicalTagger: A tool for
     semantic text-mining in chemistry. J Cheminf 2011, 3, 17.




                                                                                        9
Tagging
•   Regex tagger: tags keywords e.g. “yield”, “mL”
•   OSCAR4 tagger: Finds names OSCAR4 believes to be chemical
    e.g. “2-methylpyridine”
•   OpenNLP: Tags parts of speech


Additional taggers:
• OPSIN tagger: Finds names OPSIN can parse
• Trivial chemical name tagger: Tags a few chemicals missed by
   the other taggers and cases that are partially matched by
   the regex tagger e.g. Dess-Martin reagent
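
As a rough illustration of what a regex tagger does, the sketch below assigns a tag to each token that matches a keyword pattern. The tags CD, NN-MASS and NN-AMOUNT appear in the sample output shown in this talk; NN-VOL, NN-YIELD and the patterns themselves are assumptions made for this example, not ChemicalTagger's actual rules.

```python
import re

# Hypothetical tag patterns; the real ChemicalTagger rule set is far larger.
TAG_PATTERNS = [
    ("NN-YIELD", re.compile(r"^yields?$", re.IGNORECASE)),
    ("NN-VOL", re.compile(r"^(?:ml|l|µl)$", re.IGNORECASE)),
    ("NN-MASS", re.compile(r"^(?:mg|g|kg)$", re.IGNORECASE)),
    ("NN-AMOUNT", re.compile(r"^(?:mmol|mol)$", re.IGNORECASE)),
    ("CD", re.compile(r"^\d+(?:\.\d+)?$")),
]


def regex_tag(tokens):
    """Return (token, tag) pairs; tokens that match no pattern get tag None."""
    tagged = []
    for token in tokens:
        tag = next((name for name, pat in TAG_PATTERNS if pat.match(token)), None)
        tagged.append((token, tag))
    return tagged


tagged = regex_tag(["DCM", "35", "ml", "606", "mg", "2.1", "mmol", "yield"])
for token, tag in tagged:
    print(token, tag)
```

Tokens left untagged here (such as "DCM") are the ones the chemical name taggers would be expected to pick up.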


                                                            10
Sample ChemicalTagger Output
     <MOLECULE>
       <OSCARCM>
         <OSCAR-CM>methyl</OSCAR-CM>
         <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
       </OSCARCM>
       <QUANTITY>
         <_-LRB->(</_-LRB->
         <MASS>
           <CD>606</CD>
           <NN-MASS>mg</NN-MASS>
         </MASS>
         <COMMA>,</COMMA>
         <AMOUNT>
           <CD>2.1</CD>
           <NN-AMOUNT>mmol</NN-AMOUNT>
         </AMOUNT>
         <COMMA>,</COMMA>
         <EQUIVALENT>
           <CD>1</CD>
           <NN-EQ>eq</NN-EQ>
         </EQUIVALENT>
         <_-RRB->)</_-RRB->
       </QUANTITY>
     </MOLECULE>
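
A parse tree like this can be read with any XML library. The following is a minimal sketch using Python's standard library; the function name read_molecule is made up for this example, and the bracket tokens from the sample are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample output (bracket tokens omitted).
SAMPLE = """<MOLECULE>
  <OSCARCM>
    <OSCAR-CM>methyl</OSCAR-CM>
    <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
  </OSCARCM>
  <QUANTITY>
    <MASS><CD>606</CD><NN-MASS>mg</NN-MASS></MASS>
    <COMMA>,</COMMA>
    <AMOUNT><CD>2.1</CD><NN-AMOUNT>mmol</NN-AMOUNT></AMOUNT>
    <COMMA>,</COMMA>
    <EQUIVALENT><CD>1</CD><NN-EQ>eq</NN-EQ></EQUIVALENT>
  </QUANTITY>
</MOLECULE>"""


def read_molecule(xml_text):
    """Return the chemical name and a dict of quantity type -> (value, unit)."""
    molecule = ET.fromstring(xml_text)
    # The name is split across several OSCAR-CM tokens; rejoin with spaces.
    name = " ".join(tok.text for tok in molecule.iter("OSCAR-CM"))
    quantities = {}
    for measure in molecule.find("QUANTITY"):
        value = measure.findtext("CD")
        if value is None:
            continue  # punctuation such as COMMA carries no number
        unit = next(child.text for child in measure if child.tag != "CD")
        quantities[measure.tag] = (float(value), unit)
    return name, quantities


name, quantities = read_molecule(SAMPLE)
print(name, quantities)
```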

                                                           11
Phrase Identification




                        12
Quantity Identification




                          13
Section/Step Parsing




                       14
Pyridine, pyridines and pyridine rings


 Entity                                 Type
 Pyridine                               Exact
 The pyridine / Pyridine from step 1    DefiniteReference
 Pyridines / A pyridine                 ChemicalClass
 Pyridine ring / Pyridyl                Fragment
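
To make the four categories concrete, here is a toy classifier that reproduces the table from surface cues alone. The rules (leading article, trailing "s", "ring" or "-yl" ending, "from step N") are assumptions chosen for this illustration; they are not the rules used by the extraction system and would misfire on many real chemical names.

```python
import re


def classify_entity(mention):
    """Toy heuristic mapping a chemical mention to one of four entity types."""
    text = mention.strip().lower()
    # Substituent-like mentions: "pyridine ring", "pyridyl".
    if re.search(r"\bring$", text) or text.endswith("yl"):
        return "Fragment"
    # References back to a specific, previously described compound.
    if text.startswith("the ") or re.search(r"\bfrom step \d+", text):
        return "DefiniteReference"
    # Indefinite article or plural suggests a class of compounds.
    if text.startswith(("a ", "an ")) or text.endswith("s"):
        return "ChemicalClass"
    return "Exact"


for mention in ["Pyridine", "The pyridine", "Pyridine from step 1",
                "Pyridines", "A pyridine", "Pyridine ring", "Pyridyl"]:
    print(mention, "->", classify_entity(mention))
```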




                                                                      15
Section/Step Parsing




Workup phrase types : Concentrate, Degass,
 Dry, Extract, Filter, Partition, Precipitate,
 Purify, Recover, Remove, Wash, Quench
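
One way the phrase types could be used is to keep chemicals mentioned in workup phrases out of the reactant list. The sketch below assumes each chemical arrives paired with the type of the phrase it was found in; the "Add" phrase type and the function name split_by_stage are hypothetical and used only for this example.

```python
# Workup phrase types as listed on the slide.
WORKUP_PHRASE_TYPES = frozenset({
    "Concentrate", "Degass", "Dry", "Extract", "Filter", "Partition",
    "Precipitate", "Purify", "Recover", "Remove", "Wash", "Quench",
})


def split_by_stage(phrases):
    """Split (phrase_type, chemical) pairs into synthesis and workup chemicals."""
    synthesis, workup = [], []
    for phrase_type, chemical in phrases:
        target = workup if phrase_type in WORKUP_PHRASE_TYPES else synthesis
        target.append(chemical)
    return synthesis, workup


# "Add" is an assumed name for a synthesis-stage phrase type.
phrases = [
    ("Add", "pentafluorophenol"),
    ("Add", "Et3N"),
    ("Wash", "water"),
    ("Dry", "sodium sulphate"),
]
synthesis_chemicals, workup_chemicals = split_by_stage(phrases)
print(synthesis_chemicals, workup_chemicals)
```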




                                                 16
Atom-mapping




               17
Example
Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate

To a solution of methyl 4-(chlorosulfonyl)benzoate (606
mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added
pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N
(540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred
at room temperature until all of the starting material was
consumed. The solvent was evaporated in vacuo and the
residue redissolved in ethyl acetate (10 ml), washed with
water (10 ml), saturated sodium hydrogen carbonate (10
ml), dried over sodium sulphate, filtered and evaporated to
yield the title compound as a white solid (690 mg, 1.8
mmol, 85%).
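
The extracted amounts allow a quick consistency check on the reported yield. The sketch below recomputes the percentage yield from the quoted molar amounts, assuming the 1 eq starting material is the limiting reagent; the function name is made up for this example.

```python
def percent_yield(product_mmol, limiting_reagent_mmol):
    """Percentage yield from molar amounts of product and limiting reagent."""
    if limiting_reagent_mmol <= 0:
        raise ValueError("limiting reagent amount must be positive")
    return 100.0 * product_mmol / limiting_reagent_mmol


# 1.8 mmol of title compound from 2.1 mmol of methyl 4-(chlorosulfonyl)benzoate.
computed = percent_yield(1.8, 2.1)
print(f"{computed:.1f}%")
```

The result, roughly 85.7%, is close to the reported 85%; the small gap is presumably due to rounding of the quoted amounts.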

                                                         18
Graphical Output




                   19
CML output
<reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-..
 <dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([..
 <productList>
  <product role="product">
   <molecule id="m0">
    <name dictRef="nameDict:unknown">title compound</name>
   </molecule>
   <amount units="unit:mmol">1.8</amount>
   <amount units="unit:mg">690</amount>
   <amount units="unit:percentYield">85.0</amount>
   <identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
   <identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H..
   <dl:entityType>definiteReference</dl:entityType>
   <dl:state>solid</dl:state>
  </product>
 </productList>
 <reactantList>
  <reactant role="reactant" count="1">
   <molecule id="m1">
    <name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name>
   </molecule>
   <amount units="unit:mmol">2.1</amount>
   <amount units="unit:mg">606</amount>
   <amount units="unit:eq">1.0</amount>
   <identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>

Slide annotations:
• Reaction SMILES
• Quantities including yield are extracted
• SMILES and InChIs for every structure-resolvable reagent/product
• Entity is classified as an exact compound, definite reference, chemical class or polymer
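
Output in this shape can be consumed with a namespace-aware XML parser. The sketch below reads one product element; the fragment is a simplified, self-contained copy of the slide excerpt (the dl: elements are dropped because their namespace URI is cut off on the slide), and read_species is a name invented for this example.

```python
import xml.etree.ElementTree as ET

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Simplified stand-in for one <product> element of the CML output.
FRAGMENT = """<product xmlns="http://www.xml-cml.org/schema" role="product">
  <molecule id="m0"><name>title compound</name></molecule>
  <amount units="unit:mmol">1.8</amount>
  <amount units="unit:mg">690</amount>
  <amount units="unit:percentYield">85.0</amount>
  <identifier dictRef="cml:smiles"
      value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
</product>"""


def read_species(xml_text):
    """Return (name, amounts keyed by unit, SMILES or None) for one species."""
    node = ET.fromstring(xml_text)
    name = node.findtext("cml:molecule/cml:name", namespaces=CML_NS)
    amounts = {
        amount.get("units").split(":", 1)[1]: float(amount.text)
        for amount in node.findall("cml:amount", CML_NS)
    }
    smiles = next(
        (ident.get("value")
         for ident in node.findall("cml:identifier", CML_NS)
         if ident.get("dictRef") == "cml:smiles"),
        None,
    )
    return name, amounts, smiles


species_name, amounts, smiles = read_species(FRAGMENT)
print(species_name, amounts, smiles)
```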



                                                                                                                                                  20
Evaluation
•   2008-2011 USPTO patent applications classified as containing
    organic chemistry: 65,034 documents.

•   484,259 atom-mapped reactions extracted

•   Adding the requirements that all the identified product
    molecules were resolvable to structures and that all
    reagents were believed to describe exact compounds
    leaves 424,621 reactions.

•   100 of these were selected for manual evaluation of quality
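
For scale, the figures above can be turned into two derived ratios. This is plain arithmetic on the numbers quoted on the slide; the variable names are made up for this example.

```python
documents = 65_034
atom_mapped_reactions = 484_259
exact_reactions = 424_621

# Share of atom-mapped reactions that survive the stricter filter.
retained_fraction = exact_reactions / atom_mapped_reactions
# Mean number of atom-mapped reactions per patent application.
reactions_per_document = atom_mapped_reactions / documents

print(f"retained: {retained_fraction:.1%}")
print(f"mean reactions per document: {reactions_per_document:.2f}")
```

So a little under 88% of the atom-mapped reactions pass the stricter filter, and each document yields about 7.4 reactions on average.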

                                                                  21
Reactions found
[Figure: number of patents with a given number of reactions (logarithmic axis, 1 to 100,000) against number of extracted reactions (0 to 1000)]




                                                                                                            22
Results
•   96% correctly identified the primary starting material and product
    whilst not misidentifying reagents that could be confused with the
    starting material

•   As compared to the 495 expected chemical entities there were 61 false
    positives and 16 false negatives

•   Only 4 of the 321 reagents (with quantities) did not have these
    quantities recognised and associated with the reagent

•   Association of quantities/yields with products was less successful: only
    48 of the 74 cases where such data was present were handled correctly
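
These counts can be restated as precision, recall and association rates. The sketch below assumes that true positives equal the expected entities minus the false negatives, which the slide does not state explicitly.

```python
def precision_recall(expected, false_positives, false_negatives):
    """Precision and recall, assuming true positives = expected - false negatives."""
    true_positives = expected - false_negatives
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / expected
    return precision, recall


# Chemical entity recognition: 495 expected, 61 false positives, 16 false negatives.
precision, recall = precision_recall(495, 61, 16)

# Quantity association rates for reagents and for products.
reagent_quantity_rate = (321 - 4) / 321
product_quantity_rate = 48 / 74

print(f"precision {precision:.3f}, recall {recall:.3f}")
print(f"reagent quantities {reagent_quantity_rate:.1%}, "
      f"product quantities {product_quantity_rate:.1%}")
```

Under that assumption, entity recognition comes out at roughly 89% precision and 97% recall, with quantities associated for about 99% of reagents but only about 65% of products.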

                                                                             23
Use Cases
• Reaction searching

• Analysing trends in reactions over time

• Reaction outcome prediction




                                            24
Example of reaction searching
C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)




     6 reactions found in 5 patents
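
The query above is a reaction pattern of the form reactants>agents>products, here an alkene plus diiodomethane (ICI) giving a cyclopropane, with atom maps 1 and 2 tying the alkene carbons to ring carbons. Real searching needs a cheminformatics toolkit to match the patterns against structures; the sketch below only splits the string and checks that mapped atoms are conserved, using function names invented for this example.

```python
import re

# Captures the map number of bracket atoms such as [CH2:2].
MAP_PATTERN = re.compile(r"\[[^\]]*:(\d+)\]")


def split_reaction(reaction):
    """Split 'reactants>agents>products' into three lists of components."""
    reactants, agents, products = reaction.split(">")

    def components(side):
        return [part for part in side.split(".") if part]

    return components(reactants), components(agents), components(products)


def atom_maps(components):
    """Collect the atom-map numbers used in a list of components."""
    return {int(n) for part in components for n in MAP_PATTERN.findall(part)}


query = "C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)"
reactants, agents, products = split_reaction(query)
print(reactants, agents, products)
print(atom_maps(reactants) == atom_maps(products))
```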


                                               25
Name I20110224.tarUS20110046406A1-20110224.ZIP0066




Text from US 2011/0046406 A1




                                                        26
Most lexical variants

1-ethyl-3-(dimethylaminopropyl)carbodiimide hydrochloride
EDCI hydrochloride
1-ethyl-3-[3-(dimethylamino)propyl]-carbodiimide hydrochloride
N-ethyl-N'-(3-dimethylamino-propyl)-carbodiimide hydrochloride
N-[3-(Dimethylamino) propyl]-N'-ethylcarbodiimide hydrochloride
1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl
N1-((Ethylimino)methylene)-N3,N3-dimethylpropane-1,3-diamine hydrochloride
N-(3-dimethylaminopropyl)-N'-ethylcarbodiimide hydrochloride
1-ethyl-3-dimethylaminopropyl-carbodiimide hydrochloride
1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl
1-[3(dimethylamino)propyl]-3-ethylcarbodiimide hydrochloride
1-(-3-dimethylamino-propyl)-3-ethylcarbodiimide hydrochloride
N-(3-Dimethylamino-1-propyl)-N'-ethylcarbodiimide hydrochloride
1-ethyl-3-(3-dimethylaminopropyl)carbodiimide monohydrochloride
1-(3-(Dimethylamino)propyl)-3-ethyl-carbodiimide hydrochloride

And 127 more!

675 chemicals had over 10 lexical variants!
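
Counting lexical variants amounts to grouping the distinct name strings that resolve to the same structure. The sketch below assumes each name has already been resolved to some structure key (for example an InChI); the keys shown are placeholders rather than real identifiers, and the function name is made up for this example.

```python
from collections import defaultdict


def count_lexical_variants(resolved_names):
    """Map each structure key to the number of distinct names seen for it."""
    variants = defaultdict(set)
    for name, structure_key in resolved_names:
        variants[structure_key].add(name)
    return {key: len(names) for key, names in variants.items()}


# Placeholder keys stand in for real structure identifiers.
observations = [
    ("EDCI hydrochloride", "KEY_EDC_HCL"),
    ("1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl", "KEY_EDC_HCL"),
    ("EDCI hydrochloride", "KEY_EDC_HCL"),  # repeated name counts once
    ("pentafluorophenol", "KEY_PFP"),
]
counts = count_lexical_variants(observations)
print(counts)
```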



                                                                                                      27
Most common solvents




                       28
Known Limitations
•   The first workup reagent is often erroneously classified as a
    reactant

•   Atom mapping produces mappings that are not necessarily
    representative of reaction mechanism and occasionally
    involve clearly incorrect atoms

•   Conditions from analogous reactions are not resolved

•   Reaction temperature and time are not captured



                                                                    29
Conclusions
• 424,621 exact atom-mapped reactions were
  extracted from 4 years of USPTO patent
  applications
• Evaluation indicates the reactions to be of
  generally good quality especially if the
  misidentification of workup reagents as
  reactants is not considered important
• All the code to extract reactions is open source:
  https://bitbucket.org/dan2097/patent-reaction-extraction

                                                        30
Acknowledgements
Unilever Centre:
Robert Glen
Peter Murray-Rust
Lezan Hawizy
David Jessop
Matthew Grayson

Indigo toolkit:
Mikhail Rybalkin
Savelyev Alexander
Dmitry Pavlov

SMARTS searching:
Roger Sayle

Boehringer Ingelheim for funding



                                                        31
Any Questions?




Email: daniel@nextmovesoftware.com


                                     32

Automated Extraction of Reactions from the Patent Literature

  • 1. Automated Extraction of Reactions from the Patent Literature Daniel Lowe Unilever Centre for Molecular Science Informatics University of Cambridge 1
  • 2. Chemistry patent applications • 100,000s applications each year 400000 350000 Chemistry patent applications per year 300000 250000 200000 150000 100000 50000 0 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 World Intellectual Property Indicators, 2011 edition 2
  • 3. 3
  • 4. The idea XML patents Reaction Extraction System Extracted Reactions 4
  • 5. Steps involved • Identifying experimental sections • Identifying chemical entities • Chemical name to structure conversion • Associating chemical entities with quantities • Assigning chemical roles • Atom-atom mapping 5
  • 6. Building on existing projects 6
  • 7. Archetypal experimental section Section heading Section target compound Step identifier Step target compound Paragraph number Synthesis Workup Characterisation 7
  • 8. Jessop, D. M.; Adams, S. E.; Murray-Rust, P. Mining Chemical Information from Open Patents. Journal of Cheminformatics 2011, 3, 40. 8
  • 9. ChemicalTagger • Tags words of text • Parses tags to identify phrases • Generate XML parse tree – http://chemicaltagger.ch.cam.ac.uk/ – Hawizy, L.; Jessop, D. M.; Adams, N.; Murray-Rust, P. ChemicalTagger: A tool for semantic text-mining in chemistry. J Cheminf 2011, 3, 17. 9
  • 10. Tagging • Regex tagger: tags keywords e.g. “yield”, “mL” • OSCAR4 tagger: Finds names OSCAR4 believes to be chemical e.g. “2-methylpyridine” • OpenNLP: Tags parts of speech Additional taggers: • OPSIN tagger: Finds names OPSIN can parse • Trivial chemical name tagger: Tags a few chemicals missed by the other taggers and cases that are partially matched by the regex tagger e.g. Dess-martin reagent 10
  • 11. Sample ChemicalTagger Output <MOLECULE> <OSCARCM> <OSCAR-CM>methyl</OSCAR-CM> <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM> </OSCARCM> <QUANTITY> <_-LRB->(</_-LRB-> <MASS> <CD>606</CD> <NN-MASS>mg</NN-MASS> </MASS> <COMMA>,</COMMA> <AMOUNT> <CD>2.1</CD> <NN-AMOUNT>mmol</NN-AMOUNT> </AMOUNT> <COMMA>,</COMMA> <EQUIVALENT> <CD>1</CD> <NN-EQ>eq</NN-EQ> </EQUIVALENT> <_-RRB->)</_-RRB-> </QUANTITY> </MOLECULE> 11
  • 15. Pyridine, pyridines and pyridine rings The pyridine / Pyridines / Pyridine ring / Entity Pyridine Pyridine from step 1 A pyridine Pyridyl Type Exact DefiniteReference ChemicalClass Fragment 15
  • 16. Section/Step Parsing Workup phrase types : Concentrate, Degass, Dry, Extract, Filter, Partition, Precipitate, Purify, Recover, Remove, Wash, Quench 16
  • 18. Example Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N (540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred at room temperature until all of the starting material was consumed. The solvent was evaporated in vacuo and the residue redissolved in ethyl acetate (10 ml), washed with water (10 ml), saturated sodium hydrogen carbonate (10 ml), dried over sodium sulphate, filtered and evaporated to yield the title compound as a white solid (690 mg, 1.8 mmol, 85%). 18
  • 20. CML output <reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-.. <dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([.. <productList> <product role="product"> Reaction SMILES <molecule id="m0"> <name dictRef="nameDict:unknown">title compound</name> </molecule> <amount units="unit:mmol">1.8</amount> <amount units="unit:mg">690</amount> Quantities including yield are extracted <amount units="unit:percentYield">85.0</amount> <identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/> <identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H.. <dl:entityType>definiteReference</dl:entityType> <dl:state>solid</dl:state> SMILES and InChIs for every structure </product> resolvable reagent/product </productList> <reactantList> Entity is classified as an exact compound, <reactant role="reactant" count="1"> <molecule id="m1"> definite reference, chemical class or polymer <name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name> </molecule> <amount units="unit:mmol">2.1</amount> <amount units="unit:mg">606</amount> <amount units="unit:eq">1.0</amount> <identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/> 20
  • 21. Evaluation • 2008-2011 USPTO patent applications classified as containing organic chemistry  65,034 documents. • 484,259 reactions atom mapped reactions extracted • Adding the additional requirements that all the identified product molecules were resolvable to structures and that all reagents were believed to describe exact compounds  424,621 reactions. • 100 of these were selected for manual evaluation of quality 21
  • 22. Reactions found 100,000 10,000 Patents with given number of reactions 1,000 100 10 1 0 200 400 600 800 1000 Number of extracted reactions 22
  • 23. Results • 96% correctly identified the primary starting material and product whilst not misidentifying reagents that could be confused with the starting material • As compared to the 495 expected chemical entities there were 61 false positives and 16 false negatives • Only 4 of the 321 reagents (with quantities) did not have these quantities recognised and associated with the reagent • Association of quantities/yields with products was less successful, 48 out of the 74 cases where such data was present were handled 23
  • 24. Use Cases • Reaction searching • Analysing trends in reactions over time • Reaction outcome prediction
  • 25. Example of reaction searching • Query: C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1) • 6 reactions found in 5 patents
  • 27. Most lexical variants • 675 chemicals had over 10 lexical variants! • Variants shown for one chemical (and 127 more!):
      1-ethyl-3-(dimethylaminopropyl)carbodiimide hydrochloride
      EDCI hydrochloride
      1-ethyl-3-[3-(dimethylamino)propyl]-carbodiimide hydrochloride
      N-ethyl-N'-(3-dimethylamino-propyl)-carbodiimide hydrochloride
      N-[3-(Dimethylamino) propyl]-N'-ethylcarbodiimide hydrochloride
      1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl
      N1-((Ethylimino)methylene)-N3,N3-dimethylpropane-1,3-diamine hydrochloride
      N-(3-dimethylaminopropyl)-N'-ethylcarbodiimide hydrochloride
      1-ethyl-3-dimethylaminopropyl-carbodiimide hydrochloride
      1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl
      1-[3(dimethylamino)propyl]-3-ethylcarbodiimide hydrochloride
      1-(-3-dimethylamino-propyl)-3-ethylcarbodiimide hydrochloride
      N-(3-Dimethylamino-1-propyl)-N'-ethylcarbodiimide hydrochloride
      1-ethyl-3-(3-dimethylaminopropyl)carbodiimide monohydrochloride
      1-(3-(Dimethylamino)propyl)-3-ethyl-carbodiimide hydrochloride
  • 29. Known Limitations • The first workup reagent is often erroneously classified as a reactant • Atom mapping produces mappings that are not necessarily representative of the reaction mechanism and occasionally involve clearly incorrect atoms • Conditions from analogous reactions are not resolved • Temperature/time for reactions to occur are not captured
  • 30. Conclusions • 424,621 exact atom-mapped reactions were extracted from 4 years of USPTO patent applications • Evaluation indicates the reactions to be of generally good quality, especially if the misidentification of workup reagents as reactants is not considered important • All the code to extract reactions is open source: https://bitbucket.org/dan2097/patent-reaction-extraction
  • 31. Acknowledgements • Unilever Centre: Robert Glen, Peter Murray-Rust, Lezan Hawizy, David Jessop, Matthew Grayson • Indigo toolkit: Mikhail Rybalkin, Savelyev Alexander, Dmitry Pavlov • SMARTS searching: Roger Sayle • Boehringer Ingelheim for funding
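
The entity counts on the Results slide (slide 23) can be restated as precision and recall. The sketch below assumes that the 495 expected entities form the gold standard, so that true positives are 495 minus the 16 false negatives; the function name `precision_recall` is illustrative and is not part of the extraction system.

```python
def precision_recall(expected, false_pos, false_neg):
    """Return (precision, recall) from a gold-standard size and error counts."""
    true_pos = expected - false_neg              # expected entities that were found
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / expected
    return precision, recall


# Figures quoted on slide 23.
precision, recall = precision_recall(expected=495, false_pos=61, false_neg=16)
reagent_quantity_rate = (321 - 4) / 321   # reagents whose quantities were recognised
product_quantity_rate = 48 / 74           # products whose quantities/yields were handled

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"reagent quantities={reagent_quantity_rate:.3f} "
      f"product quantities={product_quantity_rate:.3f}")
```

Under that assumption, the false positives (mostly workup reagents, per the Known Limitations slide) lower precision more than the false negatives lower recall.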

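
The CML output on slide 20 can be read back with a standard XML parser. The following is a minimal sketch using Python's `xml.etree.ElementTree`; the `SAMPLE` fragment is a trimmed, well-formed stand-in modelled on the slide (the `dl:` elements are left out because their namespace URI is truncated there), and `summarise_products` is a hypothetical helper, not part of the released code.

```python
import xml.etree.ElementTree as ET

# Default namespace as shown on slide 20; the prefix "cml" is chosen here for lookups.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Illustrative fragment only: a cut-down version of the slide's product entry.
SAMPLE = """
<reaction xmlns="http://www.xml-cml.org/schema">
  <productList>
    <product role="product">
      <molecule id="m0"><name>title compound</name></molecule>
      <amount units="unit:mmol">1.8</amount>
      <amount units="unit:mg">690</amount>
      <amount units="unit:percentYield">85.0</amount>
      <identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
    </product>
  </productList>
</reaction>
"""


def summarise_products(cml_text):
    """Return one dict per <product>: name, SMILES (if any) and amounts keyed by unit."""
    root = ET.fromstring(cml_text.strip())
    products = []
    for product in root.findall("cml:productList/cml:product", CML_NS):
        amounts = {
            amount.get("units"): float(amount.text)
            for amount in product.findall("cml:amount", CML_NS)
        }
        smiles = [
            ident.get("value")
            for ident in product.findall("cml:identifier", CML_NS)
            if ident.get("dictRef") == "cml:smiles"
        ]
        products.append({
            "name": product.findtext("cml:molecule/cml:name", namespaces=CML_NS),
            "smiles": smiles[0] if smiles else None,
            "amounts": amounts,
        })
    return products


for entry in summarise_products(SAMPLE):
    print(entry["name"], entry["amounts"], entry["smiles"])
```

Keying amounts by their `units` attribute keeps the yield (`unit:percentYield`) alongside the molar and mass quantities, which mirrors how the slide presents them for a single product.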
Editor's notes

  1. Manual abstraction of the precise details of reactions from this many documents would be expensive.
  2. How can one get access to patents? Google Patents offers all USPTO patents from 2001 onwards as XML, including images and ChemDraw files. Older patents are available with just the text back to 1976, with OCRed text back to 1920, and back to 1790 if one performs the OCR oneself.
  3. This problem can be broken down into several sub-problems.
  4. Fortunately we don't have to start from scratch: many open source toolkits exist to help with these tasks. OPSIN (name to structure conversion), OSCAR4 (chemical entity recognition) and ChemicalTagger (tagging and parsing of experimental chemistry text).
  5. This is what a typical experimental section from a patent looks like. We need to identify such sections, link the heading with the paragraphs and preferably distinguish synthesis reagents from workup reagents.
  6. Headings/paragraphs can be extracted directly from the XML. The classifier uses the probabilities of words being present in an experimental chemistry section versus a standard paragraph. The language in experimental sections is quite repetitive, so this works well. In some cases a heading may not be annotated as such in the XML; this can be detected in many cases and the heading processed as if it were a discrete element.
  7. This work relies heavily on ChemicalTagger, and significant improvements have been made to ChemicalTagger as part of this project to improve its performance and the range of concepts recognised. Hence a description of the system would not be complete without also explaining what ChemicalTagger does.
  8. For this project we also use the following taggers. These tags can then be parsed to yield…
  9. Quantities have been recognised, marked up and associated with a molecule. Where certain key words are identified, phrases can be identified…
  10. A few phrase types are identified directly by the grammar, e.g. a chemical in a chemical is a dissolve phrase.
  11. Will be associated with the identified compound. As you can see a compound doesn’t have to contain a chemical entity. (title compound as a white solid)
  12. Uses a combination of textual clues and OPSIN's classification.
  13. Phrases can be classified as workup by phrase type, e.g. extraction, purification. As the yielded compound and characterisation are often conjoined, rather than explicitly identifying the workup, compounds commonly associated with characterisation are marked up as false positives by regexes. A single paragraph may have multiple blocks of synthesis and workup. Structure-aware role assignment heuristically assigns roles such as solvent and catalyst, e.g. using lists of known solvents/catalysts and properties such as containing a transition metal.
  14. Perform a sanity check on the reaction, e.g. that it has a product and at least 2 reagents. Attempt to find a mapping where all product atoms can be accounted for.
  15. Here is an example of an experimental section.
  16. Occasionally the system identifies a compound as a reactant that was mentioned only in the context of the current reaction being performed in an analogous way to the reaction that produced it. False positives arise from workup reagents being classified as reactants and from clear errors. Product information is often not explicitly associated with the product.
  17. Simmons–Smith reaction for conversion of a terminal allyl group to a cyclopropane group found 6 hits in 5 patents.
  18. It should be noted that nowhere in this text and indeed in the whole patent is the name of the reaction mentioned, this is quite common.
  19. 675 chemical entities had over 10 lexical variants
  20. Top 10
  21. This is due to the text typically just saying that the substance is added without further specification of its purpose
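
Note 6 describes a classifier based on the probabilities of words appearing in experimental sections versus standard paragraphs. The sketch below shows one way such a score could be computed, as a naive Bayes style log-odds sum with add-one smoothing. This is an assumed stand-in: the notes do not say which formulation the system actually uses, and the training paragraphs and function names are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep alphabetic word tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def train(experimental, other):
    """Count word occurrences in each class of training paragraphs."""
    exp_counts = Counter(w for p in experimental for w in tokenize(p))
    oth_counts = Counter(w for p in other for w in tokenize(p))
    vocab = set(exp_counts) | set(oth_counts)
    return exp_counts, oth_counts, vocab


def log_odds(paragraph, model):
    """Sum per-word log probability ratios; a positive score favours 'experimental'."""
    exp_counts, oth_counts, vocab = model
    exp_total = sum(exp_counts.values()) + len(vocab)   # add-one smoothing
    oth_total = sum(oth_counts.values()) + len(vocab)
    score = 0.0
    for word in tokenize(paragraph):
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        p_exp = (exp_counts[word] + 1) / exp_total
        p_oth = (oth_counts[word] + 1) / oth_total
        score += math.log(p_exp / p_oth)
    return score


# Toy training data, invented for this example only.
model = train(
    experimental=[
        "The mixture was stirred at room temperature and concentrated in vacuo.",
        "The residue was purified by chromatography to give the title compound.",
    ],
    other=[
        "The present invention relates to compounds useful as inhibitors.",
        "In another embodiment the invention provides a pharmaceutical composition.",
    ],
)

print(log_odds("The solution was stirred and concentrated.", model))
print(log_odds("The invention provides compounds.", model))
```

Because experimental text is repetitive, a handful of frequent words ("stirred", "concentrated", "purified") dominate the score, which is consistent with the note's remark that this simple approach works well.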