Chemical ontologies represent abstractions of chemical compounds, providing structural as well as functional and chemical property classifications. With automated patent text processing, there is also increasing interest in automatically classifying chemical compounds in patent documents to enable chemical searches based on known chemical classes.
Thus, we will present strategies to automatically classify chemical compounds based on their names and chemical structure or function, using a chemical ontology derived from the purely lexical ontologies MeSH and ChEBI but incorporating logic based on SMARTS and chemical calculations. We will describe the development of this ontology, which also comprises functional classifications and material science terms such as alloys and polymers.
Using our UIMA-based OCMiner annotation pipeline, over 90 million patent full text documents were processed to find mentions of chemical compounds, substances, chemical classes and chemical groups. In addition, the claimed uses of these compounds were extracted. Subsequently, chemical terms were classified by our chemical ontology, transforming more than 10 billion found chemical class mentions into an ontology-enabled, Lucene-based search index. This index was also used to analyze the frequency of found chemical classes per time period, giving indications of the focus of general chemical research activities and recent trends in patenting strategies.
An annotated data set covering 10 years of US patents is freely available for further investigation and can be used for training and for further developing the use, quality and interchangeability of chemical ontologies.
Evaluating Patent Full Text Documents with Chemical Ontologies
1. Evaluating patent full text
documents with chemical ontologies
OntoChem IT Solutions GmbH
Blücherstr. 24
06120 Halle (Saale)
Germany
Tel. +49 345 4780472
Fax: +49 345 4780471
mail: info(at)ontochem.com
2. Evaluating patent full text
documents with chemical ontologies
• spin-out from OntoChem GmbH
• started 1.7.2015
• 15 chemists, bioinformaticians, biologists,
linguists, pharmacists
• extracting knowledge from documents,
selling software & services
3. 3
Computer readable, formal representation of knowledge...
describe relationships between knowledge concepts:
aspirin „is a“ benzoic acid „is a“ carboxylic acid
acetyl salicylic acids
can be used to infer, extract, search, sort and analyse knowledge
What are Ontologies?
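The „is a“ chain on this slide can be sketched in a few lines of Python. This is only a toy illustration: the dictionary `IS_A` and the helper `ancestors` are hypothetical names, not part of any OntoChem software, and the three concepts are the ones shown above.

```python
from typing import Dict, List, Set

# Hypothetical toy ontology: each concept maps to its direct "is a" parents.
IS_A: Dict[str, List[str]] = {
    "aspirin": ["benzoic acid"],
    "benzoic acid": ["carboxylic acid"],
    "carboxylic acid": [],
}


def ancestors(concept: str, is_a: Dict[str, List[str]]) -> Set[str]:
    """Return every class reachable from `concept` by following "is a" edges."""
    found: Set[str] = set()
    stack = list(is_a.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(is_a.get(parent, []))
    return found


def belongs_to(concept: str, cls: str, is_a: Dict[str, List[str]]) -> bool:
    """Infer class membership, including indirect (transitive) membership."""
    return concept == cls or cls in ancestors(concept, is_a)


if __name__ == "__main__":
    # A search for "carboxylic acid" can now also retrieve aspirin.
    print(belongs_to("aspirin", "carboxylic acid", IS_A))
```

The inference step is what makes class-based search possible: a document that only mentions aspirin can still be found by a query for the broader class.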
4. 4
ChEBI Chemical Entities of Biological Interest
https://www.ebi.ac.uk/chebi/ has about 40,000 compounds manually classified:
MeSH – medical subject headings ... PubChem
Chemical Ontologies...
5. 5
SODIAC:
automated compound classification software
Structure based Ontology Development and Individual Assignment Center
ontology editor, OBO specification conformity
Definition of compound classes via SMARTS
chemical structure editor
sub-structure AND, OR and NOT logic compound to class assignment
chemistry error detection
chemical hierarchy construction
Classifying Chemistry: SODIAC
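A minimal sketch of how sub-structure AND, OR and NOT logic might assign compounds to classes. It is not SODIAC's implementation. It assumes that a separate SMARTS matcher has already reported which named sub-structure patterns occur in a compound; the pattern names, the class definitions and the helpers `has`, `AND`, `OR`, `NOT` and `classify` are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List

# A rule takes the set of sub-structure pattern names matched in a compound
# (assumed to come from a SMARTS matcher) and decides class membership.
Rule = Callable[[FrozenSet[str]], bool]


def has(pattern: str) -> Rule:
    return lambda matched: pattern in matched


def AND(*rules: Rule) -> Rule:
    return lambda matched: all(rule(matched) for rule in rules)


def OR(*rules: Rule) -> Rule:
    return lambda matched: any(rule(matched) for rule in rules)


def NOT(rule: Rule) -> Rule:
    return lambda matched: not rule(matched)


# Hypothetical class definitions, for illustration only.
CLASSES: Dict[str, Rule] = {
    "carboxylic acids": has("carboxylic_acid"),
    "hydroxy acids": AND(has("carboxylic_acid"), has("hydroxyl")),
    "halogen-free aromatic compounds": AND(
        has("aromatic_ring"), NOT(OR(has("chloro"), has("bromo")))
    ),
}


def classify(matched: FrozenSet[str]) -> List[str]:
    """Return the sorted names of all classes whose rule accepts the compound."""
    return sorted(name for name, rule in CLASSES.items() if rule(matched))


if __name__ == "__main__":
    print(classify(frozenset({"carboxylic_acid", "aromatic_ring"})))
```

Keeping the logic separate from the pattern matching means a class definition can be edited without touching the chemistry toolkit underneath.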
6. 6
SODIAC:
AND/OR logic to assign Vitamin C derivatives:
• described in different tautomeric forms in databases
• logic needed for classifying correct stereochemistry in substituted compounds
Classifying Chemistry: SODIAC
concept: Vitamin C derivatives
[Figure: sub-structure patterns for Vitamin C derivatives, combined with AND/OR logic]
7. 7
structural chemical ontologies are often not based on sub-structures !
Progesterone 19-Norprogesterone (4-8x more active)
class: Gestagens class: Gestagens>Progestins
Pregnane (female hormones) Androstane (male hormones)
class: Gonanes>Pregnanes class: Gonanes>Estranes
Classifying Chemistry: not straightforward...
drugbank & ChEBI:
Progestin,
a synthetic progestogen
parent
& SSS
not parent
but SSS
not parent
but SSS
ChEBI:
corticosteroid hormone
same family
different family
8. 8
Chemistry Ontologies
Organic chemistry
7,586 class concepts, 29,709 class terms
3,185 concepts linked to ChEBI concepts
2,465 concepts linked to MeSH concepts
68 million concepts linked to PubChem
Inorganic materials
52.4209 concepts, 56,332 terms
Groups-substituents-fragments
4,428 concepts, 12,754 terms
Substances
989 concepts, 3,522 terms
Polymers
2,361 concepts, 7,176 terms
14. 14
Understanding Patents with Ontologies
NLP for patents poses some unique challenges:
• multilingual
• poor OCR (optical character recognition)
• multi-disciplinary
• many: >90 million full text documents from >110 patent offices
• large: up to 500 pages, with sentences spanning >20 pages
• obscure: hand drawings, unclear language
15. 15
Understanding Patents
Collaboration with infoapps GmbH (Munich)
Standard full text data
US, EP, DE, WO,
AT, CH, BE, CA, ES, FR, GB, MA.
Standard full text data
AR, BR, CN, DK, FI, ID, EI, EN,
JP, KR, MX, MY, NL, NO, RU, SE,
TH, TW, VN.
Original full text data
Machine/human translation (EN)
AR, AT, BE, BR, CA, CH, CN, DE,
DK, EP, ES, FI, FR, ID, JP, KR,
MX, NL, NO, RU, SE, TH, TW,
VN, WO.
16. 16
OCMiner® UIMA Pipeline
• identify document type
• readers: picture PDF (OCR), Text PDF (PDF reader), XML doc (XML reader), Office doc (Office reader)
• document classifier
• XML detagger
• language detector
• normalize text
• tokenize text
• acronym abbrev detector
• person annotator
• document structure
• domain annotators 1…n
• chemistry annotator: dictionary, name-2-structure, formula & molpuzzler, class/group resolution, cleanup & rule combiner, coordinated entity resolution, context handler, NE confidence
• domain annotators 1…n
• domain annotators 1…n
• relationship extraction
• consumers: BRAT, index, XML
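The staged design of such a pipeline can be illustrated with a small Python sketch. UIMA itself is a Java framework, and nothing below reflects the actual OCMiner code: `Document`, `Annotation`, `normalize_text`, `make_dictionary_annotator` and `run_pipeline` are hypothetical stand-ins that only show how stages share one document object and add stand-off annotations to it.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Annotation:
    begin: int  # character offset where the mention starts
    end: int    # character offset just past the mention
    type: str   # annotation type, e.g. "compound" or "class"


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


# Every stage reads and updates the shared document in place.
Stage = Callable[[Document], None]


def normalize_text(doc: Document) -> None:
    """Collapse runs of whitespace so later offsets refer to clean text."""
    doc.text = " ".join(doc.text.split())


def make_dictionary_annotator(terms: Dict[str, str]) -> Stage:
    """Build a stage that annotates whole-word dictionary hits (case-insensitive)."""
    def annotate(doc: Document) -> None:
        for term, type_ in terms.items():
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, doc.text, flags=re.IGNORECASE):
                doc.annotations.append(Annotation(m.start(), m.end(), type_))
    return annotate


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    for stage in stages:
        stage(doc)
    return doc


if __name__ == "__main__":
    stages = [
        normalize_text,
        make_dictionary_annotator({"aspirin": "compound", "benzoic acid": "class"}),
    ]
    doc = run_pipeline(Document("Aspirin   is  a  benzoic acid."), stages)
    for a in doc.annotations:
        print(a.type, repr(doc.text[a.begin:a.end]))
```

Order matters in this design: normalization has to run before any annotator, because annotations store character offsets into the text.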
17. 17
BRAT (Goran Topić) file example:
PLoS One. 2014 Sep 30;9(9):e107477. doi: 10.1371/journal.pone.0107477. eCollection 2014.
Annotated chemical patent corpus: a gold standard for text mining.
Akhondi SA, Klenner AG, Tyrchan C, Manchala AK, Boppana K, Lowe D, Zimmermann M, Jagarlapudi SA, Sayle R,
Kors JA, Muresan S
Regular Names in Patents
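For readers who want to work with BRAT output, a minimal sketch of reading text-bound annotations from a BRAT standoff (.ann) file follows. It relies on the standoff layout of tab-separated ID, "Type start end" and covered text. The sample lines and the type labels `Compound` and `Class` are invented for illustration and are not taken from the corpus cited above; discontinuous spans are skipped for brevity.

```python
from typing import List, NamedTuple


class TextBound(NamedTuple):
    id: str
    type: str
    start: int
    end: int
    text: str


def parse_brat_text_bounds(ann: str) -> List[TextBound]:
    """Parse text-bound ("T") lines of BRAT standoff; other line kinds are ignored."""
    result: List[TextBound] = []
    for line in ann.splitlines():
        if not line.startswith("T"):
            continue  # relations (R), events (E), notes (#) etc. are not handled here
        ann_id, type_and_span, text = line.split("\t", 2)
        type_, span = type_and_span.split(" ", 1)
        if ";" in span:
            continue  # discontinuous span such as "0 5;16 23": skipped in this sketch
        start, end = span.split()
        result.append(TextBound(ann_id, type_, int(start), int(end), text))
    return result


# Hypothetical sample content of an .ann file.
SAMPLE = (
    "T1\tCompound 0 7\taspirin\n"
    "T2\tClass 13 25\tbenzoic acid\n"
    "R1\tis_Instance_Of Arg1:T1 Arg2:T2\n"
)

if __name__ == "__main__":
    for tb in parse_brat_text_bounds(SAMPLE):
        print(tb)
```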
21. 21
3 reasons:
patent claims are „ontological“
background knowledge helps to extract the meaning of named entities
end user, using knowledge classifications
which natural product compound class is useful to treat inflammation of the skin?
Ontologies – Why ?
22. 22
Patent claims are “ontological”
Patent classes & ad hoc classes:
e.g. chemical
„compounds according to claim 1“
„acyl-pyrrolopyridines“
any Markush structure, Patent classes etc
e.g. uses: „anti-infectives“ (e.g. antibacterial, antiviral, antiparasitic ... )
Chemical Ontologies – Why ?
23. 23
ontology based NLP to extract the meaning of named entities
• ontology based context sensitive Named Entity resolution
...glucose... ...glucose oxidase... ...glucose oxidase activity...
finally: ...inhibitor of glucose oxidase activity...
• ontology based anaphora & cataphora resolution
Tetrahydrofuran is a commonly used solvent in organic ...
This cyclic ether has a melting point of -108.4 °C
• ontology based fingerprints
classifying documents, e.g. into patent classes
Chemical Ontologies – Why ?
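The anaphora example on this slide ("This cyclic ether" referring back to tetrahydrofuran) can be sketched as a lookup against the ontology. This is a simplified assumption about how such a resolver might work, not the OCMiner method: the toy ontology and the functions `ancestors` and `resolve_anaphor` are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy ontology: concept -> direct "is a" parents.
IS_A: Dict[str, List[str]] = {
    "tetrahydrofuran": ["cyclic ether"],
    "cyclic ether": ["ether"],
    "ethanol": ["alcohol"],
}


def ancestors(concept: str, is_a: Dict[str, List[str]]) -> Set[str]:
    """All classes reachable from `concept` via "is a" edges."""
    found: Set[str] = set()
    stack = list(is_a.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(is_a.get(parent, []))
    return found


def resolve_anaphor(
    anaphor_class: str,
    preceding_mentions: List[str],
    is_a: Dict[str, List[str]],
) -> Optional[str]:
    """Pick the nearest preceding mention that belongs to the class named by the anaphor."""
    for mention in reversed(preceding_mentions):
        if mention == anaphor_class or anaphor_class in ancestors(mention, is_a):
            return mention
    return None


if __name__ == "__main__":
    # Mentions in text order before the phrase "This cyclic ether".
    mentions = ["ethanol", "tetrahydrofuran", "water"]
    print(resolve_anaphor("cyclic ether", mentions, IS_A))
```

Without the background ontology, the resolver would have no way to know that "water" and "ethanol" are poor candidates while tetrahydrofuran is a plausible one.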
24. 24
3 BRAT parts of one document:
Ontology Based Property Extraction
25. 25
Understanding Patent Claims Logic
high quality patent annotations need:
• annotated text corpus “Gold Set”
• background ontologies
Annotated between <chemistry> & <disease>: p=is_Active_Part_Of, i=is_Instance_Of.
LREC 2014: Creating a Gold Standard Corpus for the Extraction of Chemistry-Disease Relations
from Patents, Antje Schlaf, Claudia Bobach, Matthias Irmer
31. 31
End User: Patent Big Data Analytics
Hot Compounds, hot targets ?
L. Weber, T. Böhme, M. Irmer, Pharm. Pat. Analyst 2013, 2,
Ontology-based content analysis of US patent applications from 2001–2010
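The kind of frequency-per-time-period analysis described in the abstract can be sketched as below. The input format (pairs of publication year and chemical class), the sample values and the function names are assumptions made for illustration; no real patent counts are shown.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def class_counts_by_year(mentions: Iterable[Tuple[int, str]]) -> Dict[int, Counter]:
    """Count chemical class mentions separately for each publication year."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for year, chem_class in mentions:
        counts[year][chem_class] += 1
    return counts


def relative_frequency(counts: Dict[int, Counter], chem_class: str, year: int) -> float:
    """Share of one class among all class mentions of a year (0.0 if the year is empty)."""
    year_counts = counts.get(year, Counter())
    total = sum(year_counts.values())
    return year_counts[chem_class] / total if total else 0.0


if __name__ == "__main__":
    # Invented toy mentions, not real data.
    toy = [(2001, "steroids"), (2001, "steroids"), (2001, "polymers"), (2002, "polymers")]
    counts = class_counts_by_year(toy)
    for year in sorted(counts):
        print(year, relative_frequency(counts, "polymers", year))
```

Using a relative rather than an absolute frequency is a design choice here: it avoids mistaking overall growth in patent volume for a trend in one chemical class.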