Automatic Chemical Annotation of Large Full-Text Patent Corpora. Pitfalls, Challenges and User Benefits

InfoChem GmbH © 2015 Dr. Josef Eiblmaier, ICIC 2015 Nice, October 18 – 21
1 / 37
Automatic chemical annotation of large full-text patent corpora. Pitfalls, challenges and user benefits
J. Eiblmaier, D. Geppert, L. Isenko, H. Saller
ICIC 2015 Nice, October 18 – 21

2 / 37
Outline
» Introduction: Chemical Named Entity Recognition - One size fits all?
» Show Case Projects
  • The Hard ‘n’ Heavy: Chemisches Zentralblatt
  • The Tricky: Wiley Smart Article
  • The Networked: Springer Chemistry Demonstrator
  • The Elegant: PATENTSCOPE
  • The Powerful: FIZ Karlsruhe Full-text Databases
» Conclusion
© cora / PIXELIO, www.pixelio.de

3 / 37

4 / 37
One Size Fits All?

5 / 37
Different Sources, Different Aims!
Stefan Emilius / pixelio.de
XML, PDF, HTML, TXT, SGML
[...] Wedged etched SiOz film 50 nm -} 2 nm with refractive index
measurement o~~~~~~~U-~ o » ~ ~ ~ ~ ~ ~ ~ ~ Psi (deg] oL-~~-L~~~~~
25 30 3S also be used for in situ thin film evaluation (4). Therefore the
ellipsometric heads are at- tached to the specific process chamber via opti -
cal windows. De- pending on growth/etch rates thicknesses can be
measured (end- point) with ac- curacies in the .••• - 81"",laUG" - " .. IU,.,..."t
0 1 1 0 1 1 2 0 Psi (deg[a) Theoretical Psi/Delta curves of an etched SiOz
film on silicon b) growth of a-Si on glass subs tate [...]

6 / 37
Discoverability

7 / 37
Interactivity
*Reinhard Neudert: Enhancing the User Experience for Wiley Chemistry Content, ICIC 2012, October 14 – 17, Berlin

8 / 37
Semantics
Valium OR Diazepam
Valium OR Diazepam OR Ansiolisina OR Diazemuls OR Relanium OR Stesolid
OR Apaurin OR Faustan OR Seduxen OR Sibazon OR Methyldiazepinone OR
Calmocitene OR Neurolytril OR Bialzepam OR Ceregulart OR Condition OR
Diazetard OR Liberetas OR Relaminal OR Serenamin OR Tranquirit OR Ansiolin
OR Apozepam OR Atensine OR Bensedin OR Calmpose OR Diacepan OR
Diazepan OR Dipezona OR Domalium OR Kiatrium OR Paranten OR Quetinil
OR Quiatril OR Quievita OR Renborin OR Ruhsitus OR Seduksen OR Serenack
OR Serenzin OR Stesolin OR Tensopam OR Horizon OR Lembrol OR Morosan
OR Saromet OR Sedipam OR Setonil OR Anxionil OR Benzopin OR Calmaven OR
Chuansuan OR Desconet OR Desloneg OR Diaceplex OR Diazepin OR
Gewacalm OR Jinpanfan OR Mentalium OR Metamidol OR Nixtensyn OR
Novodipam OR Pacitran OR Paralium OR Prozepam OR Psychopax OR
Radizepam OR Simasedan OR Trankinon OR Trazepam OR Valaxona OR
Valiquid OR Valuzepam OR Vanconin OR Antenex OR Arzepam OR Betapam
OR Diapine OR Diaquel OR 7-Chloro-1,3-dihydro-1-methyl-5-phenyl-2H-1,4-
benzodiazepin-2-one OR NCGC00178168-01 OR WLN: T67 GNV JN IHJ CG G1
KR OR 2H-1,4-Benzodiazepin-2-one, 7-chloro-1,3-dihydro-1-methyl-5-phenyl-
OR CPD000058398 OR SAM001246536 OR SMR000058398 OR 439-14-5 OR
7-Chloro-1-methyl-5-phenyl-3H-1,4-benzodiazepin-2(1H)-one OR 7-Chloro-1-
methyl-2-oxo-5-phenyl-3H-1,4-benzodiazepine OR 7-Chloro-1-methyl-5-phenyl-
2H-1,4-benzodiazepin-2-one OR C06948 OR D00293 OR 5-24-04-00300 OR
D003975 OR A3662/0155188 OR I06-0194 OR 1-Methyl-5-phenyl-7-chloro-1,3-
dihydro-2H-1,4-benzodiazepin-2-one OR 7-Chloro-1-methyl-5-3H-1,4-
benzodiazepin-2(1H)-one OR 7-chloro-1-methyl-5-phenyl-3H-1,4-benzodiazepin-
2-one OR DZP OR Dap OR Pax OR 11100-37-1 OR 53320-84-6 OR
InChI=1/C16H13ClN2O/c1-19-14-8-7-12(17)9-13(14)16(18-10-15(19)20)11-5-3-
2-4-6-11/h2-9H,10H2,1H
...
(361 Depositor-Supplied Synonyms in PubChem)
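The contrast between the two-term query and the expanded one can be illustrated with a small helper that turns a synonym list into a boolean OR query. This is only a sketch of what a searcher would have to assemble by hand when documents carry no chemical annotation; the function name and the quoting rule for multi-word terms are assumptions, not part of any system shown on the slides.

```python
def build_or_query(synonyms):
    """Join synonyms into one boolean OR query, quoting multi-word terms."""
    seen = set()
    terms = []
    for name in synonyms:
        name = name.strip()
        key = name.lower()
        if not name or key in seen:
            continue  # skip blanks and case-insensitive duplicates
        seen.add(key)
        # Quote terms containing spaces so they are searched as phrases
        # (assumed query syntax).
        terms.append(f'"{name}"' if " " in name else name)
    return " OR ".join(terms)


# A handful of the depositor-supplied synonyms listed on the slide.
synonyms = ["Valium", "Diazepam", "Stesolid", "valium", "439-14-5"]
query = build_or_query(synonyms)
print(query)
```

With all 361 synonyms the query becomes unwieldy, which is the motivation for searching by structure instead of by name.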

9 / 37
Linkability

10 / 37
Structurability

11 / 37

12 / 37
The Hard ‘n’ Heavy: Chemisches Zentralblatt
© Joachim Reisig / pixelio.de
» First and oldest abstracts journal in chemistry, covering chemical literature from 1830 to 1969
» Two million abstracts (900,000 pages, German)
» Named Entity Recognition project to create a web-based, language-independent structure database

13 / 37
The Hard ‘n’ Heavy: Chemisches Zentralblatt
» Print showing through from the back of the page
» Bad quality of original source: blotted and stained pages

14 / 37
Challenges OCR
[Figure: examples from 1830, 1870, 1910, 1930 and 1969]

15 / 37
Challenges OCR
» Ambiguous old fonts (h=b; c=e; ligatures)
» Spaced text
  • Specific rules, large German dictionaries and extensive training are applied to correct systematic errors of the standard OCR process
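One way such rule-plus-dictionary correction could work is sketched below. It is a minimal illustration, not the InfoChem implementation: the confusable pairs are the two named on the slide (h/b and c/e), the dictionary is a tiny hypothetical stand-in for a large German word list, and `collapse_spaced` and `correct_word` are names chosen here.

```python
import re
from itertools import product

# Character pairs named on the slide as ambiguous in old fonts (h=b, c=e).
CONFUSABLE = {"h": "hb", "b": "bh", "c": "ce", "e": "ec"}


def collapse_spaced(text):
    """Rejoin letter-spaced words such as 'S ä u r e' into 'Säure'."""
    return re.sub(
        r"\b(?:\w ){2,}\w\b",
        lambda m: m.group(0).replace(" ", ""),
        text,
    )


def correct_word(word, dictionary):
    """Return a dictionary word reachable by swapping confusable characters.

    The dictionary holds lower-case words. The input is returned unchanged
    when no candidate matches. The search is exponential in the number of
    confusable characters, which is acceptable for single words.
    """
    if not word or word.lower() in dictionary:
        return word
    options = [CONFUSABLE.get(ch, ch) for ch in word.lower()]
    for chars in product(*options):
        candidate = "".join(chars)
        if candidate in dictionary:
            # Restore the initial capital of German nouns.
            return candidate.capitalize() if word[0].isupper() else candidate
    return word


dictionary = {"schwefel", "natrium", "säure"}  # hypothetical mini word list
print(correct_word("Sebwcfel", dictionary))
print(collapse_spaced("S ä u r e und Salz"))
```

A production system would also need rules for ligatures and a way to rank several matching candidates, which this sketch leaves out.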

16 / 37
Challenges Annotation
» Obsolete German language
  • Schwefelsaures Natrium, Aurantiin
  • Chlorür, Bromür, Oxydul
» Historical names
  • Pelopeum → Columbium → Niobium
» Different spelling for the same name:
  • Dibrom… / Bibrom…
  • Ätzkali / Aetzkali
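The spelling and naming variants listed above suggest a normalization step before a name is looked up. The sketch below is one hedged way to do it: umlauts are transliterated so that Ätzkali and Aetzkali collapse to the same key, the prefix Bibrom is rewritten to Dibrom, and the two historical element names from the slide are mapped to Niobium. The function name and the example compound "Bibrombenzol" are hypothetical.

```python
import re

# Historical element names from the slide, mapped to the modern name.
HISTORICAL = {"pelopeum": "niobium", "columbium": "niobium"}

UMLAUTS = (("ä", "ae"), ("ö", "oe"), ("ü", "ue"))


def normalize_name(name):
    """Map spelling variants of a German chemical name to one lookup key."""
    key = name.strip().lower()
    # Transliterate umlauts so 'Ätzkali' and 'Aetzkali' share one key.
    for umlaut, digraph in UMLAUTS:
        key = key.replace(umlaut, digraph)
    # Old multiplying prefix 'Bibrom...' becomes 'Dibrom...'.
    key = re.sub(r"^bibrom", "dibrom", key)
    return HISTORICAL.get(key, key)


print(normalize_name("Ätzkali") == normalize_name("Aetzkali"))
print(normalize_name("Bibrombenzol"))  # hypothetical example name
print(normalize_name("Columbium"))
```

Transliterating toward the digraph form is the safer direction, since replacing every "ue" with "ü" would corrupt words in which the two letters are not an umlaut.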

17 / 37
Results
» 2.4 million chemical names with associated structure
  • 1 million unique names
  • 500,000 unique structures
  • Results linked to original source (PDF)
» Benchmark
  • Recall: 53.8%
  • Precision: 89.7%
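For readers who want to relate the two benchmark figures, the sketch below shows how precision and recall are computed from a gold standard and how they combine into an F1 score. The slide does not report an F-score; the value derived here from the reported recall and precision is roughly 0.67. The toy annotation sets are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(gold, predicted):
    """Compare predicted annotations against a manually annotated gold set."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# F1 implied by the benchmark figures on the slide.
reported_f1 = f1_score(precision=0.897, recall=0.538)
print(round(reported_f1, 3))

# Toy example with hypothetical (document, name) annotations.
gold = {("d1", "Ätzkali"), ("d1", "Niobium")}
predicted = {("d1", "Ätzkali"), ("d1", "Wasser")}
print(precision_recall_f1(gold, predicted))
```

High precision with moderate recall means that the names the system does link to a structure are mostly correct, while a sizeable share of names in the text is missed.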

18 / 37
The Networked: Springer Chemistry Demonstrator
© Stephanie Hofschlaeger / pixelio.de
» Large-scale automatic extraction of chemical entities from SpringerLink documents
» Joint definition of output formats (inline and standoff XML)
» Semantic enrichment of chemically relevant SpringerLink documents (> 2,700 titles)
» Creation of a chemical registry including all chemistry sources of Springer / InfoChem having structural information
» Implementation of an online demonstrator to interlink different data repositories via the chemical structure
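The idea of interlinking repositories via the chemical structure can be sketched as a registry keyed by a structure identifier. The sketch assumes an InChIKey as that identifier; the class name, repository names and record IDs are hypothetical and do not describe the Springer / InfoChem registry itself.

```python
from collections import defaultdict


class CompoundRegistry:
    """Maps one structure key to every record that mentions the compound."""

    def __init__(self):
        self._records = defaultdict(set)

    def register(self, structure_key, repository, record_id):
        """Record that `record_id` in `repository` contains this structure."""
        self._records[structure_key].add((repository, record_id))

    def links(self, structure_key):
        """Return all (repository, record_id) pairs for a structure."""
        return sorted(self._records.get(structure_key, set()))


registry = CompoundRegistry()
key = "AAOVKJBEBIDNHE-UHFFFAOYSA-N"  # InChIKey used as the structure key
registry.register(key, "fulltext", "doc-001")
registry.register(key, "reactions-db", "rxn-042")
print(registry.links(key))
```

Because every source registers under the same key, a hit in one repository leads directly to the matching records in the others.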

19 / 37
Central Compound Registry
[Diagram: structures from a reactions database and annotated structures from full-text sources, including 2,000 annotated documents, feed a central compound registry]

20 / 37

21 / 37

22 / 37
The Tricky: Wiley Smart Article
» Real-time bimodal extraction of chemical information
  • ChemDraw files
  • Full-text (five journals, two reference works)
» Merged into the Wiley XML
» KNIME-based workflow developed at InfoChem, installed at Wiley Hoboken
» Annotated categories: Chemicals, Reagents/Catalysts, Drugs, Reaction Types, Chemical Technology
» Challenge: merging of information from different domains and sources (overlapping, conflicting, missing concepts)
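The merging challenge can be made concrete with a small conflict-resolution sketch for overlapping annotations. The policy used here (prefer the longer span, then the higher-priority source) is an assumption chosen for illustration, not the rule set used in the Wiley workflow; the source names "chemdraw" and "text" are hypothetical labels for the two extraction modes.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str
    source: str


def merge_annotations(annotations, source_priority):
    """Keep a non-overlapping subset: longer spans win, then source priority."""
    rank = {source: i for i, source in enumerate(source_priority)}
    ordered = sorted(
        annotations,
        key=lambda a: (-(a.end - a.start), rank.get(a.source, len(rank)), a.start),
    )
    kept = []
    for ann in ordered:
        # Accept the annotation only if it overlaps nothing already kept.
        if all(ann.end <= k.start or ann.start >= k.end for k in kept):
            kept.append(ann)
    return sorted(kept, key=lambda a: a.start)


# Offsets refer to the text "palladium acetate catalyst".
candidates = [
    Annotation(0, 9, "Chemicals", "text"),
    Annotation(0, 17, "Reagents/Catalysts", "chemdraw"),
    Annotation(18, 26, "Chem. Technology", "text"),
]
merged = merge_annotations(candidates, source_priority=["chemdraw", "text"])
for ann in merged:
    print(ann)
```

Conflicting labels on identical spans and concepts missing from one source need further rules that this sketch does not cover.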

23 / 37
Text annotation: Chemistry enrichment workflow*

24 / 37
[Workflow diagram with the elements: XML, SDfile, ICScheme Processor, XML+, HTML]

25 / 37
(…) <enrichedObject relevance="primary" xml:id="asia266-eo-1234"
associatedDataRef="#asia266-sch-0016"> <label>6</label><mediaResource
mimeType="chemical/x-mol-file" href="enrich_out/asia266-eo-1234.sdf" alt="chemical
compound"/> </enrichedObject> (…)
(…) <infoAsset type="drugGenericName chemicalName" xml:id="asia573-info-0004">Himbacine</infoAsset>
(<link href="#asia573-eo-0001"/>),<link href="#bib1">1</link> has shown (…)
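A consumer of this enriched XML could pull out the annotated names and their links with a few lines of standard-library code. The sketch assumes the fragment is well-formed with straight quotes and wraps it in a hypothetical `<p>` root element so that it parses on its own; the element and attribute names are the ones shown in the excerpt.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predeclared xml: prefix under this namespace URI.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Excerpt from the slide, wrapped in a hypothetical <p> root element.
snippet = (
    '<p><infoAsset type="drugGenericName chemicalName" '
    'xml:id="asia573-info-0004">Himbacine</infoAsset> '
    '(<link href="#asia573-eo-0001"/>),<link href="#bib1">1</link> has shown</p>'
)


def extract_info_assets(xml_text):
    """Return id, type list and text of every infoAsset element."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": element.get(XML_NS + "id"),
            "types": element.get("type", "").split(),
            "text": "".join(element.itertext()),
        }
        for element in root.iter("infoAsset")
    ]


assets = extract_info_assets(snippet)
link_targets = [link.get("href") for link in ET.fromstring(snippet).iter("link")]
print(assets)
print(link_targets)
```

The first link target points at an `enrichedObject`, which in turn references the structure file, so name and structure can be joined through these IDs.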

26 / 37

27 / 37
The Elegant: Addition of chemical search capabilities to the WIPO PATENTSCOPE search system
» Chemical annotation of PATENTSCOPE full-text patent documents
» Replacement of detected compounds by their corresponding InChIKey (IUPAC International Chemical Identifier key)
» Recognition of graphical representations of chemical compounds in PATENTSCOPE documents
» Solr/Lucene based exact structure search
» Workflow system KNIME
» Extension of the PATENTSCOPE GUI with chemical structure search

28 / 37
(…) At the moment the surgical procedure starts, benzodiazepin, e.g.
diazepam, is administered in a dose of no more than 5 mg. (…)
(…) At the moment the surgical procedure starts, benzodiazepin, e.g.
@AAOVKJBEBIDNHE-UHFFFAOYSA-N@, is administered in a dose of
no more than 5 mg. (…)
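The before/after pair above can be reproduced with a simple substitution step. In this sketch a small name-to-InChIKey dictionary stands in for the actual name recognition and name-to-structure conversion, which the slides do not detail; the `@…@` delimiters and the diazepam key are taken from the example, while the function name is an assumption.

```python
import re

# Stand-in for name recognition plus name-to-structure conversion.
# Keys must be lower-case.
NAME_TO_INCHIKEY = {"diazepam": "AAOVKJBEBIDNHE-UHFFFAOYSA-N"}


def replace_with_inchikeys(text, name_to_key):
    """Replace every known compound name by its @InChIKey@ token."""
    if not name_to_key:
        return text
    # Try longer names first so they win over names they contain.
    names = sorted(name_to_key, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: "@" + name_to_key[m.group(1).lower()] + "@", text
    )


original = "benzodiazepin, e.g. diazepam, is administered in a dose of no more than 5 mg."
enriched = replace_with_inchikeys(original, NAME_TO_INCHIKEY)
print(enriched)
```

Since the InChIKey is a fixed-length token, an exact structure search reduces to an ordinary term lookup in a text index such as Solr/Lucene.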

29 / 37
PATENTSCOPE Documents
(…) At the moment the surgical procedure starts, benzodiazepin, e.g. diazepam, is administered in a dose of no more than 5 mg. (…)
Enriched PATENTSCOPE Documents
(…) At the moment the surgical procedure starts, benzodiazepin, e.g. @AAOVKJBEBIDNHE-UHFFFAOYSA-N@, is administered in a dose of no more than 5 mg. (…)
AAOVKJBEBIDNHE-UHFFFAOYSA-N

30 / 37

31 / 37

32 / 37
The Powerful: ‘Chemical Annotation for Online Systems’
» Joint research project FIZ Karlsruhe / InfoChem
» Chemical annotation of full-text patent databases (tens of millions of documents)
» Approx. 800,000 updated documents / week
» First show case: EP full-text patents

33 / 37
» First show case: EP full-text patents
  • Bibliographic data and full text of patent applications / granted patents published by the European Patent Office
  • 1978 – present, weekly updates
  • More than 4.05 million family records with more than 6.9 million publications (8/15)
» Main challenges of the annotation: performance, quality, integrability
» Integration of ICANNOTATOR into a Hadoop environment / FIZ proprietary workflow system (first results: "86 ms/document with Hadoop and 2.1 s without Hadoop")
» Joint definition of output formats based on FIZ XML DTD
» Joint creation of annotation guidelines and a benchmark set of 100 manually annotated patent documents
» Envisaged user scenarios:
  • searching structural chemical information from full-texts
  • linking of text and structural information
  • support evaluation of patent full-texts
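The performance figures can be put into perspective with a short back-of-the-envelope calculation. It assumes that the quoted times are average wall-clock times per document and combines them with the weekly update volume of roughly 800,000 documents stated for this project; the constant and function names are chosen here.

```python
MS_PER_DOC_HADOOP = 86        # first result quoted with Hadoop
S_PER_DOC_PLAIN = 2.1         # first result quoted without Hadoop
DOCS_PER_WEEK = 800_000       # approximate weekly update volume
HOURS_PER_WEEK = 7 * 24


def weekly_hours(seconds_per_doc, docs=DOCS_PER_WEEK):
    """Hours needed to annotate one week of updates at the given rate."""
    return seconds_per_doc * docs / 3600


speedup = S_PER_DOC_PLAIN / (MS_PER_DOC_HADOOP / 1000)
hadoop_hours = weekly_hours(MS_PER_DOC_HADOOP / 1000)
plain_hours = weekly_hours(S_PER_DOC_PLAIN)

print(f"speedup: {speedup:.1f}x")
print(f"with Hadoop: {hadoop_hours:.1f} h per weekly update")
print(f"without Hadoop: {plain_hours:.1f} h per weekly update")
print(f"fits into one week without Hadoop: {plain_hours <= HOURS_PER_WEEK}")
```

Under these assumptions the Hadoop setup is about 24 times faster and handles a weekly update in under a day, whereas the non-Hadoop rate would need more hours than a week contains.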

34 / 37
Outline
» Conclusion

35 / 37

36 / 37
Acknowledgements
» Wiley
  • Michael Forster
  • Reinhard Neudert
» FIZ Karlsruhe
  • Leni Helmes
  • Michael Schwantner
» WIPO
  • Christophe Mazenc
  • Paul Halfpenny
» The InfoChem Team
© P. Storz / PIXELIO, www.pixelio.de

37 / 37
