11. OCR lexicon: example IMPACT <Demo Day BL, 12 July 2011>
1550-1750: song 820, rihte 818, theire 818, manye 818, sume 815, Do 814, Whiche 811, fyrst 811, while 811, Water 810, wt 809, shalbe 808, thingis 807, again 806, sona 806, wa 805, mode 804, work 802, between 801, law 799, moder 798, mis 798, softe 798
> 1900: television 418, electronic 375, video 194, hormone 176, jazz 162, eco 142, software 136, vitamin 128, movie 121, taxi 113, isotopic 108, electronics 95, radar 86, basically 71, sabotage 71, homozygote 70, psychedelic 67, phonemic 66, insulin 64, zap 64, antibody 61, fungicidal 61
12.
13.
<?xml version='1.0'?>
<!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <gloss></gloss>
    <POS>VRB</POS>
    <ne_label></ne_label>
    <language_id></language_id>
    <portmanteau_lemma_id></portmanteau_lemma_id>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <id>92141</id>
          <token_id></token_id>
          <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an , Vermits een Vrou niet op een Vrou verlieven kan,</quote>
          <derivation_id>0</derivation_id>
          <document_id>204</document_id>
          <start_pos>119</start_pos>
          <end_pos>124</end_pos>
        </attestation>
      </form_representation>
    </wordform>
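A lexicon in this shape links each modern lemma to its attested historical word forms. As a minimal sketch of how such a file might be read, the Python below parses a trimmed, closed-up copy of the entry on the slide (empty fields and the attestation are left out, and the closing tags are added so that it parses). The function name `forms_by_lemma` is hypothetical and is not part of any IMPACT tool.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the slide's entry, closed so that it is well-formed.
# Element names are taken from the slide; everything else is illustrative.
SAMPLE = """<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <POS>VRB</POS>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
      </form_representation>
    </wordform>
  </lexical_entry>
</lexicon>"""


def forms_by_lemma(xml_text):
    """Map each modern lemma to the written forms listed under it."""
    root = ET.fromstring(xml_text)
    result = {}
    for entry in root.iter("lexical_entry"):
        lemma = (entry.findtext("modern_lemma") or "").strip()
        forms = [wf.text.strip() for wf in entry.iter("written_form") if wf.text]
        result.setdefault(lemma, []).extend(forms)
    return result


print(forms_by_lemma(SAMPLE))
```

Inverting this mapping (written form to lemma) gives the lookup that retrieval needs: a query for the modern lemma can then also find the historical spelling.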
14. Tools for lexicon building and application of lexica
20. A number of results for Dutch and German
21. Ground truth data: Dutch
Type and genre                                # words
Gold Standard Book                            300k
Random Set Books                              340k
Random Set Staten Generaal (Legal Papers)     2.5M
Gold Standard Staten Generaal                 500k
Gold Standard Newspapers 1                    3.4M
Gold Standard Newspapers 2                    170k
Random Set Newspapers                         3.2M
total                                         13.1M
22. Lexicon coverage (1: ground truth books)
                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            46%             76%
Core general lexicon              56%             84%
1 + 2                             63%             89%
Expansion with corpus material    78%             95%
23. Lexicon coverage (2: GT newspapers 18th-19th C.)
                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            40%             83%
Core general lexicon              41%             84%
1 + 2                             51%             89%
Expansion with corpus material    62%             95%
24. Lexicon coverage (3: GT Staten Generaal 19th C.)
                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            51%             89%
Core general lexicon              47%             88%
1 + 2                             58%             93%
Expansion with corpus material    68%             97%
25. Lexicon coverage (4: GT Staten Generaal 20th C.)
                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            70%             93%
Core general lexicon              66%             93%
1 + 2                             76%             96%
Expansion with corpus material    81%             98%
26. Lexicon coverage (5: Genesis, 1637 bible)
                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            31%             61%
Core lexicon                      62%             83%
1 + 2                             65%             89%
Expansion with corpus material    87%             98.6%
27. Lexicon coverage (6: P.C. Hooft, histories)
                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            26%             67%
Core lexicon                      47%             88%
1 + 2                             50%             90%
Expansion with corpus material    58%             96%
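The coverage slides report two figures per lexicon. Under the usual reading, type coverage is the share of distinct word forms found in the lexicon, and token coverage is the share of running words. The sketch below shows one way to compute both; the function name, the toy sentence and the toy lexicon are assumptions for illustration, not the project's evaluation code.

```python
from collections import Counter


def coverage(tokens, lexicon):
    """Return (type coverage, token coverage) for a non-empty token list."""
    counts = Counter(tokens)
    known_types = [w for w in counts if w in lexicon]
    type_cov = len(known_types) / len(counts)
    token_cov = sum(counts[w] for w in known_types) / sum(counts.values())
    return type_cov, token_cov


# Toy data: 8 running words, 6 distinct forms, 4 of them in the lexicon.
tokens = "de eerste was de gevaarlykste om de verleiding".split()
lexicon = {"de", "eerste", "was", "om"}

type_cov, token_cov = coverage(tokens, lexicon)
print(f"type coverage {type_cov:.0%}, token coverage {token_cov:.0%}")
```

Frequent words such as "de" count once as a type but many times as tokens, which is why token coverage is higher than type coverage in every table above.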
28.
29. OCR results: word recognition rate
Dataset: DPO35
With ABBYY internal Dutch lexicon: 88.8%
With IMPACT lexicon for Dutch (case, hyphenation): 90.9%
With IMPACT lexicon for Dutch (case, hyphenation + long S problem): 93.5%
30. An example:
OCR at the beginning of the project:
A. De eerde was de gevaarlykflti om de verlei¬ ding aan 't Hof; de tweede de ftillie en veiligde ; de derde de zwaarde , daar hy byna drie millioenen harde en onbefchaafde Menfchen beftieren moest.
Results:
A. De eerste was de gevaarlykste om de verlei- ding aan 't Hof; de tweede de stilste en veiligste; de derde de zwaarste, daar hy byna drie millioenen harde en onbeschaafde Menschen bestieren moest.
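Several errors in the first transcription come from the long s (ſ) being read as f, as in "onbefchaafde" for "onbeschaafde". One simple way a lexicon can help, sketched below, is to try every f/s substitution and keep the variants that the lexicon contains. This illustrates the idea only and is not how the OCR engine or the IMPACT tools actually work; the function name and the toy lexicon are hypothetical.

```python
from itertools import product


def long_s_candidates(word, lexicon):
    """Return lexicon words reachable by reading some 'f' in word as 's'."""
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    found = set()
    # Try every combination of keeping 'f' or replacing it with 's'.
    for choice in product("fs", repeat=len(positions)):
        chars = list(word)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate in lexicon:
            found.add(candidate)
    return sorted(found)


# Toy lexicon (lowercased forms taken from the corrected text on the slide).
lexicon = {"onbeschaafde", "menschen", "bestieren", "hof"}

for ocr_word in ["onbefchaafde", "menfchen", "beftieren", "hof"]:
    print(ocr_word, "->", long_s_candidates(ocr_word, lexicon))
```

The search doubles with every f in the word, which is acceptable for single words but would need pruning for longer strings. A word such as "hof" is left alone because "hos" is not in the lexicon.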
31.
Per century: no. of word errors / reduction of error rate
Dictionary               16th century    18th century    19th century
No Lexicon               1306 / -        827 / -         2074 / -
Optimal Lexicon          756 / 42%       395 / 52%       612 / 70%
Modern Lexicon           1096 / 16%      501 / 39%       888 / 57%
W.Historical Lexicon     938 / 28%       481 / 42%       856 / 59%
Modern + Virtual H.L.    1011 / 25%      480 / 42%       849 / 59%
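The reduction percentages appear to be measured against the "No Lexicon" row of the same century, that is, (baseline errors - errors) / baseline errors. The short sketch below reproduces the "Optimal Lexicon" row under that assumption; the function name is hypothetical.

```python
def error_reduction(baseline_errors, errors):
    """Relative reduction in word errors, as a percentage of the baseline."""
    return 100 * (baseline_errors - errors) / baseline_errors


# (baseline errors with no lexicon, errors with the optimal lexicon)
optimal = {
    "16th century": (1306, 756),
    "18th century": (827, 395),
    "19th century": (2074, 612),
}

for century, (baseline, errors) in optimal.items():
    print(century, f"{error_reduction(baseline, errors):.0f}%")
```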
32.
33.
34. An indemnity shall be granted to the surfer…. … bikini …
35.
Editor's notes
This presentation is based on how the INL works with language. An electronic dictionary is not what we need for OCR and simple retrieval, but it is introduced anyway because we can (and do) use our dictionaries for lexicon construction.
This is what an XML-based electronic dictionary looks like.
This is the XML of the Oxford English Dictionary. The horizontal lines mark places where part of the structure has been folded in.
<ed> We need further explanation of what "lemma", "part of speech" and "morphology" mean. Lemma: the headword, like the entry in an ordinary dictionary. Morphology: morphological analysis is done for compounds and derivatives, that is, which parts are to be distinguished in a word, e.g. apple pie: apple + pie.
This is a small part of a computational lexicon (of a certain type; there are many types of computational lexica).
<ed> Again, unsure of what LEMMA means. Be, was, am, is, etc. are all forms of the same word BE (and that is an example of a lemma).
Two types of variation, examples for Dutch from the lexicon
To give an indication of possible spelling variants of the word "world" for English, a screenshot from the OED online...
These are some of the ways in which we are using computational lexica as building blocks.
These are results with a rather limited historical lexicon of German.