Computer Lexica in OCR and Retrieval Katrien Depuydt (Instituut voor Nederlandse Lexicologie, Leiden)
Overview IMPACT <Demo Day BL, 12 July 2011>
What is a computer lexicon?
Computer lexicon vs electronic dictionary (1)
An electronic dictionary is:
Dictionary XML (example)
Computer Lexicon vs Electronic Dictionary (2)
Lexica in IMPACT
The OCR lexicon
An OCR lexicon is
OCR lexicon: example

1550-1750            > 1900
song       820       television    418
rihte      818       electronic    375
theire     818       video         194
manye      818       hormone       176
sume       815       jazz          162
Do         814       eco           142
Whiche     811       software      136
fyrst      811       vitamin       128
while      811       movie         121
Water      810       taxi          113
wt         809       isotopic      108
shalbe     808       electronics    95
thingis    807       radar          86
again      806       basically      71
sona       806       sabotage       71
wa         805       homozygote     70
mode       804       psychedelic    67
work       802       phonemic       66
between    801       insulin        64
law        799       zap            64
moder      798       antibody       61
mis        798       fungicidal     61
softe      798
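The example above pairs words with frequencies for two periods. As a rough illustration of how such period-specific frequency lists could be derived from dated text, here is a minimal Python sketch. The function name `build_frequency_lexicon`, the period boundaries and the toy tokens are assumptions made for illustration; they are not taken from the IMPACT tools.

```python
from collections import Counter

def build_frequency_lexicon(dated_tokens, periods):
    """Count word frequencies per period.

    dated_tokens: iterable of (year, word) pairs (hypothetical input format).
    periods: dict mapping a period name to an inclusive (start, end) year range.
    """
    lexica = {name: Counter() for name in periods}
    for year, word in dated_tokens:
        for name, (start, end) in periods.items():
            if start <= year <= end:
                lexica[name][word] += 1
    return lexica

# Toy data, invented for this sketch.
tokens = [
    (1600, "shalbe"), (1610, "shalbe"), (1650, "fyrst"),
    (1950, "television"), (1960, "television"), (1970, "jazz"),
]
periods = {"1550-1750": (1550, 1750), "after 1900": (1901, 9999)}
lexica = build_frequency_lexicon(tokens, periods)
```

Keeping one counter per period means a word such as "television" never enters the early-period list, which is the contrast the two columns illustrate.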
The IR lexicon
<?xml version='1.0'?>
<!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <gloss></gloss>
    <POS>VRB</POS>
    <ne_label></ne_label>
    <language_id></language_id>
    <portmanteau_lemma_id></portmanteau_lemma_id>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <id>92141</id>
          <token_id></token_id>
          <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen: Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
          <derivation_id>0</derivation_id>
          <document_id>204</document_id>
          <start_pos>119</start_pos>
          <end_pos>124</end_pos>
        </attestation>
      </form_representation>
    </wordform>
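To show how an entry of this shape could be used for retrieval, the following Python sketch builds a lookup from historical written forms to their modern lemma and part of speech. It uses only the standard library. The `SAMPLE` string is a shortened copy of the entry in which the closing `</lexical_entry>` and `</lexicon>` tags have been added as an assumption so that it parses; the helper name `wordform_index` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed variant of the slide's entry (closing tags assumed).
SAMPLE = """<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <POS>VRB</POS>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <document_id>204</document_id>
          <start_pos>119</start_pos>
          <end_pos>124</end_pos>
        </attestation>
      </form_representation>
    </wordform>
  </lexical_entry>
</lexicon>"""

def wordform_index(xml_text):
    """Map each written form to a list of (modern lemma, POS) pairs."""
    root = ET.fromstring(xml_text)
    index = {}
    for entry in root.iter("lexical_entry"):
        lemma = entry.findtext("modern_lemma", default="").strip()
        pos = entry.findtext("POS", default="").strip()
        for form in entry.iter("written_form"):
            key = (form.text or "").strip()
            index.setdefault(key, []).append((lemma, pos))
    return index

index = wordform_index(SAMPLE)
```

With such an index, a query for the modern lemma "aantuilen" can be expanded to the attested historical form "tuyld", and a historical form found in a text can be mapped back to its lemma.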
Tools for lexicon building and application of lexica
Types of variation (spelling, inflection)

I
uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk

II
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled

(patterns to predict variation)
(a number are predictable with patterns, others need to be taken from a lexicon)
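The idea that some variants are predictable with patterns can be sketched with a few substitution rules applied to a modern form. The rules in `RULES` below are invented for illustration and are not the IMPACT pattern set; the function name `expand` is likewise hypothetical.

```python
from itertools import product

# Hypothetical rewrite rules: modern spelling segment -> possible historical spellings.
RULES = {
    "ui": ["ui", "uy", "uu"],
    "lijk": ["lijk", "lijck", "lyk", "lick"],
}

def expand(word, rules):
    """Generate every spelling obtained by applying the rules left to right."""
    segments = []
    i = 0
    while i < len(word):
        for source, alternatives in rules.items():
            if word.startswith(source, i):
                segments.append(alternatives)
                i += len(source)
                break
        else:
            # No rule applies here: keep the character unchanged.
            segments.append([word[i]])
            i += 1
    return {"".join(parts) for parts in product(*segments)}

variants = expand("uiterlijk", RULES)

# Three forms taken from list I above.
attested = {"uyterlijck", "uiterlyk", "wterlijck"}
predicted = variants & attested
```

With these rules, "uyterlijck" and "uiterlyk" are generated, while "wterlijck" is not. A form like that has to come from a richer rule set or directly from a lexicon. The rules also overgenerate, so predicted variants usually need to be checked against corpus evidence.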
Neil Fitzgerald, 7th July 2011
Computer lexica
Tools (more specific)
Ordinary words vs Names (NEs)
A number of results for Dutch and German
Ground truth data: Dutch

Type and genre                               # words
Gold Standard Book                           300k
Random Set Books                             340k
Random Set Staten Generaal (Legal Papers)    2.5M
Gold Standard Staten Generaal                500k
Gold Standard Newspapers 1                   3.4M
Gold Standard Newspapers 2                   170k
Random Set Newspapers                        3.2M
total                                        13.1M
Lexicon coverage (1: ground truth books)

                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            46%             76%
Core general lexicon              56%             84%
1 + 2                             63%             89%
Expansion with corpus material    78%             95%

Lexicon coverage (2: GT newspapers 18th-19th C.)

                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            40%             83%
Core general lexicon              41%             84%
1 + 2                             51%             89%
Expansion with corpus material    62%             95%

Lexicon coverage (3: GT Staten Generaal 19th C.)

                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            51%             89%
Core general lexicon              47%             88%
1 + 2                             58%             93%
Expansion with corpus material    68%             97%

Lexicon coverage (4: GT Staten Generaal 20th C.)

                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            70%             93%
Core general lexicon              66%             93%
1 + 2                             76%             96%
Expansion with corpus material    81%             98%

Lexicon coverage (5: Genesis, 1637 bible)

                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            31%             61%
Core lexicon                      62%             83%
1 + 2                             65%             89%
Expansion with corpus material    87%             98.6%

Lexicon coverage (6: P.C. Hooft, histories)

                                  Type coverage   Token coverage
Modern lexicon (e-Lex)            26%             67%
Core lexicon                      47%             88%
1 + 2                             50%             90%
Expansion with corpus material    58%             96%
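The coverage tables report two figures per lexicon. A minimal Python sketch of how they could be computed follows, assuming the usual definitions: type coverage is the share of distinct word forms found in the lexicon, and token coverage is the share of running words found in it. The slides do not spell these definitions out, and the toy text and lexicon are invented.

```python
from collections import Counter

def coverage(tokens, lexicon):
    """Return (type coverage, token coverage) of a lexicon over a token list."""
    counts = Counter(tokens)
    known_types = [word for word in counts if word in lexicon]
    type_cov = len(known_types) / len(counts)
    token_cov = sum(counts[word] for word in known_types) / sum(counts.values())
    return type_cov, token_cov

# Toy ground-truth text: 8 tokens, 4 distinct types.
tokens = ["de", "de", "de", "werelt", "wereld", "de", "weerelt", "wereld"]
lexicon = {"de", "wereld"}
type_cov, token_cov = coverage(tokens, lexicon)
```

Here the lexicon knows 2 of 4 types (50%) but 6 of 8 tokens (75%). Frequent words dominate the token count, which is consistent with token coverage being higher than type coverage in every table above.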
Evaluation of OCR

OCR results: word recognition rate

Dataset   With ABBYY internal Dutch lexicon   With IMPACT lexicon for Dutch (case hyphenation)   With IMPACT lexicon for Dutch (case hyphenation + long S problem)
DPO35     88.8%                               90.9%                                              93.5%
An example:

OCR at the beginning of the project:
A. De eerde was de gevaarlykflti om de verlei¬ ding aan 't Hof; de tweede de ftillie en veiligde ; de derde de zwaarde , daar hy byna drie millioenen harde en onbefchaafde Menfchen beftieren moest.

Results:
A. De eerste was de gevaarlykste om de verlei- ding aan 't Hof; de tweede de stilste en veiligste; de derde de zwaarste, daar hy byna drie millioenen harde en onbeschaafde Menschen bestieren moest.
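Several errors in the first transcription are a long s read as "f". As a simplified illustration of how a lexicon can help with this, the Python sketch below tries every combination of replacing "f" with "s" and accepts a candidate only if exactly one of them is in the lexicon. This is a post-correction sketch for illustration, not the method used inside the OCR engine. The small `lexicon` set is taken from the corrected text above, and the function names are hypothetical.

```python
from itertools import product

def long_s_candidates(word):
    """All spellings obtained by optionally replacing each 'f' with 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in word]
    return {"".join(chars) for chars in product(*options)}

def correct(word, lexicon):
    """Return the unique lexicon match among the candidates, else the word itself."""
    if word in lexicon:
        return word
    matches = sorted(long_s_candidates(word) & lexicon)
    return matches[0] if len(matches) == 1 else word

# Tiny lexicon taken from the corrected text in the example.
lexicon = {"onbeschaafde", "Menschen", "bestieren", "Hof"}

ocr_words = ["onbefchaafde", "Menfchen", "beftieren", "Hof", "gevaarlykflti"]
corrected = [correct(word, lexicon) for word in ocr_words]
```

The last word stays unchanged because none of its candidates is in the lexicon: it contains further recognition errors that f/s substitution alone cannot repair. The number of candidates doubles with each "f", which is acceptable for single words but worth bounding in a real pipeline.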
No. of word errors and reduction of error rate, by dictionary and century

Dictionary               16th century           18th century           19th century
                         Errors   Reduction     Errors   Reduction     Errors   Reduction
No Lexicon               1306     -             827      -             2074     -
Optimal Lexicon          756      42%           395      52%           612      70%
Modern Lexicon           1096     16%           501      39%           888      57%
W.Historical Lexicon     938      28%           481      42%           856      59%
Modern + Virtual H.L.    1011     25%           480      42%           849      59%
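The reduction figures can be read as the relative drop in word errors against the No Lexicon row. That reading is an assumption, since the slide does not state the formula, but it reproduces the Optimal Lexicon row. A short Python sketch with a hypothetical helper name:

```python
def error_reduction(baseline_errors, errors):
    """Relative reduction in word errors compared with the baseline run."""
    return (baseline_errors - errors) / baseline_errors

# Optimal Lexicon row versus the No Lexicon row, per century.
optimal = {
    "16th": round(100 * error_reduction(1306, 756)),
    "18th": round(100 * error_reduction(827, 395)),
    "19th": round(100 * error_reduction(2074, 612)),
}
```

For the 16th-century data this gives (1306 - 756) / 1306, about 42%, in line with the table.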
Languages in IMPACT
English in IMPACT
An indemnity shall be granted to the surfer.
bikini

Retrieval demonstrator

Editor's notes

  1. This presentation is based on how the INL works with language. An electronic dictionary is not what we need for OCR and simple retrieval, but it is introduced anyway because we can (and do) use our dictionaries for lexicon construction.
  2. This is what an XML-based electronic dictionary looks like.
  3. This is the XML of the Oxford English dictionary. The horizontal lines mark a place where part of the structure has been folded in.
  4. <ed> We need further explanation of what ‘lemma’, ‘part of speech’ and ‘morphology’ mean. Lemma: the headword, like the entry in an ordinary dictionary. Morphology: morphological analysis is done for compounds and derivations, identifying which parts are to be distinguished in a word, e.g. apple pie: apple + pie.
  5. This is a small part of a computational lexicon (of a certain type; there are many types of computational lexica).
  6. <ed> Again, unsure of what LEMMA means. Be, was, am, is, etc. are all forms of the same word BE (and that is an example of a lemma).
  7. Two types of variation, examples for Dutch from the lexicon
  8. To give an indication of possible spelling variants of the word ‘world’ for English, a screenshot from the OED online...
  9. These are some of the ways in which we are using Computer lexica as building blocks.
  10. The
  11. The
  12. The
  13. The
  14. The
  15. The
  16. These are results with a rather limited historical lexicon of German.
  17. Computational Natural Language Learning
  18. 322445 (fourth column, in the middle) 424979