TR5 Profiler and Post-Correction System  Ludwig-Maximilians-Universität München Centrum für Informations- und Sprachverarbeitung
TR5 Post-Correction System
Customizable user interface: OCR and image fragments; correction candidates; special functions; complete image; font size
View: OCR and image clippings
View: Original image
Word by word correction of text
Batch correction: efficient postcorrection
Postcorrection system: Evaluation (Ulrich Reffle, 4 July 2011)
Correction system
Why another postcorrection system?
Underlying language technology
Text and error profiles
Patterns, OCR errors
Historical variant and OCR error patterns
Historical variants: teil -> theil
OCR error patterns: theil -> iheil
Relative frequency: 2.9% of all ‘t’ are rewritten to ‘th’  Absolute frequency: Pattern was found 120 times in the current document.
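The relation between the two frequencies can be shown with a small sketch. Only the 120 occurrences and the 2.9% figure come from the slide; the function name and the total number of ‘t’ are assumptions chosen for illustration.

```python
def relative_frequency(pattern_count, left_side_count):
    """Share of all occurrences of the pattern's left side that were rewritten."""
    if left_side_count <= 0:
        raise ValueError("left_side_count must be positive")
    return pattern_count / left_side_count


# Absolute frequency from the slide: the pattern t -> th was found 120 times.
absolute = 120
# Hypothetical total number of 't' in the document (not given on the slide),
# chosen so that the ratio comes out at about 2.9%.
total_t = 4138

print(f"{relative_frequency(absolute, total_t):.1%}")  # prints 2.9%
```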
Occurrences of spelling variant “i->y”: +0.999771
Occurrences of OCR error “i->y”: +0.000224948
Computation of profile: initialization
[Diagram: OCR result w0, w1, w2, w3, …; initial global profile]
Computation of profile: global to local
[Diagram: OCR result w0, w1, w2, w3, …; initial global profile; local profile with entries of the form “… -> … -> …” for each token w0, w1, w2, w3]
Computation of profile: local to global
[Diagram: OCR result w0, w1, w2, w3, …; local profile for each token w0, w1, w2, w3; global profile]
Computation of profile: iteration
[Diagram: OCR result w0, w1, w2, w3, …; local profile and global profile]
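The bullet text of these slides did not survive, so the following is only a generic sketch of how an alternation between a global and a local profile could look, not the Profiler's actual algorithm. It assumes that each token comes with candidate interpretations, that each candidate is a tuple of pattern labels, and that unseen patterns get a small floor probability; all names and the toy data are hypothetical.

```python
from collections import defaultdict

FLOOR = 1e-6  # assumed probability for patterns not yet in the global profile


def to_local(tokens, global_profile):
    """Global to local: weight each candidate of a token by its patterns."""
    local = []
    for candidates in tokens:
        scores = []
        for patterns in candidates:
            score = 1.0
            for pattern in patterns:
                score *= global_profile.get(pattern, FLOOR)
            scores.append(score)
        total = sum(scores)
        local.append([s / total for s in scores])
    return local


def to_global(tokens, local):
    """Local to global: add up fractional pattern counts over all tokens."""
    counts = defaultdict(float)
    for candidates, weights in zip(tokens, local):
        for patterns, weight in zip(candidates, weights):
            for pattern in patterns:
                counts[pattern] += weight
    # Simplifying assumption: normalise by the number of tokens.
    return {p: c / len(tokens) for p, c in counts.items()}


# Toy input: one list of candidate interpretations per token.
tokens = [
    [(("hist", "t->th"),)],
    [(("hist", "t->th"),)],
    [(("hist", "i->y"),), (("ocr", "i->y"),)],  # ambiguous token
    [(("hist", "i->y"),)],
]

profile = {}  # empty initial global profile: every pattern starts at FLOOR
for _ in range(5):
    local = to_local(tokens, profile)
    profile = to_global(tokens, local)

print(local[2])  # weights of the two readings of the ambiguous token
```

In this toy run the ambiguous token starts with equal weights for both readings and moves toward the historical reading, because the unambiguous fourth token supports the historical pattern “i->y” in the global profile.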
Evaluation: Measures
(1) Global profiles: percentage of matches for the first 10 patterns in the ranked output lists. Two values: historical patterns, OCR patterns.
(2) OCR error detection: precision and recall for the OCR errors detected by the Profiler.
(3) Indirect evaluation (for instance, by means of the postcorrection system).
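Measures (1) and (2) can be sketched as follows. Precision and recall follow their standard definitions. The slide does not define “percentage of matches” precisely, so the top-10 overlap below is an assumed reading, and the function names are hypothetical.

```python
def precision_recall(detected, actual):
    """Precision and recall for a set of detected error positions."""
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def top_k_match(ranked_predicted, ranked_reference, k=10):
    """Share of the first k predicted patterns found among the first k reference patterns."""
    top_predicted = ranked_predicted[:k]
    top_reference = set(ranked_reference[:k])
    if not top_predicted:
        return 0.0
    return sum(p in top_reference for p in top_predicted) / len(top_predicted)


# Toy data: token positions flagged as OCR errors versus true error positions.
detected = {1, 2, 3, 4}
actual = {2, 3, 5, 6, 7, 8}
precision, recall = precision_recall(detected, actual)
print(precision, recall)

# Toy ranked pattern lists (hypothetical patterns).
print(top_k_match(["t->th", "i->y", "u->v"], ["t->th", "u->v", "ä->e"], k=3))
```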
Evaluation: Data preparation
(1) Deep evaluation: for each token of the evaluation document, the historical interpretation and the OCR interpretation have been manually annotated.
    ++ fully accurate
    -- manual work
(2) Shallow evaluation: the OCR'ed document is automatically aligned with its re-typed ground truth; for each token of the evaluation document, the historical and the OCR interpretation are automatically assigned from the ground truth.
    ++ no manual work
    -- not completely accurate
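The slide does not say how the automatic alignment is done. As a stand-in, the sketch below aligns two token lists with `difflib.SequenceMatcher` from the Python standard library; the function name and the sample tokens are assumptions, and tokens that cannot be paired one-to-one are simply left without a partner.

```python
from difflib import SequenceMatcher


def align_tokens(ocr_tokens, truth_tokens):
    """Pair OCR tokens with ground-truth tokens; None marks a missing partner."""
    matcher = SequenceMatcher(a=ocr_tokens, b=truth_tokens, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ocr_part = ocr_tokens[i1:i2]
        truth_part = truth_tokens[j1:j2]
        if tag == "equal" or len(ocr_part) == len(truth_part):
            # Runs of equal length are paired position by position.
            pairs.extend(zip(ocr_part, truth_part))
        else:
            # Splits, merges, insertions and deletions are not paired here.
            pairs.extend((o, None) for o in ocr_part)
            pairs.extend((None, t) for t in truth_part)
    return pairs


ocr = ["der", "iheil", "des", "buches"]
truth = ["der", "theil", "des", "buches"]
pairs = align_tokens(ocr, truth)

# Pairs whose two sides differ are candidate OCR errors.
errors = [(o, t) for o, t in pairs if o != t]
print(errors)
```

Leaving unequal runs unpaired is one reason such an alignment is, as the slide puts it, not completely accurate.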
Evaluation: Data
Deep: Eckartshausen, 100 pages; Briefkunst, 40 pages
Shallow: 5 books each from the 16th, 17th and 18th century
Evaluation: Eckartshausen
Graphical Evaluation: Eckartshausen
Graphical Evaluation: diacritics
[Figure panels: Hist. Var.; OCR]
Shallow Evaluation Results

                                                16th            17th            18th
HIST patterns, first 10                         60%             74%             78%
OCR patterns, first 10                          48%             70%             50%
Error detection precision                       95%             92%             81%
Error detection recall                          49%             43%             45%
Content word errors                             64%             44%             16%
Easy interactive correction, per 10,000 words   ≈ 3000 words    ≈ 1892 words    ≈ 720 words
Global Profile: Spelling variation patterns
Spelling variation profile
OCR Error Profile
 


Editor's notes

  1. DictModule name="modern" File="../dicts/modern.dic" max_ocr_errors=3 max_spelling_variants
  2. Cartapacio: map Jacarandina: slang