Impact Analysis of OCR Quality on
Research Tasks in Digital Archives
Myriam C. Traub, Jacco van Ossenbruggen, Lynda Hardman
Centrum Wiskunde & Informatica, Amsterdam
Context
✤ Research in collaboration with the National Library of the Netherlands
✤ Digital newspaper archive:
  ✤ 10 million pages covering 1618 to 1995
  ✤ +/- 1200 newspaper titles
✤ Available data: scanned image of the page, OCRed text and metadata records

Interviews
✤ Aim:
  ✤ Find out what types of research tasks scholars perform on digital archives
  ✤ Find out which quantitative / distant reading tasks are not (sufficiently) supported
✤ Interviewees: scholars with experience in performing historical research on digital archives

Categorization of research tasks
T1 find the first mention of a concept
T2 find a subset with relevant documents
T3 investigate quantitative results over time
T3.a compare quantitative results for two terms
T3.b compare quantitative results from two corpora
T4 tasks using external tools on archive data

I mostly use digital archives for exploration of a topic, selecting material for close reading (T1, T2) or external processing (T4).
OCR quality in digital archives / libraries is partly very bad.
I cannot quantify its impact on my research tasks.
I would not trust quantitative analyses (T3.a, T3.b) based on this data sufficiently to use them in publications.

Literature
✤ OCR quality is addressed from the perspective of the collection owner/OCR software developer
✤ Usability studies for digital libraries
✤ Robustness of search engines towards OCR errors
✤ Error removal in post-processing either systematically or intellectually

Two different perspectives of quality evaluation
“We care about average performance on representative subsets for generic cases.”
“I care about actual performance on my non-representative subset for my specific query.”

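The gap between the two perspectives can be made concrete with a small calculation. The sketch below uses made-up page counts and word accuracies (illustrative assumptions, not measurements from the KB archive) to show how a corpus-wide average can look acceptable while the subset a scholar actually queries is much worse.

```python
# Hypothetical subsets: (label, number of pages, word accuracy).
# All numbers are invented for illustration only.
subsets = [
    ("17th century", 100_000, 0.60),
    ("20th century", 900_000, 0.95),
]

total_pages = sum(pages for _, pages, _ in subsets)

# Page-weighted average accuracy over the whole (toy) corpus:
# the kind of figure reported from the collection owner's perspective.
average_accuracy = sum(pages * acc for _, pages, acc in subsets) / total_pages

# Accuracy on the non-representative subset a scholar of early
# newspapers would actually query.
accuracy_by_label = {label: acc for label, _, acc in subsets}
scholar_accuracy = accuracy_by_label["17th century"]

print(f"corpus average:      {average_accuracy:.3f}")
print(f"17th century subset: {scholar_accuracy:.3f}")
```

With these toy numbers the average is dominated by the large, well-recognised subset, so it says little about the pages behind a query on early material.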
Use case
✤ Aims:
  ✤ To study the impact on research tasks in detail
  ✤ Identify starting points for workarounds and/or further research
✤ Tasks T1 - T3

T1: Finding the first mention
✤ Key requirement: recall
✤ 100% recall is unrealistic
✤ Aim: Find out how a scholar can assess the reliability of results

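Why recall is the key requirement for T1 can be sketched with a toy example. The three spellings and years below are the ones shown on the timeline slides of this deck; treating these three pages as the complete set of relevant documents is an assumption made purely for illustration, and `exact_match_search` is a hypothetical stand-in for the archive's search.

```python
# Toy collection: (year, OCRed token) for pages assumed to mention the city.
# Years and spellings come from the timeline slides; the completeness of
# this set is an assumption.
relevant_pages = [
    (1619, "Amsterstam"),
    (1624, "Amfterdam"),
    (1642, "Amsterdam"),
]

def exact_match_search(query, pages):
    """Return the pages whose OCRed token equals the query exactly."""
    return [(year, token) for year, token in pages if token == query]

hits = exact_match_search("Amsterdam", relevant_pages)

# Recall: retrieved relevant pages divided by all relevant pages.
recall = len(hits) / len(relevant_pages)

first_mention_found = min(year for year, _ in hits)
true_first_mention = min(year for year, _ in relevant_pages)

print(f"recall: {recall:.2f}")
print(f"first mention found: {first_mention_found}, actual: {true_first_mention}")
```

A single missed page is enough to shift the reported first mention, which is why average retrieval quality is not a useful guarantee for this task.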
First mention of … in the OCRed newspaper archive of the KB?
[Timeline] 1618: earliest document; 1642: “Amsterdam”

Understanding potential sources of bias and errors
[Diagram: pipeline stages scanning, pre-processing, OCR, post-processing, ingestion]
✤ many details difficult to reconstruct
✤ essential to understand overall impact

First mention of … in the OCRed newspaper archive of the KB?
[Timeline] 1618: earliest document; 1624: “Amfterdam”; 1642: “Amsterdam”

OCR confidence values useful?
✤ Available for all items in the collection: page, word, character
✤ Only for highest ranked words / characters, other candidates missing
✤ The missing candidates would be required to estimate recall.

Confusion table
✤ Applied frequent OCR confusions to query
✤ 23 alternative spellings, but none of them yielded an earlier mention
✤ Problem: long tail

Variant      First mention
Amstcrdam    16-01-1743
Amstordam    01-08-1772
Amsttrdam    04-08-1705
Amslerdam    12-12-1673
Amslcrdam    20-06-1797
Amslordam    29-06-1813
Amsltrdam    13-04-1810
Amscerdam    17-10-1753
Amsccrdam    16-02-1816
Amscordam    01-11-1813
Amsctrdam    16-06-1823
Amfterdam    already found
Amftcrdam    17-08-1644
Amftordam    31-01-1749
Amfttrdam    26-11-1675
Amflerdam    03-03-1629
Amflcrdam    01-03-1663
Amflordam    05-03-1723
Amfltrdam    01-09-1672
Amfcerdam    22-04-1700
Amfccrdam    27-11-1742
Amfcordam    -
Amfctrdam    09-10-1880

correct    confused
s          f
n          u
e          c
n          a
t          l
t          c
h          b
l          i
e          o
e          t

full table available online:
http://persistent-identifier.org/?identifier=urn:nbn:nl:ui:18-23429
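The procedure on this slide can be sketched in a few lines: substitute each character of the query with its confused forms, then compare the first-mention dates of the resulting variants. The confusion pairs and dates below are copied from the slide; the function name `spelling_variants` and the overall structure are assumptions for illustration, not the authors' actual implementation.

```python
from datetime import datetime
from itertools import product

# Correct -> confused character pairs, copied from the excerpt on the slide.
CONFUSIONS = [
    ("s", "f"), ("n", "u"), ("e", "c"), ("n", "a"), ("t", "l"),
    ("t", "c"), ("h", "b"), ("l", "i"), ("e", "o"), ("e", "t"),
]

def spelling_variants(query, confusions=CONFUSIONS):
    """Return every misspelling reachable by substituting confused characters."""
    alternatives = {}
    for correct, confused in confusions:
        alternatives.setdefault(correct, []).append(confused)
    # Each position keeps its original character or takes one confused form.
    options = [[ch] + alternatives.get(ch, []) for ch in query]
    variants = {"".join(combo) for combo in product(*options)}
    variants.discard(query)  # the unmodified query is not an alternative
    return sorted(variants)

# First-mention dates per variant as listed on the slide (DD-MM-YYYY).
# "Amfterdam" (already found) and "Amfcordam" (no hit) carry no date.
FIRST_MENTION = {
    "Amstcrdam": "16-01-1743", "Amstordam": "01-08-1772", "Amsttrdam": "04-08-1705",
    "Amslerdam": "12-12-1673", "Amslcrdam": "20-06-1797", "Amslordam": "29-06-1813",
    "Amsltrdam": "13-04-1810", "Amscerdam": "17-10-1753", "Amsccrdam": "16-02-1816",
    "Amscordam": "01-11-1813", "Amsctrdam": "16-06-1823", "Amftcrdam": "17-08-1644",
    "Amftordam": "31-01-1749", "Amfttrdam": "26-11-1675", "Amflerdam": "03-03-1629",
    "Amflcrdam": "01-03-1663", "Amflordam": "05-03-1723", "Amfltrdam": "01-09-1672",
    "Amfcerdam": "22-04-1700", "Amfccrdam": "27-11-1742", "Amfctrdam": "09-10-1880",
}

variants = spelling_variants("Amsterdam")
dated = {
    v: datetime.strptime(FIRST_MENTION[v], "%d-%m-%Y")
    for v in variants if v in FIRST_MENTION
}
earliest_variant = min(dated, key=dated.get)

print(len(variants), "alternative spellings")
print("earliest dated variant:", earliest_variant, dated[earliest_variant].year)
```

Only three characters of "Amsterdam" appear in the excerpt (s, t, e), giving 2 × 3 × 4 = 24 combinations, one of which is the original query. The earliest dated variant is later than the 1624 hit for "Amfterdam", in line with the slide's observation. The number of variants multiplies with every additional confusable character, which is one way to read the "long tail" problem.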

First mention of … in the OCRed newspaper archive of the KB?
[Timeline] 1618: earliest document; 1619: “Amsterstam”; 1624: “Amfterdam”; 1642: “Amsterdam”

Update!
Corrections for 17th century newspapers were crowdsourced!
[Timeline] 1618: earliest document; 1619: “Amsterstam”; 1620: “Amsterdam”; 1624: “Amfterdam”; 1642: “Amsterdam”

… but why not 1619?

Task       | Confusion Matrix                                   | OCR Confidence Values                                          | Alternative Confidence Values
available: | sample only                                        | full corpus                                                    | not available
T1         | find all queries for x, impractical                | estimated precision, not helpful                               | improve recall
T2         | as above                                           | estimated precision, requires improved UI                      | improve recall
T3         | pattern summarized over set of alternative queries | estimates of corrected precision                               | estimates of corrected recall
T3.a       | warn for different susceptibility to errors        | as above, warn for different distribution of confidence values | as above
T3.b       | as above                                           | as above                                                       | as above

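The T3 entry for the confusion matrix ("pattern summarized over set of alternative queries") could look roughly like the sketch below. The per-year hit counts are invented for illustration and do not come from the archive; the variant spellings are taken from the confusion table slide.

```python
from collections import Counter

# Hypothetical hits per year for the query and two of its variants.
# All counts are made up for illustration.
hits_per_variant = {
    "Amsterdam": Counter({1640: 12, 1641: 15, 1642: 20}),
    "Amfterdam": Counter({1640: 30, 1641: 28, 1642: 9}),
    "Amflerdam": Counter({1640: 3, 1641: 2}),
}

# Summarise the pattern over the whole set of alternative queries.
combined = Counter()
for counts in hits_per_variant.values():
    combined.update(counts)

for year in sorted(combined):
    exact = hits_per_variant["Amsterdam"][year]
    print(f"{year}: exact query {exact:3d}, all variants {combined[year]:3d}")
```

In this toy data the exact query alone suggests a rising trend, while the combined counts do not, which is the kind of discrepancy a scholar would want to be warned about before publishing a T3 result.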
No silver bullet
✤ We propose novel strategies that solve part of the problem:
  ✤ critical attitude (awareness and better support)
  ✤ transparency (provenance, open source, documentation, …)
  ✤ alternative quality metrics (taking research context into account)

Conclusions
Problems
✤ Scholars see OCR quality as a serious problem, but cannot assess its impact
✤ OCR technology is unlikely to be perfect
✤ OCR errors are reported in terms of averages measured over representative samples
✤ Impact on a specific research task cannot be assessed based on average error metrics
Start of solutions
✤ Impact of OCR is different for different research tasks, so these tasks need to be made explicit
✤ OCR errors are often assumed to be random but are often partly systematic
✤ Tool pipelines and their limitations need to be transparent & better documented

Translate the established tradition of source criticism to the digital world and create a new tradition of tool criticism to systematically identify and explain technology-induced bias.
#toolcrit
