Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.
USE CASE STUDY
We investigated OCR-related issues of concrete queries on a
digital newspaper archive. We tried to better u...
Nächste SlideShare
Wird geladen in …5
×
Nächste SlideShare
Impact Analysis of OCR Quality on Research Tasks in Digital Archives
Weiter
Herunterladen, um offline zu lesen und im Vollbildmodus anzuzeigen.

1

Teilen

Herunterladen, um offline zu lesen

Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Herunterladen, um offline zu lesen

Humanities scholars increasingly rely on digital archives for their research instead of time-consuming visits to physical archives. This shift in research method has the hidden cost of working with digitally processed historical documents: how much trust can a scholar place in noisy representations of source texts? In a series of interviews with historians about their use of digital archives, we found that scholars are aware that optical character recognition (OCR) errors may bias their results. They were, however, unable to quantify this bias or to indicate what information they would need to estimate it. This, however, would be important to assess whether the results are publishable. Based on the interviews and a literature study, we provide a classification of scholarly research tasks that gives account of their susceptibility to specific OCR- induced biases and the data required for uncertainty estimations. We conducted a use case study on a national newspaper archive with example research tasks. From this we learned what data is typically available in digital archives and how it could be used to reduce and/or assess the uncertainty in result sets. We conclude that the current knowledge situation on the users’ side as well as on the tool makers’ and data providers’ side is insufficient and needs to be improved.

More details about our work can be found in our paper:
M. Traub, J. van Ossenbruggen, and L. Hardman,
Impact Analysis of OCR Quality on Research Tasks in Digital Archives,
TPDL2015, September, 2015, Poznan, Poland
https://repository.cwi.nl/noauth/search/fullrecord.php?publnr=23429

Ähnliche Bücher

Kostenlos mit einer 30-tägigen Testversion von Scribd

Alle anzeigen

Ähnliche Hörbücher

Kostenlos mit einer 30-tägigen Testversion von Scribd

Alle anzeigen

Impact Analysis of OCR Quality on Research Tasks in Digital Archives

  1. 1. USE CASE STUDY We investigated OCR-related issues of concrete queries on a digital newspaper archive. We tried to better understand the corpus and technical details on how it was created, how to estimate the impact of the OCR errors based on the confidence values assigned by the OCR engine, and what search strategies may yield better results. We found that not all details of the digitization pipeline are known by the collection owner. Scholars are therefore not given sufficient information to assess the reliability of their results. We believe that this is typical for large archives and libraries. M. Traub, J. van Ossenbruggen, and L. Hardman, Impact Analysis of OCR Quality on Research Tasks in Digital Archives, TPDL2015, September, 2015, Poznan, Poland INTERVIEWS Categorization of research tasks T1 find the first mention of a concept T2 find a subset with relevant documents T3 investigate quantitative results over time T3.a compare quantitative results for two terms T3.b compare quantitative results from two corpora T4 tasks using external tools on archive data IMPACT ANALYSIS OF OCR QUALITY ON RESEARCH TASKS IN DIGITAL ARCHIVES Myriam C. Traub, Jacco van Ossenbruggen, and Lynda Hardman Centrum Wiskunde & Informatica, Amsterdam The shift in research methodology within the digital humanities has the hidden cost of working with digitally processed historical documents: How much trust can a scholar place in noisy representations of source texts? To improve upon the current situation, we think the communities involved should begin to approach the problem from the scholars’ perspective. Humanity scholars need to transfer their valuable tradition of source criticism into the digital realm, and more openly criticize the potential limitations and biases of the digital tools we provide them with. We interviewed four humanities scholars to find out what 
 research tasks humanities scholars typically perform on digital archives. Our main findings were that scholars use digital archives mainly for exploration or to select items for close reading, that their interest in quantitative analyses is limited because of the perceived low OCR quality, and a categorization of common research tasks (see table below). LITERATURE STUDY Quality in digitization projects of libraries is mostly studied from the point of view of the collection owner, and not from the perspective of the scholar. User-centric studies on digital libraries typically focus on user interface design and other usability issues. Ample research is done on how to reduce the error rates of OCRed 
 text in a post-processing phase or how to improve OCR engines. None of the approaches aiming at improving data quality can remove all errors. Scholars still need to be able to quantify the remaining errors and assess their impact on their research tasks. Confusion Matrix OCR Confidence Values Alternative Confidence Values available: sample only full corpus not available T1 find all queries for x, impractical estimated precision, not helpful improve recall T2 as above estimated precision, requires improved UI improve recall T3 pattern summarized over set of alternative queries estimates of corrected precision estimates of corrected recall T3.a warn for different susceptibility to errors as above, warn for different distribution of confidence values as above T3.b as above as above as above The different types of tasks require different levels of quality. A confusion matrix or confidence values can be used to create a larger query set. Thereby, scholars may be able to better estimate the quality and also (to some extent) to compensate low quality.
  • MEMorrissey

    Jul. 21, 2015

Humanities scholars increasingly rely on digital archives for their research instead of time-consuming visits to physical archives. This shift in research method has the hidden cost of working with digitally processed historical documents: how much trust can a scholar place in noisy representations of source texts? In a series of interviews with historians about their use of digital archives, we found that scholars are aware that optical character recognition (OCR) errors may bias their results. They were, however, unable to quantify this bias or to indicate what information they would need to estimate it. This, however, would be important to assess whether the results are publishable. Based on the interviews and a literature study, we provide a classification of scholarly research tasks that gives account of their susceptibility to specific OCR- induced biases and the data required for uncertainty estimations. We conducted a use case study on a national newspaper archive with example research tasks. From this we learned what data is typically available in digital archives and how it could be used to reduce and/or assess the uncertainty in result sets. We conclude that the current knowledge situation on the users’ side as well as on the tool makers’ and data providers’ side is insufficient and needs to be improved. More details about our work can be found in our paper: M. Traub, J. van Ossenbruggen, and L. Hardman, Impact Analysis of OCR Quality on Research Tasks in Digital Archives, TPDL2015, September, 2015, Poznan, Poland https://repository.cwi.nl/noauth/search/fullrecord.php?publnr=23429

Aufrufe

Aufrufe insgesamt

400

Auf Slideshare

0

Aus Einbettungen

0

Anzahl der Einbettungen

101

Befehle

Downloads

5

Geteilt

0

Kommentare

0

Likes

1

×