IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




OCR in libraries – some practical remarks

                          Günter Mühlberger
                          Department for Digitisation and Digital Preservation
                          University Innsbruck Library




OCR in Libraries
       Not an easy chapter...
       Is the glass half empty or half full?
       Historical fonts: Black letter, gothic, Old Cyrillic, ...
       Major full-text initiatives
          – JSTOR (1994)
          – Google (2004)
 But: Still many digital libraries without integrated full-text




OCR and Digitization
 OCR changes everything!
 Workflow has to be adapted at all steps
          –      Preparation and selection of material
          –      Image processing & scanning
          –      Quality control
          –      Storage and preservation
          –      Correction and user involvement
          –      Full-text search
          –      Web interfaces for digital libraries
 Significant increase in complexity




Preparation
 Which material will be taken for scanning? Options:
          – Bound volumes?
          – Microfilm?
          – Loose folios?




Option: Bound volumes
 Bound volumes
          – Pros:
                      This is how books, journals and newspapers are held in the library
          – Cons:
                      Often narrow binding, especially with newspapers
                      Often warping due to humidity
          – Remark
                      Technical solution: ScanRobots make life easier and double the speed
                       compared to manual scanning, e.g. 700 – 1000 pages per hour
                      The investment required for ScanRobots must not be underestimated








Option: Microfilm
 Microfilm
          – Pros:
                      If a microfilm is available it is a cheap alternative
                      Easy option (no handling of volumes)
          – Cons:
                      Microfilms have the same problems as bound volumes
                      Microfilms were often produced with minimal quality control
                      Microfilms made before 1990 are often in poor condition
 Remark
          – If the microfilm was produced with good quality, then there is no significant
            difference in OCR quality
                      Case study with BL material will be published on the IMPACT site







Option: Loose folios
 Pros
          – No narrow binding, less warping
          – Extremely fast throughput with industrial scanners, at low cost
          – Duplicates can be sent to off-shore providers in large batches
 Cons
          – Not feasible for material before 1850 – libraries would struggle to justify it
          – Organisational effort to organise duplicates (but completeness has to be checked
            anyway)
 Remark
          – By far the best option to produce high quality with the fewest resources
          – Especially interesting for newspapers, 20th century material and grey literature
          – Used e.g. by MOA, JSTOR





Good, bad and ugly images
 Careful scanning is the alpha and omega
          – ScanRobots and document scanners reduce the demands on the
            operator, but individual skill is still decisive
 Criteria for a good page image are simple:
          –      sharp
          –      distinct characters with clean contours
          –      clean background, no show-through from the reverse side
          –      no warping of the page and no geometric distortion
          –      complete shot with a white margin around the text area
          –      text lines parallel or perpendicular to the page borders
          –      no noise introduced by users
 If you have perfect images you can wait until OCR technology
  improves; with bad images you will never get good results




Bad print – broken characters




   [Images not reproduced; legible labels: "und", "wenn"]





Bitonal or 8/24 Bit – 300 or 400 ppi – JPEG or TIFF?
 Bitonal vs. 8/24 bit
          – Rose Holley, D-Lib paper (2009): greyscale scanning does not lead to better results
          – Experiment: Microfilm scanned bitonal or greyscale – no difference
 Simple experiments show the opposite
          – Innsbrucker Zeitungsarchiv: bitonal and 24 bit
          – Results are clearly better with colour
 300 or 400 ppi resolution
          – Very small font: Word text in a 4-point font
 JPEG vs. TIFF RGB
          – Tests with the Treventus ScanRobot, and also with other material, show that
            TIFF RGB images have no advantage over compressed
            JPEGs
 Modern documents with medium-sized fonts can be scanned at 300
  ppi and bitonal, but documents with small fonts, challenging paper
  quality etc. should be scanned at 400 ppi and 8 or 24 bit and can be
  stored as JPEGs with e.g. 90% compression rate
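
The 300 vs. 400 ppi question for small fonts can be illustrated with a little arithmetic. The sketch below is not from the slides; it only assumes the usual typographic point of 1/72 inch and converts a nominal font size into pixels. The function name is hypothetical, and the nominal size is the full type height, so the visible letter bodies are smaller still.

```python
POINTS_PER_INCH = 72  # assumption: desktop-publishing point (1/72 inch)


def font_height_px(point_size: float, ppi: int) -> float:
    """Nominal height of a font of the given point size, in scanned pixels."""
    return point_size / POINTS_PER_INCH * ppi


if __name__ == "__main__":
    # The 4-point test font mentioned above, at both candidate resolutions.
    for ppi in (300, 400):
        print(f"4 pt at {ppi} ppi: {font_height_px(4, ppi):.1f} px")
```

At 400 ppi the same glyph is sampled with a third more pixels in each direction than at 300 ppi, which is one plausible reason why very small fonts benefit from the higher resolution.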




Accuracy
 Is the glass half full or half empty?
          – Rose Holley: below 90% word recognition is a poor result
          – Google: OCR every image, so every correctly recognized word is better
            than nothing
          – Painful errors?
          – Mature users?


 Character vs. word accuracy
          – Word accuracy says much more and is much easier to determine: each word
            that would be found correctly in a full-text search can be counted as
            correct.
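
The difference between the two measures can be sketched as follows. This is an illustrative implementation, not the evaluation method used in the projects mentioned here: character accuracy is assumed to be based on Levenshtein edit distance, and word accuracy counts ground-truth words that reappear unchanged in the OCR output, in the spirit of the full-text search criterion above. The sample strings are invented.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_accuracy(truth: str, ocr: str) -> float:
    """Share of ground-truth characters not affected by an edit operation."""
    if not truth:
        return 1.0
    return max(0.0, 1 - levenshtein(truth, ocr) / len(truth))


def word_accuracy(truth: str, ocr: str) -> float:
    """Share of ground-truth words that occur unchanged in the OCR text."""
    truth_words = Counter(truth.lower().split())
    ocr_words = Counter(ocr.lower().split())
    total = sum(truth_words.values())
    if total == 0:
        return 1.0
    found = sum(min(n, ocr_words[w]) for w, n in truth_words.items())
    return found / total


if __name__ == "__main__":
    truth = "und wenn der Tag kommt"   # invented ground truth
    ocr = "und wcnn der Tag kornmt"    # invented OCR output with two damaged words
    print(f"character accuracy: {character_accuracy(truth, ocr):.2f}")
    print(f"word accuracy:      {word_accuracy(truth, ocr):.2f}")
```

A few wrong characters spread over several words lower word accuracy much more than character accuracy, which is why the word-level figure is closer to what a searching user experiences.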




Examples from real world projects
 Based on: ABBYY Recognition Server 2
          –      Reichstagsprotokolle, 1925
          –      Zedler, 1744
          –      Coburger Zeitung, 1808
          –      Judentum, 1803
          –      Eckartshausen, 1792
          –      Landesbauernkammer, 1921
          –      Galvani, 1793
          –      Hieber, 1722
          –      Hofmann, 1875
          –      Buschendorf, 1805
          –      Schreiben, 1689
          –      Latin texts




Correction of OCR text
 Until recently regarded as "absurd"
 But:
          – Crowd sourcing
          – New technologies
 Crowd sourcing
          –      Figures from the Australian Newspaper Project:
          –      Correction via a simple editor: line-by-line correction
          –      Since August 2008, 6,000 users have contributed
          –      7 million lines in 318,000 articles were corrected
          –      At 50 characters per line, this is worth about 200,000 EUR (
                 compared with the prices of service providers)
 New technologies
          – IBM: CONCERT Tool, LMU: PostCorrection Tool
          – Productivity is expected to improve several-fold compared with simple
            rekeying (at least 1:5)
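
The crowdsourcing figures above imply a price per corrected character. The following sketch only redoes that arithmetic from the numbers on the slide (7 million lines, 50 characters per line, about 200,000 EUR); the function names are hypothetical and the result is an implied rate, not a quoted provider price.

```python
def implied_price_per_1000_chars(lines: int, chars_per_line: int, value_eur: float) -> float:
    """EUR per 1,000 corrected characters implied by a total value estimate."""
    total_chars = lines * chars_per_line
    return value_eur / total_chars * 1000


def average_lines_per_user(lines: int, users: int) -> float:
    """Average number of corrected lines per contributing user."""
    return lines / users


if __name__ == "__main__":
    rate = implied_price_per_1000_chars(7_000_000, 50, 200_000)
    print(f"total characters: {7_000_000 * 50:,}")
    print(f"implied price: {rate:.2f} EUR per 1,000 characters")
    print(f"lines per user: {average_lines_per_user(7_000_000, 6_000):.0f}")
```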




What to do with OCR results?
 Structural enhancement
          – INEX: competition based on OCR files
          – Functional Extension Parser
 Preservation
          –      Complexity is significantly increased
          –      Output: TXT, PDF, ABBYY XML
          –      ALTO Format
          –      How to integrate users' corrections?
          –      Proposal for extending the ALTO format
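
As a rough illustration of what working with ALTO output can look like, the sketch below reads the text lines and the low-confidence words from an ALTO fragment using only the Python standard library. The fragment is a simplified, invented sample (real files also carry coordinates, styles and a version-specific namespace), and the confidence threshold is an arbitrary assumption. It relies on the ALTO elements `TextLine` and `String` and the attributes `CONTENT` and `WC` (word confidence).

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified ALTO fragment for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace><TextBlock ID="B1">
    <TextLine ID="L1">
      <String CONTENT="und" WC="0.98"/><SP/><String CONTENT="wcnn" WC="0.41"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text: str) -> list[str]:
    """Return the plain text of each TextLine, words joined by spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [child.get("CONTENT", "") for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


def low_confidence_words(xml_text: str, threshold: float = 0.5) -> list[str]:
    """Return words whose WC value is below the (assumed) threshold."""
    root = ET.fromstring(xml_text)
    return [element.get("CONTENT", "") for element in root.iter()
            if local_name(element.tag) == "String"
            and element.get("WC") is not None
            and float(element.get("WC")) < threshold]


if __name__ == "__main__":
    print(alto_lines(SAMPLE_ALTO))
    print(low_confidence_words(SAMPLE_ALTO))
```

Words flagged in this way would be natural candidates to present to users for correction.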




Digital library applications
 Full-text search
          – JSTOR, Google, publishers
          – Faceted search (Solr)
 Indexing by search engines
          – Site XML
 Visibility of the OCR text
          – User training (learning by doing)
          – Necessary if user correction is to be included
 New research fields
          – Text mining
          – Linking of texts
          – Near duplicates, similarity and new identifiers




Summary
 OCR is a "must"
          – For documents of the 19th and 20th centuries OCR generally provides
            useful or even very good results
          – Before 1800: improvements can be expected from IMPACT
          – Careful and exact scanning is always the main prerequisite, preferably
            at 400 ppi and 8 or 24 bit
          – Test runs with random sets
 Modern applications
          –      Full-text search
          –      Visibility of the erroneous text
          –      Options for users to correct the text
          –      Several export formats (also for end users)
          –      Site XML for search engines
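
A test run with a random set can be prepared with a few lines of code. The sketch below is only one possible way to draw such a sample; the page identifiers, the sample size and the fixed seed are assumptions made for illustration, the seed being there so that the same test set can be drawn again later.

```python
import random


def pick_test_pages(page_ids, sample_size, seed=0):
    """Draw a reproducible random sample of pages for an OCR test run."""
    rng = random.Random(seed)          # fixed seed: same sample on every run
    pages = list(page_ids)
    k = min(sample_size, len(pages))   # never ask for more pages than exist
    return sorted(rng.sample(pages, k))


if __name__ == "__main__":
    # Hypothetical collection of 10,000 page identifiers.
    all_pages = [f"page_{n:05d}" for n in range(1, 10_001)]
    test_set = pick_test_pages(all_pages, sample_size=50)
    print(len(test_set), test_set[:3])
```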




                Thank you for your attention!

Bratislava WS - Mühlberger - OCR in libraries_pdf
