IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




OCR in libraries – some practical remarks

                          Günter Mühlberger
                          Department for Digitisation and Digital Preservation
                          University Innsbruck Library




OCR in Libraries
 • Not an easy chapter...
 • Is the glass half empty or half full?
 • Historical fonts: black letter, Gothic, Old Cyrillic, ...
 • Great attempts at full text
          – JSTOR (1994)
          – Google (2004)
 • But: still many digital libraries without integrated full text




OCR and Digitization
 • OCR changes everything!
 • The workflow has to be adapted at every step
          – Preparation and selection of material
          – Image processing & scanning
          – Quality control
          – Storage and preservation
          – Correction and user involvement
          – Full-text search
          – Web interfaces for digital libraries
 • Significant increase in complexity




Preparation
 • Which material will be used for scanning? Options:
          – Bound volumes?
          – Microfilm?
          – Loose folios?




Option: Bound volumes
 • Bound volumes
          – Pros:
                      • This is how books, journals and newspapers are held in the library
          – Cons:
                      • Often tight binding, especially with newspapers
                      • Often warping due to humidity
          – Remark:
                      • Technical solution: ScanRobots make life easier and double the speed
                        compared to manual operation, e.g. 700–1,000 pages per hour
                      • The investment in ScanRobots must not be underestimated








Option: Microfilm
 • Microfilm
          – Pros:
                      • If a microfilm is available, it is a cheap alternative
                      • Easy option (no handling of volumes)
          – Cons:
                      • Microfilms have the same problems as bound volumes
                      • Microfilms were often produced with minimal quality control
                      • Microfilms produced before 1990 are often not in good condition
 • Remark
          – If the microfilm was produced with good quality, then there is no significant
            difference in OCR quality
                      • A case study with BL material will be published on the IMPACT site







Option: Loose folios
 • Pros
          – No tight binding, less warping
          – Extremely fast throughput with industrial scanners, at a low price
          – Duplicates can be sent to offshore providers in large batches
 • Cons
          – Not feasible for material before 1850: libraries would find it hard to justify
          – Organisational effort to manage duplicates (but completeness has to be checked
            anyway)
 • Remark
          – By far the best option for producing high quality with the fewest resources
          – Especially interesting for newspapers, 20th century material and grey literature
          – Used e.g. by MOA, JSTOR





Good, bad and ugly images
 • Careful scanning is the be-all and end-all
          – ScanRobots and document scanners lower the demands on the operator,
            but individual skill is still decisive
 • Criteria for a good page image are simple:
          – sharp
          – distinct letterforms with clear curves
          – clean background, no show-through from the reverse side
          – no warping of the page and no geometrical distortions
          – complete capture with some white margin around the text borders
          – text lines parallel or at right angles to the page borders
          – no noise from users
 • With perfect images you can wait until OCR technology improves;
   with bad images you will never get good results




Bad print – broken characters
[Figure: image examples of broken characters; sample words: "und", "wenn"]




Bitonal or 8/24 bit – 300 or 400 ppi – JPEG or TIFF?
 • Bitonal vs. 8/24 bit
          – Rose Holley, D-Lib paper (2009): greyscale scanning does not lead to better results
          – Experiment: microfilm scanned bitonal or greyscale, no difference
 • Simple experiments show the opposite
          – Innsbrucker Zeitungsarchiv: bitonal and 24 bit
          – Results are clearly better with colour
 • 300 or 400 ppi resolution
          – Very small font: Word text in a 4 point font
 • JPEG vs. TIFF RGB
          – Tests with the Treventus ScanRobot, but also with other material, show that
            TIFF RGB images have no advantage over compressed JPEGs
 • Modern documents with medium-sized fonts can be scanned at 300 ppi and bitonal,
   but documents with small fonts, challenging paper quality etc. should be scanned
   at 400 ppi and 8 or 24 bit, and can be stored as JPEGs with e.g. a 90% compression rate




Accuracy
 • Is the glass half full or half empty?
          – Rose Holley: below 90% word recognition is a poor result
          – Google: OCR every image, so every correctly recognised word is better
            than nothing
          – Painful errors?
          – Mature users?

 • Character vs. word accuracy
          – Word accuracy says much more and is much easier to determine: each word
            that would be found correctly by a full-text search can be counted as
            correct.
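
The search-oriented notion of word accuracy can be sketched in a few lines. This is only an illustration, assuming a ground-truth transcription of the page is available; the function names and the normalisation (lower-casing, ignoring punctuation) are assumptions and not taken from the talk. A rough character-level measure is included for contrast.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    """Lower-case the text and split it into word tokens (assumed normalisation)."""
    return re.findall(r"\w+", text.lower())


def word_accuracy(ground_truth, ocr_text):
    """Share of ground-truth words that a full-text search would find in the OCR text."""
    truth = Counter(tokenize(ground_truth))
    found = Counter(tokenize(ocr_text))
    total = sum(truth.values())
    if total == 0:
        return 1.0
    # A word counts as correct once for every matching occurrence in the OCR output.
    correct = sum(min(count, found[word]) for word, count in truth.items())
    return correct / total


def character_accuracy(ground_truth, ocr_text):
    """Rough character-level similarity (difflib ratio, not a strict edit distance)."""
    return SequenceMatcher(None, ground_truth, ocr_text).ratio()
```

With one misread word out of four, `word_accuracy("und wenn der Tag", "und wcnn der Tag")` gives 0.75, while the character-level figure stays much higher because only a single character differs.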




Examples from real-world projects
 • Based on: ABBYY Recognition Server 2
          – Reichstagsprotokolle, 1925
          – Zedler, 1744
          – Coburger Zeitung, 1808
          – Judentum, 1803
          – Eckartshausen, 1792
          – Landesbauernkammer, 1921
          – Galvani, 1793
          – Hieber, 1722
          – Hofmann, 1875
          – Buschendorf, 1805
          – Schreiben, 1689
          – Lateinische Texte




Correction of OCR text
 • Until recently regarded as "absurd"
 • But:
          – Crowdsourcing
          – New technologies
 • Crowdsourcing
          – Figures from the Australian Newspaper Project:
          – Correction via a simple editor: line-by-line correction
          – Since August 2008, 6,000 users have contributed
          – 7 million lines in 318,000 articles were corrected
          – At 50 characters per line, this is worth about 200,000 EUR
            (compared to the prices of service providers)
 • New technologies
          – IBM: CONCERT Tool, LMU: PostCorrection Tool
          – Productivity compared to simple rekeying will increase several-fold
            (at least 1:5)
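
The arithmetic behind the crowdsourcing estimate can be made explicit. The sketch below uses only the figures quoted on the slide; the implied price per 1,000 characters is derived from them and is not a figure stated in the talk.

```python
LINES_CORRECTED = 7_000_000      # lines corrected by users (from the slide)
CHARS_PER_LINE = 50              # assumption stated on the slide
ESTIMATED_VALUE_EUR = 200_000    # estimated value (from the slide)

# Total volume of corrected text under the 50-characters-per-line assumption.
total_chars = LINES_CORRECTED * CHARS_PER_LINE

# Service-provider price that the estimate implies, per 1,000 characters.
implied_price_per_1000_chars = ESTIMATED_VALUE_EUR / total_chars * 1000

print(f"{total_chars:,} characters corrected")
print(f"implied price: {implied_price_per_1000_chars:.2f} EUR per 1,000 characters")
```

Under these assumptions the volunteers corrected 350 million characters, which corresponds to an implied price of roughly 0.57 EUR per 1,000 characters.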




What to do with OCR results?
 • Structural enhancement
          – INEX: competition based on OCR files
          – Functional Extension Parser
 • Preservation
          – Complexity is significantly increased
          – Output: TXT, PDF, ABBYY XML
          – ALTO format
          – How to integrate corrective actions by users?
          – Proposal for enhancing the ALTO format
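
To give an idea of what ALTO output looks like in practice, here is a minimal sketch that reads words and their word-confidence values (`WC`) from an ALTO file using only the Python standard library. The sample document is heavily simplified (a real file also carries coordinates and further metadata), and the helper names and the 0.5 threshold are hypothetical; flagging low-confidence words is just one conceivable starting point for user correction, not a method proposed in the talk.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written ALTO fragment for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="und" WC="0.98"/><SP/><String CONTENT="wcnn" WC="0.41"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Drop the XML namespace so the code works across ALTO schema versions."""
    return tag.rsplit("}", 1)[-1]


def read_words(alto_xml):
    """Return one list of (word, confidence) pairs per TextLine element."""
    lines = []
    for element in ET.fromstring(alto_xml).iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = []
        for child in element:
            if local_name(child.tag) == "String":
                wc = child.get("WC")
                words.append((child.get("CONTENT", ""), float(wc) if wc else None))
        lines.append(words)
    return lines


def low_confidence_words(alto_xml, threshold=0.5):
    """Words whose confidence lies below the (hypothetical) review threshold."""
    return [word for line in read_words(alto_xml) for word, wc in line
            if wc is not None and wc < threshold]
```

For the sample above, `low_confidence_words(SAMPLE_ALTO)` returns `["wcnn"]`, the misread word a corrector would want to see first.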




Digital library applications
 • Full-text search
          – JSTOR, Google, publishers
          – Faceted search (Solr)
 • Indexing through search engines
          – Site XML
 • Visibility of the OCR text
          – User training (by doing)
          – Necessary if correction is to be included
 • New research fields
          – Text mining
          – Linking of texts
          – Near duplicates, similarity and new identifiers




Summary
 • OCR is a "must"
          – For documents of the 19th and 20th centuries, OCR generally provides
            useful or even very good results
          – Before 1800: improvements can be expected from IMPACT
          – Careful and exact scanning is always the main prerequisite, preferably
            at 400 ppi and 8 or 24 bit
          – Test runs with random sets
 • Modern applications
          – Full-text search
          – Visibility of the erroneous text
          – Options for users to correct the text
          – Several export formats (also for end users)
          – Site XML for search engines
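
A test run with a random set, as recommended in the summary, could be prepared along these lines. This is a minimal sketch; the function name, the page identifiers and the sample size of 50 are illustrative assumptions.

```python
import random


def draw_test_sample(page_ids, sample_size, seed=None):
    """Pick a random subset of pages for a trial OCR run."""
    rng = random.Random(seed)   # a fixed seed makes the test set reproducible
    pages = list(page_ids)
    # Never ask for more pages than the collection contains.
    return sorted(rng.sample(pages, min(sample_size, len(pages))))


# Hypothetical page identifiers for a 1,200-page volume.
pages = [f"vol01_page{n:04d}" for n in range(1, 1201)]
sample = draw_test_sample(pages, sample_size=50, seed=42)
```

Running OCR on such a sample and checking the results by hand gives an early impression of the accuracy to expect before the whole collection is processed.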




                Thank you for your attention!

Weitere ähnliche Inhalte

Mehr von IMPACT Centre of Competence

Advanced Imaging Services at KU Leuven Libraries Webinar slides
Advanced Imaging Services at KU Leuven Libraries Webinar slidesAdvanced Imaging Services at KU Leuven Libraries Webinar slides
Advanced Imaging Services at KU Leuven Libraries Webinar slidesIMPACT Centre of Competence
 

Mehr von IMPACT Centre of Competence (20)

Session6 03.sandra young
Session6 03.sandra youngSession6 03.sandra young
Session6 03.sandra young
 
Session6 02.jeremi ochab
Session6 02.jeremi ochabSession6 02.jeremi ochab
Session6 02.jeremi ochab
 
Session5 04.evangelos varthis
Session5 04.evangelos varthisSession5 04.evangelos varthis
Session5 04.evangelos varthis
 
Session5 03.george rehm
Session5 03.george rehmSession5 03.george rehm
Session5 03.george rehm
 
Session5 02.tom derrick
Session5 02.tom derrickSession5 02.tom derrick
Session5 02.tom derrick
 
Session5 01.rutger vankoert
Session5 01.rutger vankoertSession5 01.rutger vankoert
Session5 01.rutger vankoert
 
Session4 04.senka drobac
Session4 04.senka drobacSession4 04.senka drobac
Session4 04.senka drobac
 
Session3 04.arnau baro
Session3 04.arnau baroSession3 04.arnau baro
Session3 04.arnau baro
 
Session3 03.christian clausner
Session3 03.christian clausnerSession3 03.christian clausner
Session3 03.christian clausner
 
Session3 02.kimmo ketunnen
Session3 02.kimmo ketunnenSession3 02.kimmo ketunnen
Session3 02.kimmo ketunnen
 
Session3 01.clemens neudecker
Session3 01.clemens neudeckerSession3 01.clemens neudecker
Session3 01.clemens neudecker
 
Session2 04.ashkan ashkpour
Session2 04.ashkan ashkpourSession2 04.ashkan ashkpour
Session2 04.ashkan ashkpour
 
Session2 03.juri opitz
Session2 03.juri opitzSession2 03.juri opitz
Session2 03.juri opitz
 
Session2 02.christian reul
Session2 02.christian reulSession2 02.christian reul
Session2 02.christian reul
 
Session2 01.emad mohamed
Session2 01.emad mohamedSession2 01.emad mohamed
Session2 01.emad mohamed
 
Session1 04.florian fink
Session1 04.florian finkSession1 04.florian fink
Session1 04.florian fink
 
Session1 02.anna-maria sichani
Session1 02.anna-maria sichaniSession1 02.anna-maria sichani
Session1 02.anna-maria sichani
 
Session1 01.konstantin baierer
Session1 01.konstantin baiererSession1 01.konstantin baierer
Session1 01.konstantin baierer
 
Advanced Imaging Services at KU Leuven Libraries Webinar slides
Advanced Imaging Services at KU Leuven Libraries Webinar slidesAdvanced Imaging Services at KU Leuven Libraries Webinar slides
Advanced Imaging Services at KU Leuven Libraries Webinar slides
 
Xii simposi internacional noves tendencies
Xii simposi internacional noves tendenciesXii simposi internacional noves tendencies
Xii simposi internacional noves tendencies
 

Kürzlich hochgeladen

Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...raviapr7
 
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptxSandy Millin
 
Education and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxEducation and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxraviapr7
 
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptxPISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptxEduSkills OECD
 
How to Use api.constrains ( ) in Odoo 17
How to Use api.constrains ( ) in Odoo 17How to Use api.constrains ( ) in Odoo 17
How to Use api.constrains ( ) in Odoo 17Celine George
 
Philosophy of Education and Educational Philosophy
Philosophy of Education  and Educational PhilosophyPhilosophy of Education  and Educational Philosophy
Philosophy of Education and Educational PhilosophyShuvankar Madhu
 
Patterns of Written Texts Across Disciplines.pptx
Patterns of Written Texts Across Disciplines.pptxPatterns of Written Texts Across Disciplines.pptx
Patterns of Written Texts Across Disciplines.pptxMYDA ANGELICA SUAN
 
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdfP4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdfYu Kanazawa / Osaka University
 
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...Nguyen Thanh Tu Collection
 
Quality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICEQuality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICESayali Powar
 
General views of Histopathology and step
General views of Histopathology and stepGeneral views of Histopathology and step
General views of Histopathology and stepobaje godwin sunday
 
How to Solve Singleton Error in the Odoo 17
How to Solve Singleton Error in the  Odoo 17How to Solve Singleton Error in the  Odoo 17
How to Solve Singleton Error in the Odoo 17Celine George
 
Diploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdfDiploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdfMohonDas
 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxAditiChauhan701637
 
What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?TechSoup
 
The Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George WellsThe Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George WellsEugene Lysak
 
How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17Celine George
 
CAULIFLOWER BREEDING 1 Parmar pptx
CAULIFLOWER BREEDING 1 Parmar pptxCAULIFLOWER BREEDING 1 Parmar pptx
CAULIFLOWER BREEDING 1 Parmar pptxSaurabhParmar42
 
Presentation on the Basics of Writing. Writing a Paragraph
Presentation on the Basics of Writing. Writing a ParagraphPresentation on the Basics of Writing. Writing a Paragraph
Presentation on the Basics of Writing. Writing a ParagraphNetziValdelomar1
 
How to Add a many2many Relational Field in Odoo 17
How to Add a many2many Relational Field in Odoo 17How to Add a many2many Relational Field in Odoo 17
How to Add a many2many Relational Field in Odoo 17Celine George
 

Kürzlich hochgeladen (20)

Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...Patient Counselling. Definition of patient counseling; steps involved in pati...
Patient Counselling. Definition of patient counseling; steps involved in pati...
 
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
2024.03.23 What do successful readers do - Sandy Millin for PARK.pptx
 
Education and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxEducation and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptx
 
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptxPISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
PISA-VET launch_El Iza Mohamedou_19 March 2024.pptx
 
How to Use api.constrains ( ) in Odoo 17
How to Use api.constrains ( ) in Odoo 17How to Use api.constrains ( ) in Odoo 17
How to Use api.constrains ( ) in Odoo 17
 
Philosophy of Education and Educational Philosophy
Philosophy of Education  and Educational PhilosophyPhilosophy of Education  and Educational Philosophy
Philosophy of Education and Educational Philosophy
 
Patterns of Written Texts Across Disciplines.pptx
Patterns of Written Texts Across Disciplines.pptxPatterns of Written Texts Across Disciplines.pptx
Patterns of Written Texts Across Disciplines.pptx
 
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdfP4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
 
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
CHUYÊN ĐỀ DẠY THÊM TIẾNG ANH LỚP 11 - GLOBAL SUCCESS - NĂM HỌC 2023-2024 - HK...
 
Quality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICEQuality Assurance_GOOD LABORATORY PRACTICE
Quality Assurance_GOOD LABORATORY PRACTICE
 
General views of Histopathology and step
General views of Histopathology and stepGeneral views of Histopathology and step
General views of Histopathology and step
 
How to Solve Singleton Error in the Odoo 17
How to Solve Singleton Error in the  Odoo 17How to Solve Singleton Error in the  Odoo 17
How to Solve Singleton Error in the Odoo 17
 
Diploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdfDiploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdf
 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptx
 
What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?What is the Future of QuickBooks DeskTop?
What is the Future of QuickBooks DeskTop?
 
The Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George WellsThe Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George Wells
 
How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17How to Show Error_Warning Messages in Odoo 17
How to Show Error_Warning Messages in Odoo 17
 
CAULIFLOWER BREEDING 1 Parmar pptx
CAULIFLOWER BREEDING 1 Parmar pptxCAULIFLOWER BREEDING 1 Parmar pptx
CAULIFLOWER BREEDING 1 Parmar pptx
 
Presentation on the Basics of Writing. Writing a Paragraph
Presentation on the Basics of Writing. Writing a ParagraphPresentation on the Basics of Writing. Writing a Paragraph
Presentation on the Basics of Writing. Writing a Paragraph
 
How to Add a many2many Relational Field in Odoo 17
How to Add a many2many Relational Field in Odoo 17How to Add a many2many Relational Field in Odoo 17
How to Add a many2many Relational Field in Odoo 17
 

Bratislava WS - Mühlberger - OCR in libraries_pdf

  • 1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. OCR in libraries – some practical remarks Günter Mühlberger Department for Digitisation and Digital Preservation University Innsbruck Library
  • 2. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. OCR in Libraries  Not an easy chapter...  Is the glass half empty or half full?  Historical fonts: Black letter, gothic, Old Cyrillic, ...  Great attempts for full-text – JSTOR (1994) – Google (2004)  But: Still many digital libraries without integrated full-text
  • 3. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. OCR and Digitization  OCR changes everything!  Workflow has to be adopted at all steps – Preparation and selection of material – Image processing & scanning – Quality control – Storage and preservation – Correction and user involvement – Full-text search – Web interfaces for digital libraries  Significant increase in complexity
  • 4. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 4
  • 5. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 5
  • 6. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 6
  • 7. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 7
  • 8. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 8
  • 9. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 9
  • 10. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 10
  • 11. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 11
  • 12. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 12
  • 13. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 13
  • 14. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Preparation  Which material will be taken for scanning? Options: – Bound volumes? – Microfilm? – Loose folios?
  • 15. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Option: Bound volumes  Bound volumes – Pros:  That’s the way books/journals/newspapers are in the library – Cons:  Often narrow binding, especially with newspapers  Often warping due to humidity – Remark  Technical solution: ScanRobots make life easier and double the speed compared to manual interaction, e.g. 700 – 1000 pages per hour  Investment for ScanRobots must not be underestimated 15
  • 16. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Option: Microfilm  Microfilm – Pros:  If a microfilm is available it is a cheap alternative  Easy option (no handling of volumes) – Cons:  Microfilms have the same problems as bound volumes  Microfilms were often produced with minimum quality control  Microfilms before 1990 are often not in a good condition  Remark – If microfilm was produced with good quality than there is no significant difference in the OCR quality  Case study with BL material will be published on IMPACT site 16
  • 17. Option: Loose folios. Pros: no narrow binding and less warping; extremely fast performance with industry scanners at a low price; duplicates can be sent to off-shore providers in huge packages. Cons: not feasible for material before 1850, since libraries would run into justification problems; organisational effort to organise duplicates (but completeness has to be evaluated anyway). Remark: by far the best option to produce high quality with the lowest resources; especially interesting for newspapers, 20th century material and grey literature; used e.g. by MOA and JSTOR.
  • 18. Good, bad and ugly images. Careful scanning is the be-all and end-all: ScanRobots and document scanners lower the requirements for a good operator, but individual capability is still decisive. The criteria for a good page image are simple: sharp; distinct fonts with clear curves; clear background with no show-through from the reverse side; no warping of the page and no geometrical distortions; complete shot with some white frame around the text borders; text lines parallel or at right angles to the borders; no noise caused by users. If you have perfect images you can wait until OCR technology improves; with bad images you will never get good results.
  • 23. Bad print: broken characters
  • 24. und wenn
  • 26. Bitonal or 8/24 bit, 300 or 400 ppi, JPEG or TIFF? Bitonal vs. 8/24 bit: Rose Holley (D-Lib paper, 2009) reports that greyscale scanning does not lead to better results; in her experiment, microfilm scanned bitonal or greyscale showed no difference. Simple experiments show the opposite: for the Innsbrucker Zeitungsarchiv, scanned bitonal and in 24 bit, results are clearly better with colour. 300 or 400 ppi resolution: relevant for very small fonts, e.g. Word text in a 4 point font. JPEG vs. TIFF RGB: tests with the Treventus ScanRobot, but also with other material, show that TIFF RGB images have no advantage over compressed JPEGs. Conclusion: modern documents with medium-sized fonts can be scanned bitonal at 300 ppi, but documents with small fonts, challenging paper quality etc. should be scanned at 400 ppi in 8 or 24 bit and can be stored as JPEGs with e.g. 90% compression rate.
  • 27. Accuracy. Is the glass half full or half empty? Rose Holley: less than 90% word recognition is a poor result. Google: OCR every image, so every correctly recognized word is better than nothing. Painful errors? Mature users? Character vs. word accuracy: word accuracy is much more meaningful and much easier to measure, because each word that would be found correctly in a full-text search can be counted as correct.
  • 28. Examples from real world projects, based on ABBYY Recognition Server 2: Reichstagsprotokolle, 1925; Zedler, 1744; Coburger Zeitung, 1808; Judentum, 1803; Eckartshausen, 1792; Landesbauernkammer, 1921; Galvani, 1793; Hieber, 1722; Hofmann, 1875; Buschendorf, 1805; Schreiben, 1689; Latin texts.
  • 29. Correction of OCR text. Until recently regarded as "absurd", but now made feasible by crowd sourcing and new technologies. Crowd sourcing, with figures from the Australian Newspaper Project: correction via a simple editor, line by line; since August 2008, 6,000 users have contributed; 7 million lines in 318,000 articles were corrected; at 50 characters per line this work is worth about 200,000 EUR (compared to the prices of service providers). New technologies: IBM CONCERT Tool, LMU PostCorrection Tool; productivity compared to simple rekeying will increase several-fold (at least by a factor of 5).
  • 30. What to do with OCR results? Structural enhancement: INEX (competition based on OCR files); Functional Extension Parser. Preservation: complexity is significantly increased; output formats are TXT, PDF, ABBYY XML and the ALTO format; how can corrective actions of users be integrated? Proposition for enhancing the ALTO format.
  • 31. Digital library applications. Full-text search: JSTOR, Google, publishers; faceted search (Solr). Indexing through search engines: Site XML. Visibility of the OCR text: user training (by doing); necessary if correction is to be included. New research fields: text mining; linking of texts; near duplicates, similarity and new identifiers.
  • 32. Summary. OCR is a "must": for documents of the 19th and 20th century, OCR generally provides useful or even very good results; before 1800, improvements can be expected from IMPACT; careful and exact scanning is always the main prerequisite, preferably at 400 ppi and in 8 or 24 bit; do test runs with random sets. Modern applications: full-text search; visibility of the erroneous text; options for users to correct the text; several export formats (also for end users); Site XML for search engines.
  • 33. Thank you for your attention!
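
As a supplement to the accuracy remarks on slide 27, the difference between character and word accuracy can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the measurement procedure used in the projects mentioned: the function names, the tokenisation rule (lower-cased runs of word characters) and the sample strings are all hypothetical. Word accuracy here counts a ground-truth word as correct if the same word also occurs in the OCR output, which approximates "would be found in a full-text search".

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-cased words (assumed tokenisation rule)."""
    return re.findall(r"\w+", text.lower())


def word_accuracy(ground_truth, ocr_text):
    """Share of ground-truth words that also occur in the OCR output."""
    truth = Counter(tokenize(ground_truth))
    found = Counter(tokenize(ocr_text))
    total = sum(truth.values())
    if total == 0:
        return 1.0
    # A word counts as correct at most as often as the OCR produced it.
    matched = sum(min(count, found[word]) for word, count in truth.items())
    return matched / total


def char_accuracy(ground_truth, ocr_text):
    """1 minus (Levenshtein distance / ground-truth length), floored at 0."""
    if not ground_truth:
        return 1.0
    previous = list(range(len(ocr_text) + 1))
    for i, a in enumerate(ground_truth, 1):
        current = [i]
        for j, b in enumerate(ocr_text, 1):
            current.append(min(previous[j] + 1,              # deletion
                               current[j - 1] + 1,           # insertion
                               previous[j - 1] + (a != b)))  # substitution
        previous = current
    return max(0.0, 1 - previous[-1] / len(ground_truth))


# Hypothetical sample: one broken character ("e" read as "c").
truth = "und wenn der Tag kommt"
ocr = "und wcnn der Tag kommt"
print(word_accuracy(truth, ocr))  # 4 of 5 words survive: 0.8
print(char_accuracy(truth, ocr))  # 21 of 22 characters: about 0.95
```

A single misread character costs a whole word in the search index, so word accuracy drops much faster than character accuracy; this is why the word-level figure says more about the usefulness of the full text.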