IMPACT Final Event 26-06-2012 - Summary of IMPACT project & results by Hildelies Balk (KB, IMPACT Project Director)
1. IMPACT: Challenges and solutions
Hildelies Balk, IMPACT Project Director, KB (National Library of the Netherlands)
2. IMPACT: Challenges and solutions
Overview of this presentation
• Challenges in digitisation of historical full text
• IMPACT objectives
• Approach
• Achievements
• Better, Faster, Cheaper
3. IMPACT: Challenges and solutions
The content
• Shared vision in Europe: all cultural heritage available in digital form in this decade
• Billions of pages of historical (pre-1900) text in libraries in Europe
• Users expect full text to search, tag and re-use
• Just image and metadata not enough
4. IMPACT: Challenges and solutions
The full text
VVt Venetien den 1.Junij, Anno 1618.
DJgn i f paffato te S' aö'Jifeert mo?üen/bah
.)etgi'uotbciraetail)i.r/JtmelchontDecht
te / sbnbe bele btr felbrr geiufttceert baer bnber
eeniglje jprant o^fen/bie ftcb .met beSpaenfcbeu
enbeeemgljen bifet Cbeiiupcen berbonbru befe
7. IMPACT: Challenges and solutions
Answering the challenges: IMPACT
IMPACT – Improving Access to Text (2008-2011)
• Large-scale integrating research project
• Consortium of 26 partners
• Coordinated by the National Library of the Netherlands (KB)
• Co-funded by EU (FP7 ICT Work Programme)
Objectives:
Significantly improve mass digitisation of historical printed text by:
• Innovating OCR software and language technology
• Sharing expertise and building capacity across Europe
• Providing facilities for future research and development
Making text digitisation better, faster, cheaper!
8. IMPACT: Challenges and solutions
IMPACT Approach
• Content holders, researchers and industry work together to find solutions
• Based on real-life problems in digitisation
• Tackle each step in the digitisation workflow from scan to full text
Digitisation workflow and contributing partners:
• Preparation and scanning: guidelines and case studies (all partners)
• Image enhancement: binarisation, noise removal, geometrical defects correction (NCSR, USAL, ABBYY)
• Segmentation and document analysis (USAL, NCSR, ABBYY)
• OCR: ABBYY FR; IBM Adaptive; dictionaries/interface (LMU, INL); experimental engines (USAL, NCSR, UIBK)
• Post correction and enrichment: CONCERT (IBM); Error Profiler (LMU); language resources (9 partners); platform for document understanding based on OCR (UIBK)
9. IMPACT: Challenges and solutions
IMPACT Approach - continued
• Tools to be coupled in Interoperability Framework
• Tested with Evaluation tools and metrics
• Against representative set of test data with Ground Truth
• Basis for further research and development
10. IMPACT: Challenges and solutions
Achievements: summary
On market: Improved ABBYY FR Engine 10, Recognition Server 3, Cloud OCR
In use in productive environment:
• Service for document structure recognition
• Dutch and Slovene dictionary
• Aletheia
Ready for testing in productive environment:
• Adaptive OCR engine
• Tools for OCR correction with volunteer involvement
• Computer lexica for nine languages
• Digitisation Framework with evaluation tools and dataset
• Knowledge bank with guidelines and learning resources
For future development:
• Novel approaches to preprocessing, OCR and post correction
• New language resources with tools for lexicon building
IMPACT Centre of Competence for digitisation
• Added value: Unique network bringing together experts from different communities
12. IMPACT: Challenges and solutions
Results: better & faster
• All tools evaluated in different test scenarios on IMPACT dataset
• All individual tools show improvement on state of the art
• Some examples of results – there is more!
Examples, by workflow step (image enhancement; segmentation and document analysis; OCR; post correction and enrichment):
• Better: tested on 38718 randomly selected historical images and achieved a success of 98.93% (SoA up to 97.3%)
• Better: hybrid line segmentation on 2,700 text lines: SoA 90.9% → IMPACT 98.8%
• Better: recognition of old fonts FR9 → FR10: 25% reduction of errors
• Better, faster: Adaptive OCR on small testset halves FOM (post processing level required)
• Better: language resources show improvement for all 9 languages
• Faster: CONCERT increases correction speed up to 40%
• Faster: postcorrection with Error Profiler up to 2.7 times faster than without
• Better: rule set for extracting table of content entries from historical books outperforms best results of the INEX competition 2011
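To compare results like these on a common scale, a success rate can be converted into a relative reduction in errors. The sketch below is only an illustration: it assumes the quoted percentages are success rates whose complement is the error rate, and the function name `relative_error_reduction` is hypothetical, not a metric defined by IMPACT.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Relative reduction in error rate (in percent).

    Assumption: error rate = 100 - success rate, both in percent.
    """
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    if err_before <= 0:
        raise ValueError("baseline success rate must be below 100%")
    return 100.0 * (err_before - err_after) / err_before


# Hybrid line segmentation figures quoted on the slide: SoA 90.9% -> IMPACT 98.8%
print(round(relative_error_reduction(90.9, 98.8), 1))  # 86.8
```

Under this reading, the line segmentation result corresponds to removing roughly 87% of the remaining errors.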
13. IMPACT: Challenges and solutions
Results: cheaper
Industry in IMPACT:
• ABBYY FR Historic Fonts Module more than 10 times cheaper; more flexible rates overall
• IBM Adaptive OCR and CONCERT: flexible rates
Research in IMPACT:
• Key Language resources free
• All tools by research partners free for research and free / low rates for non-commercial use (individual licensing required), subject to volume, kind of use and material, support etc.
Framework:
• Digitisation Framework free and open source
• Open source wrapper to plug in other tools
• Fruitful contacts with new tool providers
14. IMPACT: Challenges and solutions
Benefits
For the digital library
• Rough average of all tests by developers on IMPACT dataset indicates consistent improvement of up to 20%
• Better access, faster and cheaper production
For the end user
• Main interest: retrieval = words searched and found correctly
• Preliminary results of ABBYY FR 10 with Dutch lexicon on difficult material (Dutch 17th century newspaper): 15% increase of words found
For 1 M words this means 150 K more words found
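The arithmetic behind this figure can be sketched as follows. It assumes the 15% is counted in percentage points of all words searched (the reading under which 1 M words gives 150 K more); the function name `extra_words_found` is purely illustrative.

```python
def extra_words_found(total_words: int, gain_percent: int) -> int:
    """Additional words found, given a gain in percentage points of total words.

    Assumption: the gain applies to the total number of words searched.
    """
    return total_words * gain_percent // 100


print(extra_words_found(1_000_000, 15))  # 150000
```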
...and this is just the beginning!