2. Contents
Introduction
How to build a digital library
Digital library for documents
Annotation of documents
ATAOH tool
User interfaces
Performance evaluation
More about DLs
3. Introduction
Situation: computers are widespread and make information easy to process (edit, save, share).
Objective: convert information (hand sketches, machine-printed paper, etc.) into digital format.
Reason: electronic storage, wider dissemination of information, easier data access, less effort, and more services in less time.
Problem: information overload.
4. Introduction
Solution: organize information and make it accessible and searchable by constructing digital libraries.
Digital library: software supporting direct content-based retrieval through “queries by search keywords”.
Digital library types: printed text and books, scanned handwritten pages, multimedia material.
5. How to build a digital library
Main steps to build a digital library:
Data collection and digitization
Metadata selection and designing the digital library interface
Annotation of digitized data (may be word spotting as well)
Information retrieval techniques
6. Digital library for documents
Digital library aspects: accurate and fast (automated) grouping, filing, indexing and retrieval.
Handwritten data is either offline (a scanned paper image) or online (pen movement on an electronic surface: ink).
These online or offline documents are non-textual documents with textual content.
7. Annotation of documents
Library construction therefore requires knowledge of the textual content (the transcription), also called “ground truth” or “annotation information”.
Annotation: identifying data of a particular type using additional data of a different type that precisely describes its entities.
Document annotation: associating the corresponding ASCII/Unicode text with the image or ink of each paragraph, sentence, word or character.
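As a minimal sketch of what such an association could look like in code, the block below pairs a Unicode transcription with the ink of one annotated unit. The class names (Stroke, AnnotatedUnit), the field names and the helper codepoints are illustrative assumptions, not part of any tool described in these slides.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]  # one (x, y) pen sample


@dataclass
class Stroke:
    """Hypothetical container for one pen-down to pen-up trace."""
    points: List[Point]


@dataclass
class AnnotatedUnit:
    """Hypothetical record linking ink to its Unicode ground truth."""
    level: str  # "paragraph", "sentence", "word" or "character"
    text: str   # the transcription (ground truth)
    strokes: List[Stroke] = field(default_factory=list)


def codepoints(unit: AnnotatedUnit) -> List[str]:
    """Return the transcription as a list of U+XXXX code point labels."""
    return [f"U+{ord(ch):04X}" for ch in unit.text]


# Example: a four-letter Arabic word, written with escapes so the
# code points are unambiguous, attached to a single short stroke.
word = AnnotatedUnit(
    level="word",
    text="\u0643\u062a\u0627\u0628",
    strokes=[Stroke(points=[(0.0, 0.0), (1.0, 0.5)])],
)
```

Keeping the level explicit lets the same record type describe a paragraph, a word or a single character.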
8. Annotation of documents
Reason: conventional text search and information retrieval (in digital libraries or Web search engines) is based on matching or comparing textual descriptions (say, in ASCII/Unicode).
Annotation extends conventional textual search to the image/ink representation of these documents.
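A minimal sketch of this idea, assuming each image/ink region already carries a transcription: an exact-match inverted index maps a keyword to the regions whose ground truth equals it. The function names and the region identifiers are hypothetical, and real retrieval systems typically add normalization and ranking on top of this.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_index(annotations: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each transcription to the ids of the regions annotated with it.

    annotations: (region_id, transcription) pairs, where region_id is a
    hypothetical identifier of a word image or ink segment.
    """
    index: Dict[str, List[str]] = defaultdict(list)
    for region_id, text in annotations:
        index[text].append(region_id)
    return dict(index)


def search(index: Dict[str, List[str]], keyword: str) -> List[str]:
    """Return the regions whose ground truth matches the keyword exactly."""
    return index.get(keyword, [])


# Example: two regions share the same transcription.
index = build_index([
    ("page1/word3", "library"),
    ("page2/word7", "digital"),
    ("page4/word1", "library"),
])
hits = search(index, "library")
```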
10. Annotation of documents
Situation: every region of interest (line, word or character) needs to be identified and annotated, which implies manual annotation.
Problem: manual annotation is laborious, time-consuming and error-prone, especially for huge corpora annotated at the character level.
Solution: semi-automatic and automatic annotation schemes.
12. Annotation of documents
Annotation tools construction
In document annotation, specific details (metadata) are extracted and tagged into XML documents (meta-documents).
The XML representation organizes the data hierarchically. Each level of the hierarchy contains a label element that captures the annotation at that level.
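The hierarchy described above can be sketched with Python's standard xml.etree.ElementTree module. The element names (document, line, word, label) and the attributes are assumptions chosen for illustration; the slides do not specify the actual schema.

```python
import xml.etree.ElementTree as ET
from typing import List


def build_meta_document(doc_id: str, title: str, lines: List[List[str]]) -> ET.Element:
    """Build a document > line > word tree with a label at every level.

    lines: one list of word transcriptions per text line.
    """
    root = ET.Element("document", id=doc_id)
    ET.SubElement(root, "label").text = title
    for line_no, words in enumerate(lines):
        line_el = ET.SubElement(root, "line", index=str(line_no))
        # The line-level label holds the transcription of the whole line.
        ET.SubElement(line_el, "label").text = " ".join(words)
        for word_no, word in enumerate(words):
            word_el = ET.SubElement(line_el, "word", index=str(word_no))
            ET.SubElement(word_el, "label").text = word
    return root


root = build_meta_document("doc1", "sample page", [["w1", "w2"], ["w3"]])
xml_text = ET.tostring(root, encoding="unicode")
```

A character level could be added under each word in the same way.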
14. Annotation of documents
Document retrieval needs all metadata, while handwriting recognition needs only the ground truth of the image/ink trace.
Ground-truthing a document image: annotating its regions, text lines, words and characters.
Few automatic or semi-automatic ground-truthing annotation tools exist for handwritten text.
15. Annotation of documents
Annotation tools construction
Lines, words and strokes are segmented manually or automatically.
The required entity is labeled (truthed) automatically, semi-automatically or manually.
Segmentation and annotation are corrected manually through an interface that supports mouse clicks and keyboard shortcuts.
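As an illustration of automatic segmentation, the sketch below groups the strokes of one text line into words using a simple horizontal-gap heuristic. This is a generic technique chosen for the example, not the method of any tool discussed here; the function name, the (x_min, x_max) stroke representation and the threshold are all assumptions.

```python
from typing import List, Tuple


def group_strokes_into_words(
    strokes: List[Tuple[float, float]], gap_threshold: float
) -> List[List[int]]:
    """Group strokes whose horizontal extents are close together.

    strokes: (x_min, x_max) extent of each stroke on one text line.
    Returns groups of stroke indices, ordered left to right; for a
    right-to-left script the caller can reverse the list of groups.
    """
    order = sorted(range(len(strokes)), key=lambda i: strokes[i][0])
    words: List[List[int]] = []
    current: List[int] = []
    current_right = 0.0
    for i in order:
        x_min, x_max = strokes[i]
        # A gap wider than the threshold starts a new word.
        if current and x_min - current_right > gap_threshold:
            words.append(current)
            current = []
        if current:
            current_right = max(current_right, x_max)
        else:
            current_right = x_max
        current.append(i)
    if current:
        words.append(current)
    return words


groups = group_strokes_into_words([(0, 10), (12, 20), (40, 50), (8, 11)], 5)
```

An operator would still validate the result, since a fixed threshold can merge or split words when spacing is irregular.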
16. Annotation of documents
Annotation tools construction
Literature survey conclusions:
Arabic language research lacks language resources such as public data sets and tools for data collection, annotation and pre-processing.
Document segmentation is the most important requirement in annotation tools.
Segmentation is complex, and automated systems have not yet reached human accuracy.
17. Annotation of documents
Annotation tools construction
Literature survey conclusions:
Higher-level segmentation algorithms are more error-prone, require higher reject thresholds, and demand more expertise from the operator.
The most significant expense of human annotation is human time. Even a 30% reduction in overall human time would be significant in an operational application.
18. Annotation of documents
Annotation tools construction
Literature survey conclusions:
The annotation tool design should provide:
1. Easy document browsing and multiple format support.
2. Ease of annotation and display.
3. Automatic text-line/word segmentation and ground truthing.
4. Manual options for segmentation validation and annotation correction.
19. ATAOH Tool
ATAOH Tool: annotation tool for Arabic online handwriting
1. Easy document browsing and display.
2. Automatic text-line/word extraction and segmentation.
3. Manual options for segmentation validation and annotation correction.
20. ATAOH Tool
4. Composed of a set of interactive user interfaces that guide the user.
5. Reduces human effort through high-performance automation.
6. Annotates Arabic words at the character level to provide annotated datasets for training handwriting recognizers.
21. ATAOH Tool: User Interfaces
The Main GUI opens at startup and shows the user all the operations that can be done.
22. ATAOH Tool: User Interfaces
Word Extraction GUI: appears when the “Word Extraction” pushbutton on the Main GUI is pressed and the document path is specified. Text lines are extracted automatically, and each text line is displayed on the GUI in turn.
23. ATAOH Tool: User Interfaces
The Add Transcription GUI appears when the “Transcript Data File” pushbutton on the Main GUI is pressed and the document path is specified. The previously extracted words are displayed in turn.
24. ATAOH Tool: User Interfaces
Annotation is done by entering the word's ground truth in the ground truth text area.
25. ATAOH Tool: User Interfaces
Automatic segmentation is done by pressing the “Auto Segment” pushbutton.
26. ATAOH Tool: User Interfaces
Manual segmentation correction is done with the “Manual Segment” pushbutton, by drawing lines with mouse clicks.
27. ATAOH Tool: User Interfaces
The stroke data of each character model are calculated and displayed by pressing the 'Insert data' pushbutton.
28. ATAOH Tool: User Interfaces
The 'CHECK' pushbutton plots each character model in a separate figure.
29. ATAOH Tool: User Interfaces
In the output text file format, each word is indexed. The character names are listed in order (from right to left).
Beside each character name, its stroke information is listed: prototype, number of stroke parts, stroke number(s), and start and end indices.
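To make the listed fields concrete, the sketch below models one word entry and writes it out as text. Only the field list (word index, character name, prototype, number of stroke parts, stroke numbers, start and end indices) comes from the description above; the exact line layout, the class names and the integer type of the prototype field are hypothetical, since the actual ATAOH file syntax is not given here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StrokePart:
    stroke_number: int  # which stroke of the word this part belongs to
    start: int          # index of the first sample of the part
    end: int            # index of the last sample of the part


@dataclass
class CharacterEntry:
    name: str           # character name, e.g. a transliterated letter name
    prototype: int      # assumed here to be an integer prototype id
    parts: List[StrokePart]


def format_word(index: int, characters: List[CharacterEntry]) -> str:
    """Render one indexed word; characters are given in right-to-left order.

    The layout below is a made-up example, not the real file format.
    """
    lines = [f"word {index}"]
    for ch in characters:
        parts = " ".join(f"{p.stroke_number}:{p.start}-{p.end}" for p in ch.parts)
        lines.append(f"{ch.name} prototype={ch.prototype} parts={len(ch.parts)} {parts}")
    return "\n".join(lines)


entry = format_word(
    1, [CharacterEntry("ba", 1, [StrokePart(1, 0, 24), StrokePart(2, 0, 3)])]
)
```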
30. Annotation Performance Evaluation
We collected a private data set of online Arabic handwriting and used it for training and testing.
AWAT: average word annotation time.
ADAT: average document annotation time.
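The two measures can be computed as in the sketch below. The slides only expand the acronyms, so the exact definitions are assumptions: AWAT is taken as total annotation time divided by the total number of annotated words, and ADAT as total annotation time divided by the number of documents. The numbers in the example are made up for illustration.

```python
from typing import List


def average_word_annotation_time(doc_times: List[float], doc_word_counts: List[int]) -> float:
    """AWAT (assumed definition): total annotation time / total word count."""
    return sum(doc_times) / sum(doc_word_counts)


def average_document_annotation_time(doc_times: List[float]) -> float:
    """ADAT (assumed definition): total annotation time / number of documents."""
    return sum(doc_times) / len(doc_times)


# Illustrative values only: two documents, times in seconds.
times = [120.0, 180.0]
word_counts = [40, 60]
awat = average_word_annotation_time(times, word_counts)
adat = average_document_annotation_time(times)
```

Measuring both in the same unit for manual and assisted annotation makes any saving in human time directly comparable.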
32. More about DLs
Digital libraries are more than just web sites or stores of information in digital form.
Designers need to provide efficient ways to structure information and represent it digitally using computers.
To design good, usable digital libraries, one requires knowledge about:
who will use them,
what they will be used for,
the work context and the environment in which they will be used, and
what is technically and logistically feasible.
33. More about DLs
Designing good, usable interfaces is not an easy task. Using the best methodology and model in the design of a usable interactive system is not enough.
One still needs to assess the design and test the system to ensure that it behaves as expected and meets end-users' requirements.
It is impossible to design an optimal user interface on the first try.
34. More about DLs
Typical usability defects for interactive systems include:
navigation;
screen design and layout;
terminology;
feedback;
consistency;
modality;
redundancies;
end-user control and match with end-user tasks.
35. More about DLs
Evaluation Criteria
Collection size:
Number of items
Type of items
Estimated storage space
Metadata:
Is there existing metadata?
Is it available in electronic form?
User access functions:
Is there a feasible vision for how the materials will be
accessed by and delivered to researchers?