By:
Ossama Obeid, Houda Bouamor, Wajdi Zaghouani, Mahmoud Ghoneim, Abdelati Hawwari, Mona Diab and Kemal Oflazer
Abstract
In this paper, we introduce MANDIAC, a web-based annotation system designed for rapid manual diacritization of Standard Arabic text. To expedite the annotation process, the system offers annotators a choice of automatically generated diacritization candidates for each word. Our framework provides intuitive interfaces for annotating text and for managing the diacritization annotation process. We describe the annotation and administration interfaces as well as the back-end engine. Finally, we demonstrate that our system doubles the annotation speed compared to using a regular text editor.
P03- MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization
1. MANDIAC: A Web-based Annotation System For Manual Arabic Diacritization
Collaborators: Houda Bouamor, Wajdi Zaghouani, Mahmoud
Ghoneim, Abdelati Hawwari, Mona Diab and Kemal Oflazer
Ossama Obeid
Carnegie Mellon University in Qatar
owo@qatar.cmu.edu
2. Introduction
• Arabic text is composed of consonants, long vowels, and short
vowels (diacritics).
• Absence of diacritics:
o Adds lexical and morphological ambiguity.
o Confusing to beginners.
o Impacts performance of Arabic NLP tasks.
• Very few texts are diacritized.
4. Introduction
• Most automatic diacritization systems are trained on Arabic
Treebanks.
• Different genres and dialects need new datasets:
o Time consuming.
o Must ensure data quality and consistency.
5. Currently Available Annotation Tools
• Very basic text-editor-like interfaces.
• Can’t handle a large number of documents and annotators.
• Not easily customizable.
6. MANDIAC
• Web-based.
• Intuitive and easy to use.
• Easily manages thousands of documents.
• Distributes tasks (including IAA evaluation tasks) to tens of
annotators.
• Doubles annotation speed!
• Based on QAWI.
• Provides Annotation and Annotation Management interfaces.
7. Annotation Interface
• Token-based annotation system similar to QAWI.
• Annotators can choose pre-computed diacritizations (derived
using MADAMIRA) and/or manually edit diacritics.
• Additional features to increase annotator productivity.
8. Annotation Interface
Extra Features:
• Undo/Redo buttons
• Edits restricted to diacritics only
• Timer
• Counter indicating number of words left to annotate
• Link to annotation guidelines
• Token highlighting:
o Annotated words
o Tokens that should not be edited (e.g., digits, non-Arabic words, punctuation)
• Flag documents
• Mark tokens as ambiguous
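The "edits restricted to diacritics only" feature can be illustrated with a minimal sketch. This is not MANDIAC's actual code: the function names and the exact set of characters treated as diacritics (U+064B to U+0652 plus the superscript alef U+0670) are assumptions made for illustration. The idea is that an edit is acceptable only if the word is unchanged once all diacritics are removed.

```python
# Assumed diacritic inventory for this sketch: fathatan through sukun
# (U+064B..U+0652) plus superscript alef (U+0670).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def strip_diacritics(word: str) -> str:
    """Return the word with every character in DIACRITICS removed."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def is_diacritics_only_edit(original: str, edited: str) -> bool:
    """True if the edit added, removed, or changed diacritics only.

    The base letters (everything that is not a diacritic) must be
    identical before and after the edit.
    """
    return strip_diacritics(original) == strip_diacritics(edited)
```

With this check, adding fathas to the three letters of an undiacritized word is accepted, while inserting an extra letter is rejected.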
12. Management Interface
Annotation Workflow Management:
• Upload files in various formats.
• Organize files into groups.
• Assign files to individuals or to a group (for IAA).
• Highlight tasks as untouched, edited, or completed.
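One way to picture the assignment step is the sketch below. It is a hypothetical illustration, not the system's implementation: the function name, the round-robin distribution, and the `seed` parameter are all assumptions. The default fraction of 10% matches the share of documents that the slides say is randomly assigned for IAA; those documents go to every annotator in the group, and the rest go to one annotator each.

```python
import random


def assign_documents(doc_ids, annotators, iaa_fraction=0.10, seed=None):
    """Split documents into a shared IAA set and single-annotator tasks.

    Returns (assignments, iaa_docs), where assignments maps each annotator
    to a list of document ids. The sampling and round-robin scheme here
    are assumptions made for illustration.
    """
    rng = random.Random(seed)
    docs = list(doc_ids)
    rng.shuffle(docs)

    # Randomly chosen documents that every annotator will annotate.
    n_iaa = round(len(docs) * iaa_fraction)
    iaa_docs, single_docs = docs[:n_iaa], docs[n_iaa:]

    assignments = {annotator: list(iaa_docs) for annotator in annotators}

    # Remaining documents are dealt out one annotator at a time.
    for i, doc in enumerate(single_docs):
        assignments[annotators[i % len(annotators)]].append(doc)

    return assignments, iaa_docs
```

Because the IAA documents are shared by the whole group, agreement can later be computed on exactly those documents without any extra bookkeeping.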
13. Management Interface
Evaluation and Monitoring:
• Evaluate IAA.
• Compare annotations to gold reference.
• Use WER and DER as metrics.
• 10% of assigned documents are randomly assigned for IAA.
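The two metrics can be sketched as follows. The slides do not give the exact definitions used, so this sketch assumes a common reading: DER (diacritic error rate) is the fraction of letters whose diacritics differ from the reference, and WER (word error rate) is the fraction of words with at least one such letter. The function names and the diacritic inventory are assumptions, and the sketch expects both texts to have identical base letters.

```python
# Assumed diacritic inventory: U+064B..U+0652 plus superscript alef (U+0670).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def split_letters(word):
    """Return a list of (letter, diacritics) pairs for one word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            letter, marks = pairs[-1]
            pairs[-1] = (letter, marks + ch)  # attach mark to previous letter
        else:
            pairs.append((ch, ""))
    return pairs


def der_wer(reference, hypothesis):
    """Compute (DER, WER) between two whitespace-tokenized texts."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words or len(ref_words) != len(hyp_words):
        raise ValueError("texts must be non-empty with equal token counts")

    letters = letter_errors = word_errors = 0
    for ref, hyp in zip(ref_words, hyp_words):
        ref_pairs, hyp_pairs = split_letters(ref), split_letters(hyp)
        if [l for l, _ in ref_pairs] != [l for l, _ in hyp_pairs]:
            raise ValueError("base letters differ: %r vs %r" % (ref, hyp))
        # Compare marks as sorted lists so that mark order does not matter.
        errs = sum(sorted(rm) != sorted(hm)
                   for (_, rm), (_, hm) in zip(ref_pairs, hyp_pairs))
        letters += len(ref_pairs)
        letter_errors += errs
        if errs:
            word_errors += 1
    return letter_errors / letters, word_errors / len(ref_words)
```

The same function serves both uses listed above: comparing two annotators on a shared document for IAA, or comparing one annotator against a gold reference.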
16. System Design and Architecture
• Four main components:
o Annotation interface
o Management interface
o Back-end server
o MADAMIRA
Component interaction diagram
17. System Design and Architecture
Data storage:
• Relational database (SQL):
o Fast data search and retrieval.
o Almost any SQL database can be used.
• Annotation data stored as JSON blobs:
o Flexible data format.
o Quickly add new functionality and annotation modes with little back-end
modification.
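The storage idea can be sketched with Python's built-in `sqlite3` and `json` modules. The table name, column names, and token fields below are hypothetical and are not taken from MANDIAC; the point is only that indexed relational columns handle search, while the annotation payload is an opaque JSON string whose shape can change without a schema migration.

```python
import json
import sqlite3


def create_store(path=":memory:"):
    """Open a database with one table holding annotations as JSON text."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS annotations (
               id INTEGER PRIMARY KEY,
               document_id TEXT NOT NULL,
               annotator TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'untouched',
               data TEXT NOT NULL  -- JSON blob; structure is up to the client
           )"""
    )
    return conn


def save_annotation(conn, document_id, annotator, status, tokens):
    """Serialize the token list to JSON and insert it as one row."""
    conn.execute(
        "INSERT INTO annotations (document_id, annotator, status, data) "
        "VALUES (?, ?, ?, ?)",
        (document_id, annotator, status, json.dumps(tokens, ensure_ascii=False)),
    )
    conn.commit()


def load_annotations(conn, document_id):
    """Return (annotator, status, tokens) tuples for one document."""
    rows = conn.execute(
        "SELECT annotator, status, data FROM annotations "
        "WHERE document_id = ? ORDER BY id",
        (document_id,),
    ).fetchall()
    return [(a, s, json.loads(d)) for a, s, d in rows]
```

Adding a new per-token field, such as an "ambiguous" flag, then only requires writing it into the JSON; the table definition stays the same.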
18. Evaluation
Experimental setup:
• Around 1,500 words were extracted from the Penn Arabic Treebank.
• Five annotators were asked to fully diacritize the extracted words:
o First half of the text using a text editor.
o Second half of the text with MANDIAC:
− Use automatically diacritized candidate if possible.
− Manually edit otherwise.
19. Evaluation
• Experimental results:
o Using a text editor: 302 words/hour
o Using MANDIAC: 618 words/hour
• Using the text editor introduced typos.
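The "doubles annotation speed" claim follows directly from the two reported rates. The short calculation below checks it; the 10,000-word corpus is a hypothetical size chosen only to show what the rates would mean in annotator hours.

```python
def speedup(baseline_rate, new_rate):
    """Ratio of the new annotation rate to the baseline rate."""
    return new_rate / baseline_rate


def hours_needed(words, rate_per_hour):
    """Annotator hours required to process `words` at a given rate."""
    return words / rate_per_hour


TEXT_EDITOR_RATE = 302  # words/hour, as reported
MANDIAC_RATE = 618      # words/hour, as reported

print(round(speedup(TEXT_EDITOR_RATE, MANDIAC_RATE), 2))  # 2.05

# Hypothetical 10,000-word corpus (not from the evaluation).
print(round(hours_needed(10_000, TEXT_EDITOR_RATE), 1))   # 33.1
print(round(hours_needed(10_000, MANDIAC_RATE), 1))       # 16.2
```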
20. Acknowledgements
• This project has been funded by the Qatar National Research
Fund (grant NPRP 6-1020-1-199).
• We also thank the annotators for their feedback on MANDIAC.