MANDIAC: A Web-based
Annotation System For Manual
Arabic Diacritization
Collaborators: Houda Bouamor, Wajdi Zaghouani, Mahmoud
Ghoneim, Abdelati Hawwari, Mona Diab and Kemal Oflazer
Ossama Obeid
Carnegie Mellon University in Qatar
owo@qatar.cmu.edu
Introduction
• Arabic text is composed of consonants, long vowels, and short
vowels (diacritics).
• Absence of diacritics:
o Adds lexical and morphological ambiguity.
o Confusing to beginners.
o Impacts performance of Arabic NLP tasks.
• Very few texts are diacritized.
Introduction
Possible pronunciations and meanings of the undiacritized Arabic word ذكر.
Introduction
• Most automatic diacritization systems are trained on Arabic
Treebanks.
• Different genres and dialects need new datasets:
o Time-consuming to create.
o Must ensure data quality and consistency.
Currently Available Annotation Tools
• Very basic text-editor-like interfaces.
• Can’t handle a large number of documents and annotators.
• Not easily customizable.
MANDIAC
• Web-based.
• Intuitive and easy to use.
• Easily manages thousands of documents.
• Distributes tasks (including inter-annotator agreement (IAA)
evaluation tasks) to tens of annotators.
• Doubles annotation speed!
• Based on QAWI.
• Provides Annotation and Annotation Management interfaces.
Annotation Interface
• Token-based annotation system similar to QAWI.
• Annotators can choose pre-computed diacritizations (derived
using MADAMIRA) and/or manually edit diacritics.
• Additional features to increase annotator productivity.
Annotation Interface
Extra Features:
• Undo/Redo buttons
• Edits restricted to diacritics only
• Timer
• Counter indicating number of words left to annotate
• Link to annotation guidelines
• Token highlighting:
o Annotated words
o Tokens that should not be edited (e.g., digits, non-Arabic words, punctuation)
• Flag documents
• Mark tokens as ambiguous
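The "diacritics only" restriction and the locking of non-Arabic tokens could be enforced with a check along the following lines. This is a minimal sketch and not MANDIAC's actual code: the function names are hypothetical, and the Unicode ranges chosen for diacritics and Arabic letters are assumptions of the example.

```python
import re

# Assumed diacritic set: tanween, short vowels, shadda, sukun (U+064B-U+0652)
# plus the superscript alef (U+0670).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Assumed range for Arabic letters (U+0621-U+064A).
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")


def strip_diacritics(token: str) -> str:
    """Remove every diacritic mark, leaving only the base characters."""
    return DIACRITICS.sub("", token)


def is_diacritic_only_edit(original: str, edited: str) -> bool:
    """Accept an edit only if the base characters are unchanged."""
    return strip_diacritics(original) == strip_diacritics(edited)


def is_editable(token: str) -> bool:
    """Lock tokens without Arabic letters (digits, Latin words, punctuation)."""
    return ARABIC_LETTER.search(token) is not None
```

An interface built this way would reject an edit that adds or removes a letter, while accepting any change that only touches the marks attached to the existing letters.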
Annotation Interface
Annotation Interface at a glance
Annotation Interface
Dropdown showing top 3
automatically diacritized
candidates.
Manual token editor
Management Interface
User Management
• Add/remove users.
• Add users to annotation groups.
• Display user activity log and statistics.
Management Interface
Annotation Workflow Management:
• Upload files in various formats.
• Organize files into groups.
• Assign files to individuals or to a group (for IAA).
• Highlight tasks as untouched, edited, or completed.
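One way to picture the assignment step is the sketch below. It assumes a simple policy that the slides do not spell out: a random share of the documents (the deck mentions 10% for IAA) goes to every annotator in the group, and the rest are split round-robin. The function name, the parameters, and the policy itself are hypothetical.

```python
import random


def assign_documents(doc_ids, annotators, iaa_fraction=0.10, seed=0):
    """Return (tasks per annotator, set of documents shared for IAA)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    docs = list(doc_ids)
    rng.shuffle(docs)

    n_iaa = round(len(docs) * iaa_fraction)
    iaa_docs, regular_docs = docs[:n_iaa], docs[n_iaa:]

    tasks = {annotator: [] for annotator in annotators}
    # IAA documents are given to every annotator so agreement can be measured.
    for doc in iaa_docs:
        for annotator in annotators:
            tasks[annotator].append(doc)
    # Remaining documents are distributed one annotator at a time.
    for i, doc in enumerate(regular_docs):
        tasks[annotators[i % len(annotators)]].append(doc)
    return tasks, set(iaa_docs)
```

With 100 documents and five annotators, this policy gives each annotator the 10 shared documents plus 18 of their own.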
Management Interface
Evaluation and Monitoring:
• Evaluate IAA.
• Compare annotations to gold reference.
• Use word error rate (WER) and diacritic error rate (DER) as metrics.
• 10% of assigned documents are randomly selected for IAA.
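The slides name the two metrics without defining them. The sketch below uses a common reading, which is an assumption here: DER is the share of characters whose attached diacritics differ from the reference, and WER is the share of words containing at least one such character. The helper names are hypothetical, and both annotations are assumed to share the same tokens and base letters.

```python
# Assumed diacritic set: tanween, short vowels, shadda, sukun, superscript alef.
DIACRITICS = set("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652\u0670")


def split_marks(word):
    """Return a list of (base character, attached diacritics) pairs."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def der_wer(reference, hypothesis):
    """Compute (DER, WER) for two whitespace-tokenized annotations."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words or len(ref_words) != len(hyp_words):
        raise ValueError("annotations must be non-empty with equal token counts")

    char_total = char_errors = word_errors = 0
    for ref, hyp in zip(ref_words, hyp_words):
        ref_pairs, hyp_pairs = split_marks(ref), split_marks(hyp)
        if [b for b, _ in ref_pairs] != [b for b, _ in hyp_pairs]:
            raise ValueError(f"base letters differ: {ref!r} vs {hyp!r}")
        # Sort the marks so that shadda+vowel order does not count as an error.
        errors = sum(
            sorted(r) != sorted(h)
            for (_, r), (_, h) in zip(ref_pairs, hyp_pairs)
        )
        char_total += len(ref_pairs)
        char_errors += errors
        word_errors += errors > 0
    return char_errors / char_total, word_errors / len(ref_words)
```

The same function can compare two annotators on a shared document (for IAA) or one annotator against a gold reference. Published evaluations often add variants, such as ignoring word-final case endings, which this sketch does not cover.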
Management Interface
User management view
Management Interface
Task assignment popup
Task list view
System Design and Architecture
• Four main components:
o Annotation interface
o Management interface
o Back-end server
o MADAMIRA
Component interaction diagram
System Design and Architecture
Data storage:
• Relational database (SQL):
o Fast data search and retrieval.
o Almost any SQL database can be used.
• Annotation data stored as JSON blobs:
o Flexible data format.
o Quickly add new functionality and annotation modes with little back-end
modification.
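The storage idea can be illustrated with a small sketch: relational columns hold the fields that are searched, and the annotation itself is serialized as JSON into a single column. The table layout, column names, and JSON fields below are assumptions made for the example, not MANDIAC's actual schema. SQLite is used only because it ships with Python.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE annotations (
           id INTEGER PRIMARY KEY,
           document_id INTEGER NOT NULL,
           annotator TEXT NOT NULL,
           status TEXT NOT NULL,   -- e.g. untouched, edited, completed
           data TEXT NOT NULL      -- annotation stored as a JSON blob
       )"""
)


def save_annotation(document_id, annotator, status, tokens):
    """Serialize the token list to JSON and store it next to the indexed fields."""
    blob = json.dumps({"tokens": tokens}, ensure_ascii=False)
    conn.execute(
        "INSERT INTO annotations (document_id, annotator, status, data) "
        "VALUES (?, ?, ?, ?)",
        (document_id, annotator, status, blob),
    )


def load_annotations(document_id):
    """Return {annotator: decoded annotation} for one document."""
    rows = conn.execute(
        "SELECT annotator, data FROM annotations WHERE document_id = ?",
        (document_id,),
    ).fetchall()
    return {annotator: json.loads(data) for annotator, data in rows}


# One token with its raw form, chosen diacritization, and an ambiguity flag.
save_annotation(
    1,
    "annotator_a",
    "completed",
    [{"raw": "\u0630\u0643\u0631",
      "diacritized": "\u0630\u064E\u0643\u064E\u0631\u064E",
      "ambiguous": False}],
)
```

In a layout like this, adding a new per-token field (for example a flag or a comment) changes only the JSON payload, so the table definition and the queries stay the same.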
Evaluation
Experimental setup:
• Around 1,500 words were extracted from the Penn Arabic Treebank.
• Five annotators were asked to fully diacritize the extracted words:
o First half of the text using a text editor.
o Second half of the text with MANDIAC:
− Use automatically diacritized candidate if possible.
− Manually edit otherwise.
Evaluation
• Experimental results:
o Using a text editor: 302 words/hour
o Using MANDIAC: 618 words/hour
• Using the text editor introduced typos.
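The reported rates can be turned into a speedup factor and into time per 1,000 words. The 1,000-word figure is only an illustrative unit, not a quantity from the experiment.

```latex
\[
\text{speedup} = \frac{618\ \text{words/hour}}{302\ \text{words/hour}} \approx 2.05
\]
\[
t_{\text{editor}} = \frac{1000}{302} \approx 3.31\ \text{hours},
\qquad
t_{\text{MANDIAC}} = \frac{1000}{618} \approx 1.62\ \text{hours}
\]
```

This is consistent with the "doubles annotation speed" claim made earlier in the deck.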
Acknowledgements
• This project has been funded by the Qatar National Research
Fund (grant NPRP 6-1020-1-199).
• We also thank the annotators for their feedback on MANDIAC.
Thank You!
