Natural Language Processing for Irish
1. Natural Language Processing for Irish
Teresa Lynn, PhD
Research Fellow
ADAPT Centre
Dublin City University
The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
www.adaptcentre.ie
2. Outline
o Overview of Natural Language Processing (NLP)
o English - Irish machine translation
o NLP for User-generated Content
o Importance of technology for Minority Languages
4. What is Natural Language Processing?
“Using computers to analyse, derive meaning and understand text”
o An ‘attempt’ to understand how humans speak and use language
Why do computers need to understand language?
o Text summarisation
o Sentiment analysis
o Topic extraction (Information Retrieval)
o Grammar Checking
o Text Mining (Big Data problem)
o Machine Translation
o Question-Answering Systems
5. Challenges of processing language
• Human languages are:
– Elegant
– Efficient
– Flexible
– Complex
• One word/sentence may mean many things
• Many ways of saying the same thing
• Meaning depends on context
• Literal and figurative language (metaphor)
• Language and culture (different ways of conceptualising the same thing)
6. Ambiguous Headlines
Syntactic ambiguity:
EYE DROPS OFF SHELF
SQUAD HELPS DOG BITE VICTIM
ENRAGED COW INJURES FARMER WITH AXE
STOLEN PAINTING FOUND BY TREE
Semantic ambiguity:
PANDA MATING FAILS; VETERINARIAN TAKES OVER
SAFETY EXPERTS SAY SCHOOL BUS PASSENGERS SHOULD BE BELTED
POLICE BEGIN CAMPAIGN TO RUN DOWN JAYWALKERS
Source: http://www.alta.asn.au/events/altss_w2003_proc/altss/courses/somers/headlines.htm
9. What does a machine know about language?
Sentence = a string/sequence of characters:
“The man saw the boy with the telescope”
Who is doing what? Who has the telescope?
11. Syntactic Parsing 101
Who is doing what? Who has the telescope? = Parsing
“The man saw the boy with the telescope”
DET NOUN VERB DET NOUN PREP DET NOUN
Part-of-speech Tagging
14. Traditional Approach – Rules
‘I like ice-cream in summer’
‘I like summer in ice-cream’ … ??
Syntactic Parsing Rules:
S → NP VP
S → NP VP PP
NP → Noun | Pronoun
VP → Verb NP | Verb PP
PP → Preposition Noun
Noun → ‘ice-cream’ | ‘summer’
Pronoun → ‘I’
Verb → ‘like’
Preposition → ‘in’
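The rules above can be tried out with a small recogniser. The following is a minimal sketch, not the system used in this work: the names GRAMMAR, LEXICON, spans and accepts are illustrative assumptions, and tokenisation is a plain whitespace split. It suggests why purely syntactic rules are not enough: both example sentences are accepted, even though only one of them makes sense.

```python
# Rules from the slide, written as symbol -> list of alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"], ["NP", "VP", "PP"]],
    "NP": [["Noun"], ["Pronoun"]],
    "VP": [["Verb", "NP"], ["Verb", "PP"]],
    "PP": [["Preposition", "Noun"]],
}

# Word-level rules from the slide.
LEXICON = {
    "Noun": {"ice-cream", "summer"},
    "Pronoun": {"I"},
    "Verb": {"like"},
    "Preposition": {"in"},
}


def spans(symbol, tokens, start):
    """Return every end index such that `symbol` derives tokens[start:end]."""
    if symbol in LEXICON:
        if start < len(tokens) and tokens[start] in LEXICON[symbol]:
            return {start + 1}
        return set()
    ends = set()
    for rhs in GRAMMAR[symbol]:
        positions = {start}
        # Match each child of the rule in turn, tracking all reachable positions.
        for child in rhs:
            positions = {e for p in positions for e in spans(child, tokens, p)}
        ends |= positions
    return ends


def accepts(sentence):
    """True if the whole sentence can be derived from S."""
    tokens = sentence.split()
    return len(tokens) in spans("S", tokens, 0)


print(accepts("I like ice-cream in summer"))  # grammatical and sensible
print(accepts("I like summer in ice-cream"))  # grammatical, but nonsense
print(accepts("like I summer"))               # not derivable from S
```

The grammar has no notion of meaning, so swapping the two nouns makes no difference to the result.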
15. Machine Learning – data driven approaches
Supervised Machine Learning requires a LOT of:
• structured data
• annotated data
• reliable data
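To make the data requirement concrete, here is a minimal sketch of the simplest supervised tagger: it memorises the most frequent tag seen for each word in annotated text. This is an illustrative baseline only, not the method used in this work. TOY_CORPUS reuses the tagged telescope sentence from the parsing slide, and the names train, tag and the "UNK" label are assumptions made for the example.

```python
from collections import Counter, defaultdict

# One annotated sentence, using the tags shown on the parsing slide.
TOY_CORPUS = [
    [("The", "DET"), ("man", "NOUN"), ("saw", "VERB"), ("the", "DET"),
     ("boy", "NOUN"), ("with", "PREP"), ("the", "DET"), ("telescope", "NOUN")],
]


def train(tagged_sentences):
    """Learn the most frequent tag for each lower-cased word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag_label in sentence:
            counts[word.lower()][tag_label] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, words, unknown="UNK"):
    """Tag each word; words never seen in training get the `unknown` label."""
    return [(w, model.get(w.lower(), unknown)) for w in words]


model = train(TOY_CORPUS)
print(tag(model, "the boy saw the dog".split()))
```

With a single training sentence, any unseen word such as "dog" comes back as "UNK", which is one reason a large amount of reliable annotated data is needed.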
19. Irish – Long-distance dependencies
Word order: VSO (verb-subject-object)
English: ‘The boy who was looking through the telescope yesterday on the street saw the man’
Irish: Chonaic an buachaill a bhí ag féachaint tríd an teileascóp inné ar an tsráid an fear sin
Literal translation: [Saw]v [the boy who was looking through the telescope yesterday on the street]subj [the man]obj
23. Outline
o Overview of Natural Language Processing (NLP)
o English - Irish machine translation
o NLP for User-generated Content
o Importance of technology for Minority Languages
24. User-Generated Content & NLP
Where do we find UGC?
Blogs
Social Media sites
Micro-blogs (Twitter)
Informal Emails
What is difficult about UGC for NLP?
Unstructured Text
Ungrammatical
Text Speak
Difficult to predict
Various symbols (e.g. Emojis, Hashtags)
26. My Work – Minority Language Twitter
Code-switching
Diacritics
Verb drop
Spacing issue
Phonetic spelling
Abbreviations
grma -> go raibh maith agat
t7ain -> tseachtain
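One simple way to handle abbreviations like these is a lookup table applied before any further processing. The sketch below is an assumption for illustration, not the normalisation step used in this work; the table holds only the two examples from the slide, and the names ABBREVIATIONS and expand are hypothetical.

```python
# The two abbreviations shown on the slide; a real table would be much larger.
ABBREVIATIONS = {
    "grma": "go raibh maith agat",
    "t7ain": "tseachtain",
}


def expand(text):
    """Replace known abbreviations token by token; leave everything else as is."""
    return " ".join(ABBREVIATIONS.get(token.lower(), token) for token in text.split())


print(expand("grma @NiallSF"))
```

A whitespace split is the crudest possible tokeniser; tweets with punctuation attached to the abbreviation would need something more careful.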
27. My Work – Minority Language Twitter
Goals:
o Build a corpus of POS-tagged Irish tweets
o Train a statistical POS tagger for Irish tweets
o Assess how we can leverage existing resources
o Examine the impact of noisy user-generated text on existing resources
31. Mapped POS tags
"<RT>"
"RT" ~
"<@NiallSF>"
"@NiallSF" @
"<:>"
":" ~
"<Sásta>"
"sásta" A
"<go>"
"go" T
"<raibh>"
"bí" V
"<sé>"
"sé" O
"<suaimhneach>"
"suaimhneach" A
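The listing above alternates between a surface token in angle brackets and a line giving its lemma and tag. The sketch below reads that layout into (token, lemma, tag) triples. It assumes exactly one lemma line per token, as in this example; the regular expressions and the names COHORTS and parse are illustrative, not part of any published tool.

```python
import re

# The tagged tweet from the slide, copied verbatim.
COHORTS = '''"<RT>"
"RT" ~
"<@NiallSF>"
"@NiallSF" @
"<:>"
":" ~
"<Sásta>"
"sásta" A
"<go>"
"go" T
"<raibh>"
"bí" V
"<sé>"
"sé" O
"<suaimhneach>"
"suaimhneach" A
'''

SURFACE = re.compile(r'^"<(.+)>"$')        # e.g. "<raibh>"
READING = re.compile(r'^"(.+)"\s+(\S+)$')  # e.g. "bí" V


def parse(text):
    """Return a list of (surface token, lemma, tag) triples."""
    triples = []
    surface = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SURFACE.match(line)
        if match:
            surface = match.group(1)
            continue
        match = READING.match(line)
        if match and surface is not None:
            triples.append((surface, match.group(1), match.group(2)))
            surface = None
    return triples


for token, lemma, pos in parse(COHORTS):
    print(token, lemma, pos)
```

Keeping the lemma alongside the tag makes it easy to see cases such as "raibh", whose lemma is "bí".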
32. Application of our work
Sociolinguistic studies
Improved automated translation of tweets
Improved sentiment analysis
Cross-lingual social media analysis
37. Conclusion
Harness technology to encourage language use:
o at school
o at home (phone technology, games)
o at work (through content creation tools, MT systems)
o online
Influence Government Policy with statistics gathered through:
o online use analysis
o demand for technology
o empirically demonstrating evolution of language
39. Language at Risk
Need to ensure continuing language usage
… through technology
o Edutainment packages
o Word processing tools
o Webpage translation
o Search engines
o Games
o Social media
o Summarise discussions
o Monitor user sentiment
o Track misuse
Source: http://www.leuphana.de/institute/ies/llt2015.html