SlideShare ist ein Scribd-Unternehmen logo
1 von 14
Downloaden Sie, um offline zu lesen
Accentuate Us!
Kevin Scannell and Michael Schade
       November 10, 2010
Language Death
About 7000 languages spoken in the world
More than 90% are expected to disappear before 2100
Every language is a repository of the culture, traditions, and
world view of its community
The death of a language is an irrevocable loss, comparable
to the extinction of a plant or animal species
Many endangered language communities are looking to the
internet and technology to help revitalize their language
Endangered Language Technology
Only about 50 languages (0.71%) have fully-localized
desktop computer systems
Firefox 3.5 available in 68 languages (0.97%)
Spellcheckers for 117 languages (1.67%)
For this talk, we're looking at something even more basic:
keyboard input
Keyboard Input
The majority of languages are oral - no written tradition
Good news: among those that have writing systems, almost
all scripts are available in Unicode (but not Maldivian,
Khamti, ...)
Yet even Unicode-encoded languages often lack
appropriate input methods, or free fonts
When electronic texts do exist, they are often entered as
plain ASCII, either by transliteration (Cherokee,           →
galvquodiyu), omitting diacritics (Lingala, likɔngá → likonga),
or ad hoc approaches (Irish, béal → be/al)
Omitted diacritics matter! Leads to ambiguities,
misunderstandings (leite vs. léite).
Diacritic Restoration
Software that takes plain ASCII text in some language as
input, and outputs the text with all diacritics or extended
characters in place
Examples
Oll skulu vera fraels at hava sinar askodanir og bera taer fram uttan fordan →
Øll skulu vera fræls at hava sínar áskoðanir og bera tær fram uttan forðan
Uwe setin suen gha gu emwa ni sike rue ghae emwi esi ne uwe rue →
Uwẹ sẹtin suẹn gha gu emwa ni sikẹ ruẹ ghae emwi esi ne uwẹ ruẹ
Ua noa i na kanaka apau ke kuokoa o ka manao a me ka hoike ana i ka manao →
Ua noa i nā kānaka apau ke kūʻokoʻa o ka manaʻo  a me ka hōʻike ʻana i ka manaʻo
Tout moun gen dwa a libete lide yo ak lapawol yo →
Tout moun gen dwa a libète lide yo ak lapawòl yo
Moi nguoi deu co quyen tu do ngon luan va bay to quan diem →
Mọi người đều có quyền tự do ngôn luận và bầy tỏ quan điểm
Eni kookan lo ni eto si omi nira lati ni imoran ti o wu u, ki o si so iru imoran bee jade→
Ẹnì kọ̀ọ̀kan ló ní ẹ̀tọ́ sí òmì nira láti ní ìmọ̀ràn tí ó wù ú, kí ó sì sọ irú ìmọ̀ràn bẹ́ẹ̀ jáde
Statistical Machine Learning
Given an ASCII input, every character that allows a diacritic
or an extended form represents a "classification problem"
We use a machine learning approach; the program learns
where the diacritics belong by gathering statistics from a
"training corpus" of texts with the diacritics in place
Remembers words seen in training data; statistics on co-
occurring words to deal with ambiguous cases (Irish "leite"
vs. "léite" or even English "resume" vs. "résumé")
For never-before seen words, uses statistics of 3-character
sequences in a neighborhood of the character in question
(French initial "cera" vs. "cerc", "cabl" vs. "cabo"). This is
the generic case for under-resourced languages
Training texts crawled from the web; 114 in all!
API
Protocol: JSON
Calls
   langs
   lift
   feedback
Sample Call
   { "call": "charlifter.lift"
     , "lang": "ht"
     , "text": "Bon, la fe sa apre demen pito, le la we mwen
   andey."
     , "locale": "ht"
   }
Full documentation at http://accentuate.us/api
Service Architecture
Geographically Distributed
API Servers




Load-Balancing Proxy




Clients
HTTP Communication (Proxy)
Cache-Control: no-cache
Connection: keep-alive
Pragma: no-cache
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Accept-Encoding: gzip,deflate
Accept-Language: en-us,en;q=0.5
Host: ht.api.accentuate.us:8080
User-Agent: Accentuate.us/0.9b3 Mozilla/5.0 (Windows; U; Windows NT
6.1; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.1
Content-Length: 113
Content-Type: application/json; charset=utf-8
Keep-Alive: 115

{"call":"charlifter.lift","lang":"ht","text":"Bon, la fe sa apre demen pito, le la
we mwen andey.","locale":"ht"}
HTTP Communication (API)
Cache-Control: no-cache
Connection: close
Pragma: no-cache
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Accept-Encoding: gzip,deflate
Accept-Language: en-us,en;q=0.5
Host: ht
User-Agent: Accentuate.us/distribution
Content-Length: 113
Content-Type: application/json; charset=utf-8

{"call":"charlifter.lift","lang":"ht","text":"Bon, la fe sa apre demen pito, le la
we mwen andey.","locale":"ht"}
Clients
Perl
$ ./sf-client.pl -r -l ga -i "Is i an Ghaeilge an chead teanga oifigiuil."
Is í an Ghaeilge an chéad teanga oifigiúil.
Python
Vim
OS X Service
OpenOffice.org
Mozilla Firefox
UI and Localization Decisions
Implement API calls
   Langs
        Silent
   Feedback
        Opt-in
        Improve language models
   Lift
         Complicated!
Looking ahead
Demos
Thank You!

Weitere ähnliche Inhalte

Andere mochten auch

Музей интересных коллекций
Музей интересных коллекцийМузей интересных коллекций
Музей интересных коллекцийmguseva1
 
творческий отчёт 2008
творческий отчёт 2008творческий отчёт 2008
творческий отчёт 2008mguseva1
 
FRT Report 2016 Published-The Pulse of Technology
FRT Report 2016 Published-The Pulse of TechnologyFRT Report 2016 Published-The Pulse of Technology
FRT Report 2016 Published-The Pulse of TechnologyPeter Zehren, XMPA (LION)
 

Andere mochten auch (7)

Presentation1
Presentation1Presentation1
Presentation1
 
feasib
feasibfeasib
feasib
 
Музей интересных коллекций
Музей интересных коллекцийМузей интересных коллекций
Музей интересных коллекций
 
творческий отчёт 2008
творческий отчёт 2008творческий отчёт 2008
творческий отчёт 2008
 
Peter zehren ~ ftr 2015.power in palm
Peter zehren ~ ftr 2015.power in palmPeter zehren ~ ftr 2015.power in palm
Peter zehren ~ ftr 2015.power in palm
 
SMART Response PGO
SMART Response PGOSMART Response PGO
SMART Response PGO
 
FRT Report 2016 Published-The Pulse of Technology
FRT Report 2016 Published-The Pulse of TechnologyFRT Report 2016 Published-The Pulse of Technology
FRT Report 2016 Published-The Pulse of Technology
 

Ähnlich wie Accentuate Us!

The Latest Advances in Patent Machine Translation
The Latest Advances in Patent Machine TranslationThe Latest Advances in Patent Machine Translation
The Latest Advances in Patent Machine TranslationIconic Translation Machines
 
"Machine Translation 101" and the Challenge of Patents
"Machine Translation 101" and the Challenge of Patents"Machine Translation 101" and the Challenge of Patents
"Machine Translation 101" and the Challenge of PatentsIconic Translation Machines
 
Intro computer fundamentals
Intro computer fundamentalsIntro computer fundamentals
Intro computer fundamentalsPrabhu Govind
 
Introduction to Google's Go programming language
Introduction to Google's Go programming languageIntroduction to Google's Go programming language
Introduction to Google's Go programming languageMario Castro Contreras
 
Wreck a nice beach: adventures in speech recognition
Wreck a nice beach: adventures in speech recognitionWreck a nice beach: adventures in speech recognition
Wreck a nice beach: adventures in speech recognitionStephen Marquard
 
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Facultad de Informática UCM
 
RestMS Introduction
RestMS IntroductionRestMS Introduction
RestMS Introductionpieterh
 
introduction for computers
introduction for computersintroduction for computers
introduction for computersYogesh Chaure
 
Ir 1 lec 7
Ir 1 lec 7Ir 1 lec 7
Ir 1 lec 7alaa223
 
Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...Iconic Translation Machines
 
Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...John Tinsley
 
Александр Ломов: "Reactjs + Haskell + Cloud Foundry = Love"
Александр Ломов: "Reactjs + Haskell + Cloud Foundry = Love"Александр Ломов: "Reactjs + Haskell + Cloud Foundry = Love"
Александр Ломов: "Reactjs + Haskell + Cloud Foundry = Love"Olga Lavrentieva
 
Programming Under Linux In Python
Programming Under Linux In PythonProgramming Under Linux In Python
Programming Under Linux In PythonMarwan Osman
 
Introduction to Speech Interfaces for Web Applications
Introduction to Speech Interfaces for Web ApplicationsIntroduction to Speech Interfaces for Web Applications
Introduction to Speech Interfaces for Web ApplicationsKevin Hakanson
 
Virtual eye vision with HoloLens
Virtual eye vision with HoloLensVirtual eye vision with HoloLens
Virtual eye vision with HoloLensStefano Tempesta
 
NLP_guest_lecture.pdf
NLP_guest_lecture.pdfNLP_guest_lecture.pdf
NLP_guest_lecture.pdfSoha82
 

Ähnlich wie Accentuate Us! (20)

The Latest Advances in Patent Machine Translation
The Latest Advances in Patent Machine TranslationThe Latest Advances in Patent Machine Translation
The Latest Advances in Patent Machine Translation
 
"Machine Translation 101" and the Challenge of Patents
"Machine Translation 101" and the Challenge of Patents"Machine Translation 101" and the Challenge of Patents
"Machine Translation 101" and the Challenge of Patents
 
Intro computer fundamentals
Intro computer fundamentalsIntro computer fundamentals
Intro computer fundamentals
 
Introduction to Google's Go programming language
Introduction to Google's Go programming languageIntroduction to Google's Go programming language
Introduction to Google's Go programming language
 
Wreck a nice beach: adventures in speech recognition
Wreck a nice beach: adventures in speech recognitionWreck a nice beach: adventures in speech recognition
Wreck a nice beach: adventures in speech recognition
 
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
 
Swift vs. Language X
Swift vs. Language XSwift vs. Language X
Swift vs. Language X
 
RestMS Introduction
RestMS IntroductionRestMS Introduction
RestMS Introduction
 
introduction for computers
introduction for computersintroduction for computers
introduction for computers
 
Ir 1 lec 7
Ir 1 lec 7Ir 1 lec 7
Ir 1 lec 7
 
Bringing UX to the Backend
Bringing UX to the BackendBringing UX to the Backend
Bringing UX to the Backend
 
Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...
 
Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...Past, Present, and Future: Machine Translation & Natural Language Processing ...
Past, Present, and Future: Machine Translation & Natural Language Processing ...
 
Александр Ломов: "Reactjs + Haskell + Cloud Foundry = Love"
Александр Ломов: "Reactjs + Haskell + Cloud Foundry = Love"Александр Ломов: "Reactjs + Haskell + Cloud Foundry = Love"
Александр Ломов: "Reactjs + Haskell + Cloud Foundry = Love"
 
Programming Under Linux In Python
Programming Under Linux In PythonProgramming Under Linux In Python
Programming Under Linux In Python
 
Practical Kerberos
Practical KerberosPractical Kerberos
Practical Kerberos
 
Introduction to Speech Interfaces for Web Applications
Introduction to Speech Interfaces for Web ApplicationsIntroduction to Speech Interfaces for Web Applications
Introduction to Speech Interfaces for Web Applications
 
Virtual eye vision with HoloLens
Virtual eye vision with HoloLensVirtual eye vision with HoloLens
Virtual eye vision with HoloLens
 
NLP_guest_lecture.pdf
NLP_guest_lecture.pdfNLP_guest_lecture.pdf
NLP_guest_lecture.pdf
 
Intro compute
Intro computeIntro compute
Intro compute
 

Kürzlich hochgeladen

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Kürzlich hochgeladen (20)

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Accentuate Us!

  • 1. Accentuate Us! Kevin Scannell and Michael Schade November 10, 2010
  • 2. Language Death About 7000 languages spoken in the world More than 90% are expected to disappear before 2100 Every language is a repository of the culture, traditions, and world view of its community The death of a language is an irrevocable loss, comparable to the extinction of a plant or animal species Many endangered language communities are looking to the internet and technology to help revitalize their language
  • 3. Endangered Language Technology Only about 50 languages (0.71%) have fully-localized desktop computer systems Firefox 3.5 available in 68 languages (0.97%) Spellcheckers for 117 languages (1.67%) For this talk, we're looking at something even more basic: keyboard input
  • 4. Keyboard Input The majority of languages are oral - no written tradition Good news: among those that have writing systems, almost all scripts are available in Unicode (but not Maldivian, Khamti, ...) Yet even Unicode-encoded languages often lack appropriate input methods, or free fonts When electronic texts do exist, they are often entered as plain ASCII, either by transliteration (Cherokee, → galvquodiyu), omitting diacritics (Lingala, likɔngá → likonga), or ad hoc approaches (Irish, béal → be/al) Omitted diacritics matter! Leads to ambiguities, misunderstandings (leite vs. léite).
  • 5. Diacritic Restoration Software that takes plain ASCII text in some language as input, and outputs the text with all diacritics or extended characters in place Examples Oll skulu vera fraels at hava sinar askodanir og bera taer fram uttan fordan → Øll skulu vera fræls at hava sínar áskoðanir og bera tær fram uttan forðan Uwe setin suen gha gu emwa ni sike rue ghae emwi esi ne uwe rue → Uwẹ sẹtin suẹn gha gu emwa ni sikẹ ruẹ ghae emwi esi ne uwẹ ruẹ Ua noa i na kanaka apau ke kuokoa o ka manao a me ka hoike ana i ka manao → Ua noa i nā kānaka apau ke kūʻokoʻa o ka manaʻo  a me ka hōʻike ʻana i ka manaʻo Tout moun gen dwa a libete lide yo ak lapawol yo → Tout moun gen dwa a libète lide yo ak lapawòl yo Moi nguoi deu co quyen tu do ngon luan va bay to quan diem → Mọi người đều có quyền tự do ngôn luận và bầy tỏ quan điểm Eni kookan lo ni eto si omi nira lati ni imoran ti o wu u, ki o si so iru imoran bee jade→ Ẹnì kọ̀ọ̀kan ló ní ẹ̀tọ́ sí òmì nira láti ní ìmọ̀ràn tí ó wù ú, kí ó sì sọ irú ìmọ̀ràn bẹ́ẹ̀ jáde
  • 6. Statistical Machine Learning Given an ASCII input, every character that allows a diacritic or an extended form represents a "classification problem" We use a machine learning approach; the program learns where the diacritics belong by gathering statistics from a "training corpus" of texts with the diacritics in place Remembers words seen in training data; statistics on co- occurring words to deal with ambiguous cases (Irish "leite" vs. "léite" or even English "resume" vs. "résumé") For never-before seen words, uses statistics of 3-character sequences in a neighborhood of the character in question (French initial "cera" vs. "cerc", "cabl" vs. "cabo"). This is the generic case for under-resourced languages Training texts crawled from the web; 114 in all!
  • 7. API Protocol: JSON Calls langs lift feedback Sample Call { "call": "charlifter.lift" , "lang": "ht" , "text": "Bon, la fe sa apre demen pito, le la we mwen andey." , "locale": "ht" } Full documentation at http://accentuate.us/api
  • 8. Service Architecture Geographically Distributed API Servers Load-Balancing Proxy Clients
  • 9. HTTP Communication (Proxy) Cache-Control: no-cache Connection: keep-alive Pragma: no-cache Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7 Accept-Encoding: gzip,deflate Accept-Language: en-us,en;q=0.5 Host: ht.api.accentuate.us:8080 User-Agent: Accentuate.us/0.9b3 Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.1 Content-Length: 113 Content-Type: application/json; charset=utf-8 Keep-Alive: 115 {"call":"charlifter.lift","lang":"ht","text":"Bon, la fe sa apre demen pito, le la we mwen andey.","locale":"ht"}
  • 10. HTTP Communication (API) Cache-Control: no-cache Connection: close Pragma: no-cache Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7 Accept-Encoding: gzip,deflate Accept-Language: en-us,en;q=0.5 Host: ht User-Agent: Accentuate.us/distribution Content-Length: 113 Content-Type: application/json; charset=utf-8 {"call":"charlifter.lift","lang":"ht","text":"Bon, la fe sa apre demen pito, le la we mwen andey.","locale":"ht"}
  • 11. Clients Perl $ ./sf-client.pl -r -l ga -i "Is i an Ghaeilge an chead teanga oifigiuil." Is í an Ghaeilge an chéad teanga oifigiúil. Python Vim OS X Service OpenOffice.org
  • 12. Mozilla Firefox UI and Localization Decisions Implement API calls Langs Silent Feedback Opt-in Improve language models Lift Complicated! Looking ahead
  • 13. Demos