Building Blocks for Accessing Multilingual Data: CLDR
Steven R. Loomis, IBM GFTT
2. Access available handouts at ala.15.ala.org/sessions/handouts.
About Me
• Senior Software Engineer, IBM Global Foundations Technology Team
• IBM’s technical lead for the ICU4C/C++ software library, and primary voting representative to Unicode
• Member of CLDR-TC, lead of ULI-TC
Agenda
• About CLDR
• Focus Areas:
• Language Identification
• Transliteration
• Searching and Sorting
• Keyboards/Entry
• Q&A
What is CLDR?
• Common Locale Data Repository
• Language and region-specific data
• Covers hundreds of language/region pairs
• Open data (like Unicode itself), in XML/JSON format
• Community input, carefully curated
Who is CLDR?
• CLDR’s Technical Committee, the CLDR-TC, is part of the Unicode Consortium
• Active participation by industry, academia, open source projects, national standards bodies, and individuals
Who uses CLDR?
• Apple, Google, IBM, Microsoft…
• Wikimedia Foundation, jQuery, …
• Java, Node.js, PHP, …
• Many users via the ICU C/C++/Java libraries
Locale Data
• Data required for respecting the linguistic, cultural, and geopolitical requirements of specific users
• Example: "What day is it?"
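To make the "What day is it?" example concrete, here is a minimal sketch in Python. The `WEEKDAYS` table and the `what_day_is_it` function are hypothetical, hand-written stand-ins for locale data. CLDR itself supplies this kind of data for far more locales and in several widths.

```python
from datetime import date

# Hand-written stand-in for locale data (an assumption for illustration).
# Index 0 is Monday, which matches date.weekday().
WEEKDAYS = {
    "en": ["Monday", "Tuesday", "Wednesday", "Thursday",
           "Friday", "Saturday", "Sunday"],
    "es": ["lunes", "martes", "miércoles", "jueves",
           "viernes", "sábado", "domingo"],
}


def what_day_is_it(day, language):
    """Return the weekday name for `day` in the given language."""
    return WEEKDAYS[language][day.weekday()]


if __name__ == "__main__":
    today = date.today()
    for language in ("en", "es"):
        print(language, what_day_is_it(today, language))
```

The same date yields a different answer per user: 27 June 2015 is "Saturday" for `en` and "sábado" for `es`.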
XML / JSON
• XML: “es-US”
• <month type="6">Junio</month>
• JSON: “es-US”
• { … "6": "Junio", … }
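A rough sketch of reading the same month name from either format, using only the Python standard library. The fragments below are cut down to the slide's example and are not verbatim CLDR files. The `<months>` wrapper element and both function names are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Cut-down fragments modeled on the slide (assumed shape, not real CLDR files).
XML_FRAGMENT = '<months><month type="6">Junio</month></months>'
JSON_FRAGMENT = '{"6": "Junio"}'


def month_from_xml(xml_text, number):
    """Find the <month> element whose type attribute equals `number`."""
    root = ET.fromstring(xml_text)
    for month in root.iter("month"):
        if month.get("type") == str(number):
            return month.text
    return None


def month_from_json(json_text, number):
    """Look up the month name keyed by the month number as a string."""
    return json.loads(json_text).get(str(number))


if __name__ == "__main__":
    print(month_from_xml(XML_FRAGMENT, 6))
    print(month_from_json(JSON_FRAGMENT, 6))
```

Both lookups return the same string, which is the point of the slide: the two formats carry the same data.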
CLDR Coverage
• Coverage vs. number of languages
CLDR site and SurveyTool (DEMO)
• DEMO:
• http://unicode.org/cldr
• http://st.unicode.org/cldr-apps
Locale Identifiers — BCP47
• Example: sr-Latn-RS
• sr : ISO 639 "Serbian"
• Latn : ISO 15924 "Latin Script" (vs Cyrillic)
• RS : ISO 3166 / UN M.49 "Serbia"
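The three-part breakdown above can be sketched as a small parser. This is a deliberately partial sketch: it accepts only the language, optional script, and optional region subtags shown on the slide, and rejects variants, extensions, and the rest of BCP 47. The names `LocaleId` and `parse_tag` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class LocaleId(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


# Only language[-script][-region]; full BCP 47 allows much more.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_tag(tag):
    """Split a simple tag such as 'sr-Latn-RS' into its subtags."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError("unsupported or malformed tag: %r" % tag)
    script = match.group("script")
    region = match.group("region")
    return LocaleId(
        language=match.group("language").lower(),   # conventional casing: sr
        script=script.title() if script else None,  # conventional casing: Latn
        region=region.upper() if region else None,  # conventional casing: RS
    )


if __name__ == "__main__":
    print(parse_tag("sr-Latn-RS"))
```

The script and region are optional, so `sr`, `sr-RS`, and `sr-Latn` all parse, and a three-digit UN M.49 region such as `es-419` is accepted in the region slot.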
Language/Territory/Script info
Facts:
• “The Cyrillic script can be used to write Mongolian, Russian, Serbian…”
• “Italian is spoken in Italy, San Marino, Switzerland…”
Language Identification: Exemplars
English (Latin): a b c d e f g h i j k l m n o p q r s t u v w x y z
Serbian (Latin): a b c ć č d đ dž e f g h i j k l lj m n nj o p r s š t u v z ž
Serbian (Cyrillic): а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш
Russian (Cyrillic): а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я
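One way the exemplar sets above could drive language identification is to keep every candidate whose exemplar set covers all the letters in a piece of text. The sketch below assumes that simple rule; it is not how CLDR or ICU prescribe it. The `EXEMPLARS` table is copied from the slide, and its keys and the `candidates` function are names chosen for this sketch.

```python
import unicodedata

# Exemplar letters copied from the slide. The Serbian Latin digraphs
# (dž, lj, nj) are covered by their component letters here.
EXEMPLARS = {
    "en": set("abcdefghijklmnopqrstuvwxyz"),
    "sr-Latn": set("abcćčdđefghijklmnoprsštuvzž"),
    "sr-Cyrl": set("абвгдђежзијклљмнњопрстћуфхцчџш"),
    "ru": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
}
# Normalize the table itself so precomposed letters compare reliably.
EXEMPLARS = {
    tag: set(unicodedata.normalize("NFC", "".join(sorted(letters))))
    for tag, letters in EXEMPLARS.items()
}


def candidates(text):
    """Return the tags whose exemplar set contains every letter of `text`."""
    normalized = unicodedata.normalize("NFC", text).lower()
    letters = {ch for ch in normalized if ch.isalpha()}
    if not letters:
        return []
    return sorted(tag for tag, exemplar in EXEMPLARS.items()
                  if letters <= exemplar)


if __name__ == "__main__":
    for sample in ("quick", "čaša", "Љубав", "щётка", "Београд"):
        print(sample, candidates(sample))
```

Short text can stay ambiguous. "Београд" uses only letters shared by the Serbian and Russian Cyrillic sets, so both tags come back, while a single distinctive letter such as "љ" or "щ" narrows the result to one.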
Transliteration
• Existing data for rule sets.
• ALA-LC format could be included.
• Rule-based engine.
Transliteration Rule Example: Greek
• <tRule>Σ ↔ S ;</tRule>
• <tRule>τ ↔ t ;</tRule>
• <tRule>Τ ↔ T ;</tRule>
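To show what the "↔" in these rules implies, here is a toy sketch that reads the three rules above and applies them in either direction. It handles only single-character, context-free rules. The real rule syntax supports far more (contexts, variables, filters), and engines such as ICU are what actually execute it. `compile_rules` and `transliterate` are hypothetical names.

```python
ARROW = "\N{LEFT RIGHT ARROW}"  # the two-way arrow used in the slide's rules

# The slide's three rules, spelled with character names so the Greek
# letters cannot be confused with look-alike Latin ones.
RULES = [
    "\N{GREEK CAPITAL LETTER SIGMA} " + ARROW + " S ;",
    "\N{GREEK SMALL LETTER TAU} " + ARROW + " t ;",
    "\N{GREEK CAPITAL LETTER TAU} " + ARROW + " T ;",
]


def compile_rules(rules):
    """Turn 'x <-> y ;' style rules into forward and reverse lookup tables."""
    forward, reverse = {}, {}
    for rule in rules:
        left, right = rule.strip().rstrip(";").split(ARROW)
        left, right = left.strip(), right.strip()
        forward[left] = right
        reverse[right] = left
    return forward, reverse


TO_LATIN, TO_GREEK = compile_rules(RULES)


def transliterate(text, table):
    """Replace each character found in `table`; leave the rest untouched."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    greek = "\N{GREEK CAPITAL LETTER TAU}\N{GREEK SMALL LETTER TAU}"
    print(transliterate(greek, TO_LATIN))
```

Because each rule is bidirectional, one rule set yields both the Greek-to-Latin and the Latin-to-Greek table.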
Demo: ICU transliterator demo
• http://demo.icu-project.org/icu-bin/translit
Searching and Sorting
• The Unicode Collation Algorithm (UCA) provides the base ordering
• CLDR “tailors” it per language: English vs. Danish vs. French
• German: Mueller = Müller = MUELLER
• Multiple stages and options:
• blackbird vs black-bird vs BlackBird
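The equivalences on this slide can be imitated with a toy comparison key. This is not the UCA and does not use CLDR's tailoring data; a real application would use a collation library such as ICU. The sketch only shows the idea that a "primary" comparison can ignore case, optionally ignore punctuation, and apply a language-specific expansion (here an assumed German table mapping ü to ue). All names are hypothetical.

```python
import unicodedata

# Assumed German-style expansions for this sketch only.
GERMAN_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def primary_key(text, ignore_punctuation=True):
    """Build a simplified key that ignores case and, optionally, punctuation."""
    normalized = unicodedata.normalize("NFC", text).casefold()
    out = []
    for ch in normalized:
        if ignore_punctuation and not ch.isalnum():
            continue  # skip hyphens, spaces, and other non-alphanumerics
        out.append(GERMAN_EXPANSIONS.get(ch, ch))
    return "".join(out)


def same_at_primary(a, b, ignore_punctuation=True):
    """True if the two strings compare equal under the simplified key."""
    return (primary_key(a, ignore_punctuation)
            == primary_key(b, ignore_punctuation))


if __name__ == "__main__":
    print(same_at_primary("Mueller", "Müller"))
    print(same_at_primary("blackbird", "black-bird"))
    print(same_at_primary("blackbird", "black-bird", ignore_punctuation=False))
```

The `ignore_punctuation` switch stands in for the "options" bullet: with it on, "black-bird" matches "blackbird"; with it off, the hyphen makes them differ.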
Demo: Collator
• http://demo.icu-project.org/icu-bin/collation.html
Keyboards / Entry
• Standardized identifier for keyboard tables
• Allows comparison between keyboard providers
Demo: MARC processor
CLDR data:
• Script: Armn (Armenian)
• Exemplar text matches hy “Armenian”
• Transliterate to Latin: “Hayastaneayc‘ ekeġec‘i”
• Regions where spoken: Armenia, Russia, Georgia, Syria, Lebanon, Iran, Turkey, Cyprus
Uses: CLDR, ICU4J, MARC4J
Thank You / Q&A
• srloomis@us.ibm.com
• @srl295 ( Twitter, GitHub, Freenode )
• ibm.biz/srloomis