Short Text Language Detection with Infinity-Gram
1. Short Text Language Detection
with Infinity-Gram
2012/05/14 NAIST Seminar
Nakatani Shuyo @ Cybozu Labs Inc
2. Agenda
• Language Detection
• Proposal Method
– Maximal Substring
• Corpus
• Implementation and Evaluations
• Conclusions
Short Text Language Detection with Infinity-Gram
4
(NAIST Seminar)
3. Language Detection
4. In What Language?
• Ik kan er nooit tegen als mensen me negeren.
• Aha ich seh angeblich süß aus
• Czy mógłbym zasnąć w przedmieściach Twoich myśli?
• Ah. Tak. Så skal jeg bare finde ud af *hvordan*!
• Det er ikke så digg nei å vi som har finale til helga....Skrekk og
gru! Takk :)
• tack kompis! Hade faktiskt tänkt maila dig på fb och fråga vart
du tog vägen!
• Çok doğru. En büyük hatayı yaptım.
• Încântat de cunoștință.
• Một người dân bị thương và bốn người mất tích sau khi một
ngọn núi lửa ở miền trung...
5. Hints
• Dutch if there is 'ik'
• German if there is 'ich' or a letter 'ß'
• Polish if there is 'czy' or letters 'Ł', 'ń', 'ś' or 'ź'
• Scandinavian if there is a letter 'å'
– Danish if there is 'af'; 'Tak' means 'thanks'
– Norwegian if there is 'nei'; 'Takk' means 'thanks'
– Swedish if there is 'och'; 'Tack' means 'thanks'
• Turkish if there is a letter 'ı' (dotless 'i') or 'ğ'
• Romanian if there is a letter 'ă' or 'ș' or 'ț'
– Although 'ă' is also used in Vietnamese, it is easy to distinguish them.
– Although 'ş' is also used in Turkish, it is easy to distinguish them.
• Vietnamese if there are many unreadable letters on WinXP :P
6. In What Language? (Solution)
• Ik kan er nooit tegen als mensen me negeren. Dutch
• Aha ich seh angeblich süß aus German
• Czy mógłbym zasnąć w przedmieściach Twoich myśli? Polish
• Ah. Tak. Så skal jeg bare finde ud af *hvordan*! Danish
• Det er ikke så digg nei å vi som har finale til helga....Skrekk og
gru! Takk :) Norwegian
• tack kompis! Hade faktiskt tänkt maila dig på fb och fråga vart
du tog vägen! Swedish
• Çok doğru. En büyük hatayı yaptım. Turkish
• Încântat de cunoștință. Romanian
• Một người dân bị thương và bốn người mất tích sau khi một
ngọn núi lửa ở miền trung... Vietnamese
7. What Is Language Detection?
• Detect which language the input text is written in
– Time flies like an arrow → English
– Buona sera! → Italian
• It is a prerequisite for many language processing tasks
– A language model is built for each language
– Text search, classification, extraction, translation, ...
• Sufficiently long, noiseless text can be detected
with more than 99% accuracy
[Cavnar+ 94]
– A 3-gram model is used in many methods
8. SPAM or not?
• It is necessary to know that it is written in Polish.
9. Document Categorization
with Naive Bayes Classifier
• Categorize a document $X = (X_i)$ into a category $C_k$
– A document $X$ is represented as a collection of words $X_i$ (bag-of-words)
• Words are assumed to be conditionally independent given the category
– $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$ (from the independence assumption)
– where $p(X_i \mid C_k)$ : relative frequency of word $X_i$ in category $C_k$
• Estimate the category $C_k$ that maximizes the posterior
– $p(C_k \mid X) = \frac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$
– where $p(C_k)$ : prior for the category
10. Language Detection
with Naive Bayes Classifier
• Document categorization with language
labels
– Categorize documents into 'English', 'Japanese'
and so on
• Use character n-gram as features
– "Unicode code point n-gram", strictly speaking
– Assume the character encoding of the document is
already known
• Most applications know the encoding of their internal text data
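The two slides above can be sketched as a tiny character n-gram Naive Bayes detector. This is only an illustration: the class name, the add-one smoothing and the toy training sentences are assumptions made for the example, not the langdetect implementation.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_max=3):
    """Return all character 1..n_max-grams of the text, padded with spaces."""
    padded = " " + text + " "
    return [padded[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageDetector:
    """Toy Naive Bayes classifier over character n-grams (illustrative only)."""

    def __init__(self):
        self.ngram_counts = defaultdict(Counter)  # language -> n-gram counts
        self.doc_counts = Counter()               # language -> number of documents
        self.vocabulary = set()

    def train(self, text, lang):
        grams = char_ngrams(text)
        self.ngram_counts[lang].update(grams)
        self.doc_counts[lang] += 1
        self.vocabulary.update(grams)

    def detect(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_lang, best_score = None, -math.inf
        for lang, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # log p(C_k) + sum_i log p(X_i | C_k), with add-one smoothing (an assumption)
            score = math.log(self.doc_counts[lang] / total_docs)
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


detector = NaiveBayesLanguageDetector()
detector.train("this is the thing that they think", "en")
detector.train("ich sehe das nicht und zwar zusammen", "de")
print(detector.detect("the thing"), detector.detect("ich nicht"))
```

Working in log space avoids underflow when many small probabilities are multiplied.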
11. Why Use n-Gram to Detect Language
• Each language has its own characteristic characters and spelling rules
– “é” is often used in Spanish, Italian and so on, but not in English
in principle
– There are many words which start with “Z” in German, but not
in English
– There are many words which start with “C” in English, but not in
German
– Spelling “Th” is often used in English, but not in the other
languages
n-grams of "This" (□ marks a word boundary)
1-gram: T h i s
2-gram: □T Th hi is s□
3-gram: □Th Thi his is□
Language □C □L □Z Th
English 0.75 0.47 0.02 0.74
German 0.10 0.37 0.53 0.03
French 0.38 0.69 0.01 0.01
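A minimal sketch of how the n-grams of "This" in the figure can be produced. The function name and the padding rule (one boundary mark on each side for n ≥ 2, none for 1-grams) are assumptions read off the figure, not taken from any library.

```python
def ngrams(word, n, boundary="□"):
    """Character n-grams of a word; for n >= 2 the word is padded with a boundary mark."""
    padded = word if n == 1 else boundary + word + boundary
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


for n in (1, 2, 3):
    print(n, ngrams("This", n))
```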
12. language-detection(langdetect)
(Nakatani 2010)
• Language detection library for Java
– http://code.google.com/p/language-detection/
– Apache License 2.0
– Character 3-gram + Bayesian filter
– Various normalizations + Feature sampling
• Over 99% precision for 53 languages
– Trained on Wikipedia abstracts
– Wide language support, including Asian languages
– Adopted by Apache Solr
13. Evaluation with News Text
Language size accuracy Language size accuracy
af Afrikaans 200 199 (99.50%) mr Marathi 200 200 (100.00%)
ar Arabic 200 200 (100.00%) ne Nepali 200 200 (100.00%)
bg Bulgarian 200 200 (100.00%) nl Dutch 200 200 (100.00%)
bn Bengali 200 200 (100.00%) no Norwegian 200 199 (99.50%)
cs Czech 200 200 (100.00%) pa Punjabi 200 200 (100.00%)
da Danish 200 179 (89.50%) pl Polish 200 200 (100.00%)
de German 200 200 (100.00%) pt Portuguese 200 200 (100.00%)
el Greek 200 200 (100.00%) ro Romanian 200 200 (100.00%)
en English 200 200 (100.00%) ru Russian 200 200 (100.00%)
es Spanish 200 200 (100.00%) sk Slovak 200 200 (100.00%)
fa Persian 200 200 (100.00%) so Somali 200 200 (100.00%)
fi Finnish 200 200 (100.00%) sq Albanian 200 200 (100.00%)
fr French 200 200 (100.00%) sv Swedish 200 200 (100.00%)
gu Gujarati 200 200 (100.00%) sw Swahili 200 200 (100.00%)
he Hebrew 200 200 (100.00%) ta Tamil 200 200 (100.00%)
hi Hindi 200 200 (100.00%) te Telugu 200 200 (100.00%)
hr Croatian 200 200 (100.00%) th Thai 200 200 (100.00%)
hu Hungarian 200 200 (100.00%) tl Tagalog 200 200 (100.00%)
id Indonesian 200 200 (100.00%) tr Turkish 200 200 (100.00%)
it Italian 200 200 (100.00%) uk Ukrainian 200 200 (100.00%)
ja Japanese 200 200 (100.00%) ur Urdu 200 200 (100.00%)
kn Kannada 200 200 (100.00%) vi Vietnamese 200 200 (100.00%)
ko Korean 200 200 (100.00%) zh-cn Simplified Chinese 200 200 (100.00%)
mk Macedonian 200 200 (100.00%) zh-tw Traditional Chinese 200 200 (100.00%)
ml Malayalam 200 200 (100.00%) total 9800 9777 (99.77%)
• Tested on news text crawled from the web in 49 languages
14. Evaluation with Europarl Datasets
• Tested on 1000 samples for each language from the Europarl Parallel Corpus
– from the proceedings of the European Parliament
– http://www.statmt.org/europarl/
• http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip
language size correct accuracy
bg Bulgarian 1000 988 98.8%
cs Czech 1000 994 99.4%
da Danish 1000 968 96.8%
de German 1000 998 99.8%
el Greek 1000 1000 100.0%
en English 1000 996 99.6%
es Spanish 1000 996 99.6%
et Estonian 1000 996 99.6%
fi Finnish 1000 998 99.8%
fr French 1000 999 99.9%
hu Hungarian 1000 999 99.9%
it Italian 1000 999 99.9%
lt Lithuanian 1000 997 99.7%
lv Latvian 1000 999 99.9%
nl Dutch 1000 974 97.4%
pl Polish 1000 999 99.9%
pt Portuguese 1000 996 99.6%
ro Romanian 1000 999 99.9%
sk Slovak 1000 988 98.8%
sl Slovene 1000 976 97.6%
sv Swedish 1000 991 99.1%
total 21000 20850 99.3%
16. We still have an ENEMY to beat!
17. Twitter Language Detection
with the Existing Methods
• Only 90-95% accuracy for tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
– http://code.google.com/p/chromium-compact-language-detector/
– regard ms (Malay) as id (Indonesian)
• Tika = Apache Tika
– http://tika.apache.org/
– Evaluated on the 15 languages which Tika supports in our tweet corpus
language LD CLD Tika
ca Catalan 95.3 93.0 83.8
cs Czech 96.3 96.6 ----
da Danish 94.5 90.7 58.7
de German 86.6 96.8 73.1
en English 88.3 97.4 54.7
es Spanish 91.5 90.5 44.4
fi Finnish 98.9 99.4 94.8
fr French 95.0 94.5 67.4
hu Hungarian 85.8 89.0 76.2
id Indonesian 89.7 92.8 ----
it Italian 96.2 93.8 87.1
nl Dutch 69.5 93.2 65.0
no Norwegian 96.0 74.9 68.6
pl Polish 98.0 97.8 88.8
pt Portuguese 88.0 88.6 47.4
ro Romanian 92.8 96.1 82.6
sv Swedish 96.0 96.4 75.6
tr Turkish 97.6 97.4 ----
vi Vietnamese 98.7 98.9 ----
total 92.2 93.8 70.0
18. Chromium Compact Language Detection
(CLD)
• A port of the language detector from
Google Chromium
– http://code.google.com/p/chromium-compact-language-detector/
– Implementation in C++, Python binding
– # of supported languages : CLD = 76,
langdetect = 53
– Accuracy : CLD = 98.82%, langdetect =
99.22%
• for 17 languages on Europarl datasets
• http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
19. Is twitter Language Detection difficult? (1)
• A tweet is too short to yield enough 3-gram features
– At most 140 characters on twitter
– URLs, mentions and hashtags are not useful for
detection
• LIGA [Tromp+ 11]
– Graph-features based on 3-gram
• Add long distance features
• 95~98% accuracy for twitter Language Detection
• 6 languages (de, en, es, fr, it, nl)
20. Is twitter Language Detection difficult? (2)
• Tweets are too noisy
– Spellings that violate the language's orthography often
appear
– Acronyms, abbreviations, lengthened words (like 'Cooooolll')
• The likelihood of a tweet tends to be low under an ordinary
language model
OMG Oh My God
LOL Laughing Out Loud
LMAO Laughing My Ass Off
F4F Follow for Follow
MDR Mort de Rire (French)
TKT Ne t'Inquiète Pas (French)
u you
ur your
4 for
i0u I love you
k che (Italian)
anke anche (Italian)
※ The letter 'k' isn't used in Italian
21. Motivation to Detect Short Text Language
• There are many small chunks of text in addition
to twitter
– Schedule, search query, bulletin board and so on
– There are many questions about short text detection
in the Issues Board of langdetect Project
• http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for multi-language mixed text
– Cut the target document in paragraphs or lines
– Detect for each short text
22. Our Goal
• Over 99% accuracy
– However it is too difficult to detect "one
word sentence"...
– Our Goal is 99%+ accurate detection for
"sentence with more than 3 words"
23. We need
• A model that can extract rich features from
short text,
– Maximal substring model
(∞-gram Logistic Regression)
• and a twitter-specific language model,
or a corpus to construct it.
– a corpus of about 700K tweets with language
labels
24. Proposal Method
25. How to increase features from 3-grams
• The larger n is, the more features
• Maximum at n=∞, that is, all substrings
– But it has O(T^2) order
# of n-grams
n freq≧1 freq≧2 freq≧10
1 79 72 57
2 1896 1533 902
3 15970 10369 4525
4 64966 33941 10534
5 167543 69719 15538
6 323749 107861 18970
7 524634 142954 21093
8 760719 171995 22159
9 921361 193995 22696
: : : :
※ cumulative distribution of feature length for 5090 normalized English tweets (300KB)
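A table of this shape can be computed with a few lines of code. The sketch below is an assumption about how such counts could be gathered (the function name and the toy input are made up); it counts distinct character n-grams of length up to n whose frequency reaches each threshold.

```python
from collections import Counter


def cumulative_ngram_counts(texts, max_n, thresholds=(1, 2, 10)):
    """For n = 1..max_n, count distinct n-grams of length <= n with frequency >= threshold."""
    counts = Counter()
    rows = []
    for n in range(1, max_n + 1):
        for text in texts:
            counts.update(text[i:i + n] for i in range(len(text) - n + 1))
        rows.append((n, [sum(1 for c in counts.values() if c >= th) for th in thresholds]))
    return rows


for n, row in cumulative_ngram_counts(["abab", "abba"], max_n=3, thresholds=(1, 2)):
    print(n, row)
```

Because every length up to n is kept in the same counter, the number of features can only grow with n, which is the trend the table shows.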
26. Text Categorization with All Substring Features
[Okanohara+ 09]
• Multiclass Logistic Regression using all
substrings as features
– Using maximal substrings gives an equivalent
model that can be constructed in linear
time
– Features are stored in a trie for fast prediction
27. Maximal Substring (1)
• Define a containment relation (a partial order)
among non-empty substrings
Example text: abracadabra
– "ra" ⊂ "bra" ⇔ every occurrence of "ra" is
part of an occurrence of "bra"
– "a" ⊄ "ra" ⇔ "a" occurs not only in "ra"
but also in "ca"
※ Strictly, the definition also takes into account the position within the containing substring.
28. Maximal Substring (2)
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
• Each equivalence class formed by the containment
relation has a unique maximal element, which is
called a "maximal substring".
• Maximal substrings of "abracadabra" are "a", "abra"
and "abracadabra".
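The "abracadabra" example can be checked by brute force. The sketch below uses a working definition that is an assumption for illustration: a substring is maximal when every one-character extension to the left or right occurs strictly less often. It takes quadratic time and memory, so it is only a check on small strings, not the linear-time method.

```python
from collections import Counter


def maximal_substrings(text):
    """Brute-force maximal substrings: no one-character extension keeps the same frequency."""
    # Count every occurrence of every substring (overlapping occurrences included).
    counts = Counter(text[i:j]
                     for i in range(len(text))
                     for j in range(i + 1, len(text) + 1))
    alphabet = set(text)
    result = []
    for sub, freq in counts.items():
        extensions = [ch + sub for ch in alphabet] + [sub + ch for ch in alphabet]
        if all(counts.get(ext, 0) < freq for ext in extensions):
            result.append(sub)
    return sorted(result, key=lambda s: (len(s), s))


print(maximal_substrings("abracadabra"))
```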
29. Maximal Substring and Infinity-Gram
• Substrings that have a containment relationship
always have equal frequencies.
• In a model that is a linear combination of
features, features that always share the same value
can be merged into one.
• Logistic regression with maximal substrings is
equivalent to the one with infinity-grams.
※ Although the equivalence breaks down on the test set,
we assume that a sufficiently large training set makes it a good approximation.
30. Extended Suffix Array
• Extended Suffix Array consists of
– SA=Suffix Array,
– L=Longest Common Prefixes,
– B=Burrows-Wheeler's Transformed text.
• A maximal substring that occurs more than once corresponds
to an internal node of the Suffix Tree, which is equivalent to a
suffix with L>0 whose BWT has more than 1 character type.
– They can be calculated in linear time.
• esaxx : Okanohara's implementation of ESA
– http://code.google.com/p/esaxx/
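To make the three arrays concrete, here is a naive construction for a short string. It is only a sketch: it sorts suffixes directly, which is far slower than the linear-time construction mentioned above, and the function name and the wrap-around treatment of the first character in the BWT are assumptions of this example, not the esaxx API.

```python
def extended_suffix_array(text):
    """Naive suffix array, LCP array and BWT for a short text (illustration only)."""
    # SA: start positions of the suffixes in lexicographic order.
    sa = sorted(range(len(text)), key=lambda i: text[i:])

    def common_prefix(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n

    # L: longest common prefix between each suffix and the previous one in SA order.
    lcp = [0] + [common_prefix(sa[k - 1], sa[k]) for k in range(1, len(sa))]
    # B: the character preceding each suffix (wraps around for the suffix at position 0).
    bwt = [text[i - 1] for i in sa]
    return sa, lcp, bwt


sa, lcp, bwt = extended_suffix_array("banana")
print(sa, lcp, "".join(bwt))
```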
via [Okanohara+ 09]
32. Target Languages
• Limit the character type (script) to detect
– In short text detection, mixed text can be
divided by type of characters
• Latin alphabet languages
– The most difficult alphabet type to detect
– More than 25 languages have over 5
million speakers.
33. What's Latin Alphabet?
• Latin alphabet ≠ ascii alphabet
– å, ą, æ, ð, Ħ, ŋ and so on...
• They are assigned to 9 code blocks in Unicode
Range / Name / Note
U+0000-007F / Basic Latin / ascii
U+0080-00FF / Latin-1 Supplement / Most languages are covered with these.
U+0100-017F / Latin Extended-A / Most languages are covered with these.
U+0180-024F / Latin Extended-B / Romanian
U+0250-02AF / IPA Extensions /
U+0300-036F / Combining Diacritical Marks / for tone symbol composition
U+1E00-1EFF / Latin Extended Additional / Vietnamese
U+2C60-2C7F / Latin Extended-C / These are used by almost no present-day language
U+A720-A7FF / Latin Extended-D / These are used by almost no present-day language
34. Latin Alphabets
in Unicode Codepoint Chart
(Figure legend: used often / used sometimes / for Vietnamese only)
35. How to Create Corpus
• Collect tweets with the 'sample' method of the
twitter Streaming API
– Sampling 1% of all tweets (about 2
million tweets).
– Tweets in Latin alphabet languages
account for 60% of them.
• All that remains is to annotate these tweets
with language labels
36. Language Label Annotation
• Group tweets by their timezone
– French tweets account for only about 1% of all tweets
– But they account for 50% of the tweets in the Paris
timezone
• Annotate tentative labels to tweets using
langdetect
– Remove non-French tweets from the ones labeled 'fr'
– Recover French tweets from the ones not labeled 'fr'
(※ 20% of all tweets have no timezone)
37. How to annotate
Swedish, Norwegian, Danish, Vietnamese, Lithuanian,
Czech, Hungarian, Catalan, Rumanian and Polish guides in turn
38. Created Corpus
• Noiseless tweets for training data
• Noisy tweets with more than 3 words as test data
• Work with Raúl Velaz and Hiroshi Manabe for Catalan corpus creation
language training test
ca Catalan 9,089 5,082
cs Czech 9,082 7,682
da Danish 7,388 5,524
de German 44,448 10,065
en English 44,520 10,168
es Spanish 44,118 10,265
fi Finnish 8,087 7,050
fr French 44,339 10,098
hu Hungarian 10,030 4,904
id Indonesian 44,722 10,181
it Italian 43,366 10,152
nl Dutch 44,682 10,007
no Norwegian 10,124 8,496
pl Polish 16,771 10,152
pt Portuguese 44,215 10,208
ro Romanian 10,021 5,911
sv Swedish 44,054 10,032
tr Turkish 44,703 10,308
vi Vietnamese 15,030 10,488
total 538,789 166,773
39. Simple Language Detection
• Language detector can be constructed
from maximal substring model and
twitter corpus
– It still gets at most 98% accuracy.
• We guess it is necessary to reduce bias.
– data size bias
– language-specific bias
– twitter-specific bias
40. Bias by Data Size
• The number of tweets per language is heavily skewed.
• Level them out by sampling with replacement
from each language up to the size of the largest one
– In practice this is approximated by copying the data a whole
number of times and sampling the rest without replacement
(Figure: share of tweets per language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others)
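The approximation described above (whole copies plus a remainder sampled without replacement) might look like the following sketch. The function name, the dictionary layout and the fixed seed are assumptions for illustration.

```python
import random


def level_out(corpus, seed=0):
    """Oversample every language up to the size of the largest one.

    corpus maps a language code to a non-empty list of tweets.
    """
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    balanced = {}
    for lang, tweets in corpus.items():
        copies, rest = divmod(target, len(tweets))
        # Whole copies of the data, plus the remainder sampled without replacement.
        balanced[lang] = tweets * copies + rng.sample(tweets, rest)
    return balanced


toy = {"en": ["a", "b", "c", "d", "e"], "da": ["x", "y"]}
balanced = level_out(toy)
print({lang: len(tweets) for lang, tweets in balanced.items()})
```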
41. Convert to Lowercase
on Multiple Languages
• Converting to lower case saves corpus and
compresses the model.
• But the lower case of I (U+0049) in Turkish
differs from the other languages.
• Convert to lower case excluding ‘I’
Language / Upper case / Lower case
Turkish, Azerbaijani / I (U+0049) / ı (U+0131)
Turkish, Azerbaijani / İ (U+0130) / i (U+0069)
Others / I (U+0049) / i (U+0069)
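A minimal sketch of "lower case excluding 'I'", assuming Python's default `str.lower()` is acceptable for every other character. The function name is made up for this example.

```python
def lower_except_capital_i(text):
    """Lowercase every character except the ASCII capital 'I' (U+0049)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_capital_i("ISTANBUL Çok"))
```

Leaving 'I' untouched avoids deciding between 'i' and 'ı' before the language is known.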
42. Normalization for Romanian
• Romanian uses â, ă, î, ș, ț in addition to a-z
• There are 2 kinds of s/t with a “beard”
– U+015E-F, U+0162-3 : s/t with cedilla
– U+0218-B : s/t with comma below
• ‘s/t with cedilla’ is more popular on news, twitter and Wikipedia.
• The 2 code points have the same design in some fonts...
– Indistinguishable!!
ș ş ț ţ
U+0219 U+015F U+021B U+0163
43. Romanian Character Affairs on PC
• Although Romanian orthography prescribes that ‘s/t
with comma’ must be used, they were not available
on PCs until recently.
– 1989 Democratization in Romania
– 2001 ‘s/t with comma’ was provided by ISO8859-16 (Latin-10) and Unicode
– 2007 Romania joined the EU
– 2007 Windows Vista supported ‘s/t with comma’ (available for everyone!)
(Photo: ‘s/t with cedilla’ is used on an advertisement board in Bucharest)
44. Normalization
for Substitute Characters
• ‘s/t with cedilla’ are substitute characters
– But they are more popular than the correct ones
– with cedilla : with comma = 2 : 1
– “Romanian IME” outputs the substitutes too :D
• Regard ‘s/t with comma’ as ‘s/t with cedilla’
ț (U+021B) → ţ (U+0163)
I reckon it is similar to the relationship between the two glyph variants of the Japanese character ‘SA’ (さ)!!
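Mapping the comma-below letters onto the cedilla ones is a plain character translation. A minimal sketch, with a made-up function name:

```python
# s/t with comma below (U+0218-U+021B) -> s/t with cedilla (U+015E, U+015F, U+0162, U+0163)
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Replace 's/t with comma below' by 's/t with cedilla'."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("Încântat de cuno\u0219tin\u021b\u0103."))
```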
45. Arabic Character Normalization
(on language-detection)
• Arabic and Persian have a similar problem too.
• The character ‘yeh’ in Farsi corresponds to 2 code points.
– Wikipedia uses only ی (U+06CC, Farsi yeh)
– News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
– The popular Arabic charset CP-1256 has no character
mapped to U+06CC
– As ‘yeh’ is used very often in both languages, detection
fails for almost all Persian text
• Regard U+06CC as U+064A
46. Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
– a, ă, â, e, ê, i, y, o, ô, ơ, u, ư
• Vietnamese has 6 tones
– a, ả, à, ã, á, ạ
– These tone symbols are used also in
general documents like news.
• The tone symbols can be appended to
all vowels
– 12 * 6 = 72
47. Normalization for Vietnamese (2)
• Representation of vowels with
tones
1. Use U+1ea0 - U+1ef9
• ẵ = U+1eb5
2. Combine with Diacritical Marks
• ẵ = U+0103 U+0303
– Both forms appear about half and half in news and tweets
• Normalize form 2 into form 1
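One way to turn the combining form into the precomposed form is Unicode NFC normalization. This is a sketch of that idea, not necessarily what ldig does; NFC also composes other characters, which is assumed to be harmless here.

```python
import unicodedata


def normalize_vietnamese(text):
    """Compose base letter + combining tone mark into a single precomposed code point."""
    return unicodedata.normalize("NFC", text)


combined = "\u0103\u0303"  # ă followed by a combining tilde
print(normalize_vietnamese(combined) == "\u1eb5")
```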
48. CJK-Kanji Normalization (1)
(on language-detection)
• CJK-Kanji has too many characters(more than 20K)
– Other character types have only 30-50 characters.
• The character space is very sparse.
– Characters that don’t occur in the training corpus have no
probabilities.
• e.g. "谢谢", Kanji for person name
– Common frequent characters are too strong.
• e.g. a text which contains ”的” tends to be detected as Traditional
Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in
Japanese are lower than those in Chinese.
49. CJK-Kanji Normalization (2)
(on language-detection)
• Group Kanjis by frequency and normalize each group to the
representative character
– (1) K-means clustering
• Use tf-idf on Wikipedia and Google News
• K=50 (size of ascii alphabet = 52)
– (2) “Commonly Used Kanji” provided in Japanese and Chinese
• Simplified Chinese : 现代汉语常用字表(3500)
• Traditional Chinese :常用国字標準字体表(4808)
⊂ Big5 the first standard(5401)
• Japanese : 常用漢字(2136)∪ JIS the first standard(2965) = 2998
– 常用漢字 contains few Kanji for personal names and place names
• Generate 130 clusters from product of (1) and (2)
50. Normalization for twitter
• Simply remove
– URLs
– mentions
– hashtags
– RT
– emoticons written with letters, like XD, :p
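The removals listed above can be sketched with regular expressions. The patterns are assumptions chosen for illustration (in particular the emoticon pattern covers only a few shapes such as XD, :p and :D) and are not the expressions used by ldig.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
# A few letter-based emoticons only: xD, XDDD, :p, :P, :D, ;) and similar.
EMOTICON = re.compile(r"(?<!\w)(?:[xX]+D+|[:;]-?[pPdD)(])(?!\w)")


def strip_twitter_noise(text):
    """Remove URLs, mentions, hashtags, RT markers and simple emoticons."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, EMOTICON):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse the whitespace left behind


print(strip_twitter_noise("RT @xxx check http://t.co/abc #fun this is nice :p XD"))
```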
51. Normalization for
twitter-Specific Representation
• How to handle lengthened words like ‘coooooooollllll’
• Case 1: Make a normalization dictionary using [Brody+
2011]
– Unsupervised normalization like coooollll → cool
– It can’t handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times in a row,
shrink it to 2
– No language has a run of 3 or more of the same Latin
letter in its orthography.
• In Japanese, however, there are “かたたたき”, “かわいいいぬ”, “あわててて”
and so on.
• Acronyms (like WWW, СССР) are not useful for language detection
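Case 2 fits in a single regular expression. The sketch assumes that "shrink" applies to runs of three or more identical characters; the function name is made up.

```python
import re

REPEATED = re.compile(r"(.)\1{2,}")  # the same character three or more times in a row


def shrink_repeats(text):
    """Shrink any run of 3+ identical characters down to 2."""
    return REPEATED.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
```

Note that the result is "cooll", not "cool": the rule only bounds the run length and relies on the model seeing the same shrunken form at training and at detection time.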
52. Laugh Normalization
• Each language has its own variety of laughs
– HOW MUCH DO YOU LOVE COACH BEISTE???
HHAHAHAHAHAH
– Hihihihi. :) Habe ich regulär 2x die Woche!
– Tafil con eso...!!! Jajajajajajaja
– Malo?? Jejejeje XP
– kekeke chỗ đó làm áo được ko em?
• Shrink them to two repetitions
– hahahha ⇒ haha
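A rough sketch of laugh shrinking, assuming laughs are built from one of the consonants h, j or k and one repeated vowel. The pattern is an assumption for illustration and may still catch some real words.

```python
import re

# consonant(s) + vowel, followed by one or more further "consonant(s) + same vowel" units
LAUGH = re.compile(r"\b([hjk])[hjk]*([aeiou])\2*(?:[hjk]+\2+)+[hjk]*\b", re.IGNORECASE)


def shrink_laugh(text):
    """Reduce laughs such as 'hahahha' or 'kekeke' to two repetitions."""
    return LAUGH.sub(r"\1\2\1\2", text)


print(shrink_laugh("hahahha"), shrink_laugh("kekeke chỗ đó"))
```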
54. Language Detection with Infinity-Gram
(ldig)
• Tweet language detection for Latin
alphabet languages
– https://github.com/shuyo/ldig
• MIT license
• The trained model is also distributed here
– ∞-gram LR(maximal substring) [Okanohara+ 09]
– L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
– Double Array
55. Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus]
-x [maximal string extractor]
--ff=[lower limit of frequency]
– Extract features from corpus and initialize
model
– -m : model directory
– -x : path of maximal substring extractor
(execute as external process)
– --ff : ignore features whose frequency is less than the specified value
56. Maximal String Extractor
• maxsubst [input file] [output file]
– Input is multi-line text
• Replace TABs with " " and line feeds with U+0001 in it
– Output format is "[features]\t[frequency]"
57. Usage (2) Learn
• ldig.py -m [model] --learning [corpus]
-e [learning rate] -r [regularizer]
--wr=[whole regularization]
– Learn the model from the corpus with 1 cycle of SGD
– -e : learning rate of SGD
– -r : coefficient of L1 regularization
– --wr : how many times to regularize the whole set of parameters
• There are too many parameters to regularize all of them
at every step
58. Usage (3) Shrink Model
• ldig.py -m [model] --shrink
– Remove ineffective features (those whose
parameters are all 0) from the
model
59. Usage (4) Detect Language
• ldig.py -m [model] [test data]
– Detect languages of test data and output
its result and summary
60. Data Format
• Training and test data
– [correct label]\t[meta data]\t[text]
en u should just enjoy ur vacation sadly
en :D i'm online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca [status ID] [datetime] [userID] [language of UI]
@xxx xDDD no m'extranya... Tal volta haguera segut
millor per a la humanitat que no l'haguera vist... you know..
xDD
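Reading one line of this format could look like the sketch below. It assumes fields are tab-separated, the label is the first field, the text is the last field, and anything in between is metadata; the function name and the sample values are made up.

```python
def parse_line(line):
    """Split a corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\r\n").split("\t")
    return fields[0], fields[1:-1], fields[-1]


label, meta, text = parse_line("ca\t123\t2012-01-01\t42\tes\t@xxx xDDD no m'extranya...\n")
print(label, meta, text)
```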
61. Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
– Open http://localhost:[port] after it is executed
– Outputs the language probabilities, the contained
features and their parameters for a text entered
in the text area
62. Evaluation
language size detect correct precision recall LD53 LDsm
ca Catalan 5,093 4,923 4,857 98.66 95.37 95.3 97.0
cs Czech 7,681 7,668 7,663 99.93 99.77 96.3 99.7
da Danish 5,516 5,472 5,310 97.04 96.27 94.5 92.4
de German 10,060 10,069 10,006 99.37 99.46 86.6 93.8
en English 10,162 10,133 10,029 98.97 98.69 88.3 95.0
es Spanish 10,244 10,284 10,120 98.41 98.79 91.5 96.0
fi Finnish 7,051 7,038 7,024 99.80 99.62 98.9 99.6
fr French 10,074 10,134 10,051 99.18 99.77 95.0 98.1
hu Hungarian 4,904 4,892 4,858 99.30 99.06 85.8 95.5
id Indonesian 10,178 10,225 10,160 99.36 99.82 89.7 98.9
it Italian 10,143 10,205 10,103 99.00 99.61 96.2 98.0
nl Dutch 10,005 9,916 9,858 99.42 98.53 69.5 97.4
no Norwegian 8,504 8,432 8,201 97.26 96.44 96.0 96.3
pl Polish 10,151 10,149 10,130 99.81 99.79 98.0 99.7
pt Portuguese 10,212 10,201 10,119 99.20 99.09 88.0 96.9
ro Romanian 5,913 5,867 5,850 99.71 98.93 92.8 97.4
sv Swedish 10,025 10,093 9,942 98.50 99.17 96.0 97.9
tr Turkish 10,308 10,317 10,298 99.82 99.90 97.6 99.5
vi Vietnamese 10,487 10,480 10,474 99.94 99.88 98.7 99.2
total 166,711 ---- 165,053 ---- 99.01 92.2 97.4
LD53 = langdetect + standard bundled profiles, LDsm = langdetect + profiles based on twitter corpus
As a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size
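The precision and recall columns follow from the size, detect and correct columns: precision = correct / detect and recall = correct / size. A small check using the Catalan row and the totals of the table (the function name is made up):

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent."""
    return 100.0 * correct / detect, 100.0 * correct / size


precision, recall = precision_recall(size=5093, detect=4923, correct=4857)
print(round(precision, 2), round(recall, 2))

# Total row: only size and correct are given, so only recall can be computed.
total_recall = 100.0 * 165053 / 166711
print(round(total_recall, 2))
```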
63. Evaluation on the LIGA Dataset
• Evaluated using the LIGA [Tromp+ 11] dataset
with 9066 tweets for 6 languages
– http://www.win.tue.nl/~mpechen/projects/smm/
Language size detect correct precision recall
de German 1479 1476 1469 99.5 99.3
en English 1505 1502 1490 99.2 99.0
es Spanish 1562 1548 1541 99.6 98.7
fr French 1551 1549 1540 99.4 99.3
it Italian 1539 1531 1528 99.8 99.3
nl Dutch 1430 1429 1424 99.7 99.6
total 9066 ---- 8992 ---- 99.2
※ Using the 19-language model
64. Evaluation on the Europarl Dataset
language size ldig-correct ldig-rate langdetect-correct langdetect-rate CLD-correct CLD-rate
bg Bulgarian 1000 ---- ---- 988 98.8% 991 99.1%
cs Czech 1000 1000 100.0% 994 99.4% 995 99.5%
da Danish 1000 976 97.6% 968 96.8% 932 93.2%
de German 1000 999 99.9% 998 99.8% 1000 100.0%
el Greek 1000 ---- ---- 1000 100.0% 1000 100.0%
en English 1000 999 99.9% 996 99.6% 1000 100.0%
es Spanish 1000 1000 100.0% 996 99.6% 989 98.9%
et Estonian 1000 ---- ---- 996 99.6% 998 99.8%
fi Finnish 1000 997 99.7% 998 99.8% 1000 100.0%
fr French 1000 999 99.9% 999 99.9% 992 99.2%
hu Hungarian 1000 1000 100.0% 999 99.9% 999 99.9%
it Italian 1000 999 99.9% 999 99.9% 996 99.6%
lt Lithuanian 1000 ---- ---- 997 99.7% 999 99.9%
lv Latvian 1000 ---- ---- 999 99.9% 998 99.8%
nl Dutch 1000 1000 100.0% 974 97.4% 995 99.5%
pl Polish 1000 998 99.8% 999 99.9% 997 99.7%
pt Portuguese 1000 995 99.5% 996 99.6% 989 98.9%
ro Romanian 1000 1000 100.0% 999 99.9% 998 99.8%
sk Slovak 1000 ---- ---- 988 98.8% 990 99.0%
sl Slovene 1000 ---- ---- 976 97.6% 963 96.3%
sv Swedish 1000 995 99.5% 991 99.1% 993 99.3%
total 21000 13957 99.7% 20850 99.3% 20814 99.1%
※ The ldig total covers only the languages that ldig supports (---- = not supported)
65. Conclusions
• Language detector using maximal substring model
– Detects 19 languages with over 99% accuracy.
– Even langdetect reaches 97% accuracy with the tweet corpus.
• If the corpus is cleaned up further, the precision will rise further.
– There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will
rise further.
– How to add and train metadata at low cost?
• We want to shrink the model without loss of precision.
– Too large for applications (>100MB)
66. References
• [Nakatani NLP12] Twitter language detection using maximal substrings (極大部分文字列を使った twitter 言語判定, in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring
Features
• [Brody+ 11] Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!
Using Word Lengthening to Detect Sentiment in
Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training
for L1-regularized Log-linear Models with Cumulative
Penalty