Short Text Language Detection with Infinity-Gram


2012/05/14 NAIST Seminar
Nakatani Shuyo @ Cybozu Labs Inc

Agenda
• Language Detection
• Proposal Method
– Maximal Substring
• Corpus
• Implementation and Evaluation
• Conclusions


Language Detection


In What Language?
• Ik kan er nooit tegen als mensen me negeren.
• Aha ich seh angeblich süß aus
• Czy mógłbym zasnąć w przedmieściach Twoich myśli?
• Ah. Tak. Så skal jeg bare finde ud af *hvordan*!
• Det er ikke så digg nei å vi som har finale til helga....Skrekk og
gru! Takk :)
• tack kompis! Hade faktiskt tänkt maila dig på fb och fråga vart
du tog vägen!
• Çok doğru. En büyük hatayı yaptım.
• Încântat de cunoștință.
• Một người dân bị thương và bốn người mất tích sau khi một
ngọn núi lửa ở miền trung...

Hints
• Dutch if there is 'ik'
• German if there is 'ich' or a letter 'ß'
• Polish if there is 'czy' or letters 'Ł', 'ń', 'ś' or 'ź'
• Scandinavian if there is a letter 'å'
– Danish if there is 'af.' 'Tak' means 'thanks.'
– Norwegian if there is 'nei.' 'Takk' means 'thanks.'
– Swedish if there is "och." 'Tack' means 'thanks.'
• Turkish if there is a letter 'ı' ( 'i' without point) or 'ğ'
• Romanian if there is a letter 'ă' or 'ș' or 'ț'
– Although 'ă' is also used in Vietnamese, it is easy to distinguish them.
– Although 'ş' is also used in Turkish, it is easy to distinguish them.
• Vietnamese if there are many unreadable letters on WinXP :P

In What Language? (Solution)
• Ik kan er nooit tegen als mensen me negeren. Dutch
• Aha ich seh angeblich süß aus German
• Czy mógłbym zasnąć w przedmieściach Twoich myśli? Polish
• Ah. Tak. Så skal jeg bare finde ud af *hvordan*! Danish
• Det er ikke så digg nei å vi som har finale til helga....Skrekk og
gru! Takk :) Norwegian
• tack kompis! Hade faktiskt tänkt maila dig på fb och fråga vart
du tog vägen! Swedish
• Çok doğru. En büyük hatayı yaptım. Turkish
• Încântat de cunoștință. Romanian
• Một người dân bị thương và bốn người mất tích sau khi một
ngọn núi lửa ở miền trung... Vietnamese

What Is Language Detection?
• Detect which language the input text is written in
– Time flies like an arrow → English
– Buona sera! → Italian
• It is a prerequisite for many language processing tasks
– A language model is built for each language
– Text search, classification, extraction, translation, ...
• For sufficiently long, noise-free text, detection achieves
more than 99% accuracy [Cavnar+ 94]
– 3-gram models are used in many methods


SPAM or not?

• It is necessary to know that it is written in Polish.

Document Categorization
with Naive Bayes Classifier
• Categorize a document $X = (X_i)$ into a category $C_k$
– A document $X$ is represented as a collection of words $X_i$ (bag-of-words)
• Word probabilities are assumed to be conditionally independent given the category
– $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$ (from the independence assumption)
– where $p(X_i \mid C_k)$ : relative frequency of word $X_i$ in category $C_k$
• Estimate the category $C_k$ that maximizes the posterior
– $p(C_k \mid X) = \frac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$
– where $p(C_k)$ : prior for the category


Language Detection
with Naive Bayes Classifier
• Document categorization with language
labels
– Categorize documents into 'English', 'Japanese'
and so on
• Use character n-gram as features
– "Unicode code point n-gram", strictly speaking
– Assume character encoding of the document is
already known
• Most applications know the encoding of their internal text data
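
The classifier described above can be sketched in a few lines. This is only an illustration under stated assumptions: add-one smoothing and the tiny training texts are choices made for the example, and the class and function names are hypothetical, not langdetect's actual API or parameters.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Code-point n-grams of the text, padded with spaces as word boundaries."""
    padded = " " + text + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLangDetector:
    """Toy naive Bayes language detector over character n-grams."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.docs = Counter()               # language -> number of training texts
        self.vocab = set()                  # all n-grams seen in training

    def train(self, lang, text):
        grams = char_ngrams(text, self.n)
        self.counts[lang].update(grams)
        self.docs[lang] += 1
        self.vocab.update(grams)

    def detect(self, text):
        total_docs = sum(self.docs.values())
        best_lang, best_score = None, -math.inf
        for lang in self.docs:
            # log p(C_k) + sum_i log p(X_i | C_k)
            score = math.log(self.docs[lang] / total_docs)
            # Add-one smoothing (an assumption of this sketch).
            denom = sum(self.counts[lang].values()) + len(self.vocab) + 1
            for gram in char_ngrams(text, self.n):
                score += math.log((self.counts[lang][gram] + 1) / denom)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


detector = NaiveBayesLangDetector()
detector.train("en", "this is the thing that they think")
detector.train("de", "ich sehe das nicht und ich bin hier")
print(detector.detect("the thing"), detector.detect("ich bin"))
```

Working in log space avoids underflow when many n-gram probabilities are multiplied.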


Why Use n-Gram to Detect Language

• Each language has its own characters and spelling rules
– “é” is often used in Spanish, Italian and so on, but in principle not in English
– Many words start with “Z” in German, but not in English
– Many words start with “C” in English, but not in German
– The spelling “Th” is often used in English, but not in other languages

          □C    □L    □Z    Th
English   0.75  0.47  0.02  0.74
German    0.10  0.37  0.53  0.03
French    0.38  0.69  0.01  0.01

n-grams of “□This□” (□ = word boundary)
1-gram: T h i s
2-gram: □T Th hi is s□
3-gram: □Th Thi his is□

language-detection(langdetect)
(Nakatani 2010)

• Language detection library for Java
– http://code.google.com/p/language-detection/
– Apache License 2.0
– Character 3-gram + Bayesian filter
– Various normalizations + Feature sampling
• Over 99% precision for 53 languages
– Trained on Wikipedia abstracts
– Wide language support, including Asian languages
– Adopted by Apache Solr


Evaluation on News Text
Language size accuracy Language size accuracy
af Afrikaans 200 199 (99.50%) mr Marathi 200 200 (100.00%)
ar Arabic 200 200 (100.00%) ne Nepali 200 200 (100.00%)
bg Bulgarian 200 200 (100.00%) nl Dutch 200 200 (100.00%)
bn Bengali 200 200 (100.00%) no Norwegian 200 199 (99.50%)
cs Czech 200 200 (100.00%) pa Punjabi 200 200 (100.00%)
da Danish 200 179 (89.50%) pl Polish 200 200 (100.00%)
de German 200 200 (100.00%) pt Portuguese 200 200 (100.00%)
el Greek 200 200 (100.00%) ro Romanian 200 200 (100.00%)
en English 200 200 (100.00%) ru Russian 200 200 (100.00%)
es Spanish 200 200 (100.00%) sk Slovak 200 200 (100.00%)
fa Persian 200 200 (100.00%) so Somali 200 200 (100.00%)
fi Finnish 200 200 (100.00%) sq Albanian 200 200 (100.00%)
fr French 200 200 (100.00%) sv Swedish 200 200 (100.00%)
gu Gujarati 200 200 (100.00%) sw Swahili 200 200 (100.00%)
he Hebrew 200 200 (100.00%) ta Tamil 200 200 (100.00%)
hi Hindi 200 200 (100.00%) te Telugu 200 200 (100.00%)
hr Croatian 200 200 (100.00%) th Thai 200 200 (100.00%)
hu Hungarian 200 200 (100.00%) tl Tagalog 200 200 (100.00%)
id Indonesian 200 200 (100.00%) tr Turkish 200 200 (100.00%)
it Italian 200 200 (100.00%) uk Ukrainian 200 200 (100.00%)
ja Japanese 200 200 (100.00%) ur Urdu 200 200 (100.00%)
kn Kannada 200 200 (100.00%) vi Vietnamese 200 200 (100.00%)
ko Korean 200 200 (100.00%) zh-cn Simplified Chinese 200 200 (100.00%)
mk Macedonian 200 200 (100.00%) zh-tw Traditional Chinese 200 200 (100.00%)
ml Malayalam 200 200 (100.00%) total 9800 9777 (99.77%)

• Tested on news text crawled from the web in 49 languages

Evaluation on Europarl Datasets
• Tested on 1000 samples per language from the Europarl Parallel Corpus
– from the proceedings of the European Parliament
– http://www.statmt.org/europarl/
• http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip

language size correct accuracy
bg Bulgarian 1000 988 98.8%
cs Czech 1000 994 99.4%
da Danish 1000 968 96.8%
de German 1000 998 99.8%
el Greek 1000 1000 100.0%
en English 1000 996 99.6%
es Spanish 1000 996 99.6%
et Estonian 1000 996 99.6%
fi Finnish 1000 998 99.8%
fr French 1000 999 99.9%
hu Hungarian 1000 999 99.9%
it Italian 1000 999 99.9%
lt Lithuanian 1000 997 99.7%
lv Latvian 1000 999 99.9%
nl Dutch 1000 974 97.4%
pl Polish 1000 999 99.9%
pt Portuguese 1000 996 99.6%
ro Romanian 1000 999 99.9%
sk Slovak 1000 988 98.8%
sl Slovene 1000 976 97.6%
sv Swedish 1000 991 99.1%
total 21000 20850 99.3%

Language detection is a solved problem,
isn't it?

We still have an ENEMY to beat!

Twitter Language Detection
with the Existing Methods
• Only 90-95% accuracy for tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
– http://code.google.com/p/chromium-compact-language-detector/
– regard ms(Malay) as id(Indonesian)
• Tika = Apache Tika
– http://tika.apache.org/
– Evaluated on the 15 languages in our tweet corpus that Tika supports

language LD CLD Tika
ca Catalan 95.3 93.0 83.8
cs Czech 96.3 96.6 ----
da Danish 94.5 90.7 58.7
de German 86.6 96.8 73.1
en English 88.3 97.4 54.7
es Spanish 91.5 90.5 44.4
fi Finnish 98.9 99.4 94.8
fr French 95.0 94.5 67.4
hu Hungarian 85.8 89.0 76.2
id Indonesian 89.7 92.8 ----
it Italian 96.2 93.8 87.1
nl Dutch 69.5 93.2 65.0
no Norwegian 96.0 74.9 68.6
pl Polish 98.0 97.8 88.8
pt Portuguese 88.0 88.6 47.4
ro Romanian 92.8 96.1 82.6
sv Swedish 96.0 96.4 75.6
tr Turkish 97.6 97.4 ----
vi Vietnamese 98.7 98.9 ----
total 92.2 93.8 70.0

Chromium Compact Language Detection
(CLD)

• A port of the language detector from Google Chromium
– http://code.google.com/p/chromium-compact-language-detector/

– Implementation in C++, Python binding
– # of supported languages : CLD = 76, langdetect = 53
– Accuracy : CLD = 98.82%, langdetect = 99.22%
• for 17 languages on Europarl datasets
• http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Is twitter Language Detection difficult? (1)

• Tweets are too short to extract enough 3-gram features
– At most 140 characters on twitter
– URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
– Graph-features based on 3-gram
• Add long distance features
• 95～98% accuracy for twitter Language Detection
• 6 languages (de, en, es, fr, it, nl)


Is twitter Language Detection difficult? (2)

• Tweets are too noisy
– Spellings that deviate from the language's orthography appear often
– Acronyms, abbreviations, lengthened words (like 'Cooooolll')
• The likelihood of a tweet tends to be lower under a normal
language model

OMG   Oh My God                 u     you
LOL   Laughing Out Loud         ur    your
LMAO  Laughing My Ass Off       4     for
F4F   Follow for Follow         i0u   I love you
MDR   Mort de Rire (French)     k     che (Italian)
TKT   Ne t'Inquiète Pas (Fr)    anke  anche (Italian)
(The letter 'k' isn't used in Italian)

Motivation to Detect Short Text Language

• There are many small chunks of text in addition
to twitter
– Schedule, search query, bulletin board and so on
– There are many questions about short text detection
in the Issues Board of langdetect Project
• http://code.google.com/p/language-detection/issues/detail?id=10

• Detection for multi-language mixed text
– Cut the target document in paragraphs or lines
– Detect for each short text


Our Goal
• Over 99% accuracy
– However it is too difficult to detect "one
word sentence"...
– Our Goal is 99%+ accurate detection for
"sentence with more than 3 words"


We need
• A model that can extract rich features from
short text,
– Maximal substring model
(∞-gram Logistic Regression)
• and a twitter-specific language model,
or a corpus to construct it.
– A corpus of about 700K tweets with language
labels

Proposal Method


How to increase features from 3-grams
• The larger n, the more features
• Maximum at n=∞, that is, all substrings
– But it has O(T^2) order

n    # of n-grams
     freq≧1   freq≧2   freq≧10
1    79       72       57
2    1896     1533     902
3    15970    10369    4525
4    64966    33941    10534
5    167543   69719    15538
6    323749   107861   18970
7    524634   142954   21093
8    760719   171995   22159
9    921361   193995   22696
:    :        :        :
※ cumulative distribution of feature length for 5090 normalized English tweets (300KB)

Text Categorization with All Substring Features
[Okanohara+ 09]

• Multiclass Logistic Regression using all
substrings as features
– Using maximal substrings gives an equivalent
model that can be constructed in linear
time

– Features are stored in a TRIE for fast prediction

Maximal Substring (1)
• Define a containment relation (semi-order)
among non-empty substrings

Example: abracadabra
– “ra” ⊂ “bra” ⇔ every “ra” occurs
as a substring of “bra”
– “a” ⊄ “ra” ⇔ “a” occurs not only in “ra”
but also in “ca”
※ Strictly, the definition also takes into account the position within the containing substring.

Maximal Substring (2)

via http://d.hatena.ne.jp/nokuno/20120203/1328237067

• Each equivalence class formed by the containment
relation has a unique maximal element, which is
called a "Maximal Substring".
• The maximal substrings of "abracadabra" are "a", "abra"
and "abracadabra".
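
The "abracadabra" example can be checked with a brute-force sketch. It takes a substring to be maximal when every one-character extension to the left or right occurs strictly less often. This is a quadratic illustration written for this slide, not the linear-time suffix-array method, and `maximal_substrings` is a hypothetical name.

```python
from collections import Counter


def maximal_substrings(text):
    """Return substrings whose every one-character extension is strictly rarer."""
    n = len(text)
    # Count every occurrence of every substring (overlaps included).
    counts = Counter(text[i:j] for i in range(n) for j in range(i + 1, n + 1))
    alphabet = set(text)
    result = []
    for sub, freq in counts.items():
        extensions = [ch + sub for ch in alphabet] + [sub + ch for ch in alphabet]
        # Counter returns 0 for extensions that never occur in the text.
        if all(counts[ext] < freq for ext in extensions):
            result.append(sub)
    return sorted(result, key=len)


print(maximal_substrings("abracadabra"))
```

For example, "bra" is not maximal because "abra" occurs exactly as often, so both fall into the same equivalence class.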


Maximal Substring and Infinity-Gram

• Frequencies of substrings that have a
containment relationship are always equal.

• In a model with a linear combination of
features, features with identical values can be
merged into one.

• Logistic regression with maximal substrings is
equivalent to the one with infinity-grams.
※ Although the equivalence breaks down on the test set,
we assume that it holds approximately given a sufficiently large training set.

Extended Suffix Array
• Extended Suffix Array consists of
– SA=Suffix Array,
– L=Longest Common Prefixes,
– B=Burrows-Wheeler transformed text.
• A maximal substring that occurs more than once corresponds
to an internal node of the suffix tree, which is equivalent to a
suffix with L>0 whose BWT has more than 1 character type.
– They can be calculated in linear time.

• esaxx : Okanohara's implementation of ESA
– http://code.google.com/p/esaxx/

via [Okanohara+ 09]

Corpus and Normalization


Target Languages
• Limit the character type to detect
– In short text detection, mixed text can be
divided by character type
• Latin alphabet languages
– The most difficult alphabet type to detect
– More than 25 languages have over 5
million speakers each.

What's Latin Alphabet?
• Latin alphabet ≠ ascii alphabet
– å, ą, æ, ð, Ħ, ŋ and so on...
• They are assigned to 9 code blocks in Unicode

Range         Name                          Notes
U+0000-007F   Basic Latin                   ascii
U+0080-00FF   Latin-1 Supplement            Most languages are covered with these.
U+0100-017F   Latin Extended-A              Most languages are covered with these.
U+0180-024F   Latin Extended-B              Romanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF   Latin Extended Additional     Vietnamese
U+2C60-2C7F   Latin Extended-C              Hardly used by any present-day language
U+A720-A7FF   Latin Extended-D              Hardly used by any present-day language

Latin Alphabets
in Unicode Codepoint Chart
(Figure: Unicode code chart; legend: used often / used sometimes / for Vietnamese only)

How to Create Corpus
• Collect tweets with the 'sample' method of
the twitter Streaming API
– It samples 1% of all tweets (about 2
million tweets).
– Tweets in Latin alphabet languages
account for 60% of them.
• All that remains is to annotate these
tweets with language labels

Language Label Annotation
• Group tweets by their timezone
– French tweets account for about 1% of all tweets
– But they account for 50% of tweets in the Paris
timezone
• Annotate tentative labels to tweets using
langdetect
– Remove non-French tweets from those labeled ‘fr’
– Recover French tweets from those not labeled ‘fr’

(※ 20% of all tweets have no timezone)

How to annotate

Swedish, Norwegian, Danish, Vietnamese, Lithuanian,
Czech, Hungarian, Catalan, Romanian and Polish guides in turn

Created Corpus
• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• Catalan corpus created with Raúl Velaz and Hiroshi Manabe

language training test
ca Catalan 9,089 5,082
cs Czech 9,082 7,682
da Danish 7,388 5,524
de German 44,448 10,065
en English 44,520 10,168
es Spanish 44,118 10,265
fi Finnish 8,087 7,050
fr French 44,339 10,098
hu Hungarian 10,030 4,904
id Indonesian 44,722 10,181
it Italian 43,366 10,152
nl Dutch 44,682 10,007
no Norwegian 10,124 8,496
pl Polish 16,771 10,152
pt Portuguese 44,215 10,208
ro Romanian 10,021 5,911
sv Swedish 44,054 10,032
tr Turkish 44,703 10,308
vi Vietnamese 15,030 10,488
total 538,789 166,773

Simple Language Detection
• Language detector can be constructed
from maximal substring model and
twitter corpus
– It still gets at most 98% accuracy.
• We guess it is necessary to reduce bias.
– data size bias
– language-specific bias
– twitter-specific bias


Bias by Data Size
• The number of tweets per language is heavily imbalanced.
• Level them out by sampling with replacement
from each language up to the size of the largest one
– In practice this is approximated by copying the data an integer
number of times and sampling the remainder without replacement
(Chart: share of tweets by language; legend: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others)
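
A minimal sketch of the approximation described above: copy each language's data an integer number of times, then sample the remainder without replacement. The function name `level_out` and the fixed seed are assumptions made for the example, not ldig's actual code.

```python
import random


def level_out(samples_by_lang, seed=0):
    """Oversample every language up to the size of the largest one."""
    rng = random.Random(seed)
    target = max(len(samples) for samples in samples_by_lang.values())
    leveled = {}
    for lang, samples in samples_by_lang.items():
        copies, rest = divmod(target, len(samples))
        # Whole copies of the data, plus a remainder drawn without replacement.
        leveled[lang] = samples * copies + rng.sample(samples, rest)
    return leveled


corpus = {"en": ["t%d" % i for i in range(10)], "da": ["a", "b", "c"]}
leveled = level_out(corpus)
print({lang: len(samples) for lang, samples in leveled.items()})
```

Every sample of a small language then appears either `copies` or `copies + 1` times, which keeps the oversampling even.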

Convert to Lowercase
on Multiple Languages
• Converting to lower case reduces the amount of corpus needed and
compresses the model.
• But the lower case of I (U+0049) in Turkish
differs from other languages.
• Convert to lower case, excluding ‘I’

Language               Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
Turkish, Azerbaijani   İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
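
A minimal sketch of lowercasing that skips ‘I’. The slide only names ‘I’ (U+0049); leaving ‘İ’ (U+0130) unchanged as well is an assumption of this example, made because Python's `str.lower()` turns it into two code points. The function name is hypothetical.

```python
def lower_except_i(text):
    """Lowercase everything except 'I' (U+0049) and 'İ' (U+0130)."""
    # Keeping U+0130 untouched is an assumption; the slide mentions only 'I'.
    keep = {"\u0049", "\u0130"}
    return "".join(ch if ch in keep else ch.lower() for ch in text)


print(lower_except_i("ISTANBUL İyi Day"))
```

The model then sees ‘I’ as its own feature and can learn per language whether it pairs with ‘i’ or ‘ı’.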

Normalization for Romanian
• Romanian uses â, ă, î, ș, ț in addition to a-z
• There are 2 kinds of s/t with a “beard”
– U+015E-F, U+0162-3 : s/t with cedilla
– U+0218-B : s/t with comma below
• ‘s/t with cedilla’ is more popular on news, twitter and Wikipedia.
• The 2 code points have the same design in some fonts...
– Indistinguishable!!

ș (U+0219)  ş (U+015F)  ț (U+021B)  ţ (U+0163)

Romanian Character Affairs on PC
• Although Romanian orthography requires ‘s/t
with comma’, these characters were not available
on PCs until recently.
– 1989 Democratization in Romania
– 2001 ‘s/t with comma’ was provided by ISO8859-16(Latin-10) and Unicode
– 2007 Romania joined the EU
– 2007 Windows Vista supported ‘s/t with comma’ (available for everyone!)

(Photo: ‘s/t with cedilla’ is used on an advertisement board in Bucharest)

Normalization
for Substitute Characters
• ‘s/t with cedilla’ are substitute characters
– But they are more popular than the proper ones
– with cedilla : with comma = 2 : 1
– “Romanian IME” outputs the substitutes too :D
• Regard ‘s/t with comma’ as ‘s/t with cedilla’

ț (U+021B)  ţ (U+0163)
I reckon it is similar to the relationship between the glyph variants
of the Japanese character ‘SA’ (ささ)!!
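
A minimal sketch of the rule “regard ‘s/t with comma’ as ‘s/t with cedilla’”, using a translation table over the code points listed on the slides. The function name is hypothetical and this is not necessarily how ldig implements the step.

```python
# Map 's/t with comma below' onto 's/t with cedilla'.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Replace comma-below s/t with their cedilla substitutes."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("Încântat de cuno\u0219tin\u021b\u0103."))
```

Mapping toward the cedilla forms follows the slide's choice of the more frequent variant, so fewer characters in real text need rewriting.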

Arabic Character Normalization
(on language-detection)

• Arabic and Persian have a similar problem.
• The character ‘yeh’ in Farsi corresponds to 2 code points.
– Wikipedia uses only ی (U+06cc, Farsi yeh)
– News uses only ي (U+064a, Arabic yeh)
• U+064a is a substitute in Farsi
– The popular Arabic charset CP-1256 has no character
mapped to U+06cc
– As ‘yeh’ is used very often in both languages, almost all
Persian text detection fails
• Regard U+06cc as U+064a

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
– a, ă, â, e, ê, i, y, o, ô, ơ, u, ư
• Vietnamese has 6 tones
– a, ả, à, ã, á, ạ
– These tone symbols are also used in
general documents like news.
• The tone symbols can be appended to
all vowels
– 12 * 6 = 72

Normalization for Vietnamese (2)
• Representation of vowels with
tones
1. Use precomposed characters U+1ea0 - U+1ef9
• ẵ = U+1eb5
2. Combine with Diacritical Marks
• ẵ = U+0103 U+0303
– Half and half on news and tweets
• Normalize 2 into 1
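
One way to sketch “normalize 2 into 1” is Unicode NFC composition from Python's standard library. Using NFC here is an assumption of this example; the slides do not say which mechanism ldig uses.

```python
import unicodedata


def normalize_vietnamese(text):
    """Compose base letters and combining marks into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"  # ă followed by a combining tilde
composed = normalize_vietnamese(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))
```

After composition both spellings of ẵ yield the same single code point, so they produce identical substring features.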

CJK-Kanji Normalization (1)

• CJK-Kanji has too many characters (more than 20K)
– Other character types have only 30-50 characters.
• The character space is very sparse.
– Characters that don’t occur in the training corpus have no
probabilities.
• e.g. "谢谢", Kanji for person names
– Common frequent characters are too strong.
• e.g. a text which has ”的” tends to be detected as Traditional
Chinese
• Because Kana is also used in Japanese, the probabilities of Kanji in
Japanese are lower than those in Chinese.

CJK-Kanji Normalization (2)

• Group Kanjis by frequency and normalize each group to the
representative character
– (1) K-means clustering
• Use tf-idf on Wikipedia and Google News
• K=50 (size of ascii alphabet = 52)
– (2) “Commonly Used Kanji” provided in Japanese and Chinese
• Simplified Chinese : 现代汉语常用字表(3500)
• Traditional Chinese :常用国字標準字体表(4808)
⊂ Big5 the first standard(5401)
• Japanese : 常用漢字(2136)∪ JIS the first standard(2965) = 2998
– 常用漢字 doesn’t have Kanji for person name and place name very much

• Generate 130 clusters from product of (1) and (2)


Normalization for twitter
• Simply remove
– URLs
– mentions
– hashtags
– RT
– emoticons made of letters, like XD, :p
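
The removals above can be sketched with regular expressions. The patterns below are illustrative assumptions (the emoticon pattern in particular covers only a few shapes) and are not ldig's actual rules.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RT_RE = re.compile(r"\bRT\b")
# Letter-based emoticons such as XD, xDDD, :p, :D (a deliberately small subset).
EMOTICON_RE = re.compile(r"(?<!\w)(?:[xX]+[dDpP]+|[:;=]-?[dDpPoO)(])(?!\w)")


def strip_twitter_noise(text):
    """Remove URLs, mentions, hashtags, RT markers and simple emoticons."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, RT_RE, EMOTICON_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse the whitespace left behind


print(strip_twitter_noise("RT @xxx check http://t.co/abc #fun XD nice :p"))
```

URLs are removed first so that fragments such as `#anchor` inside a link are not treated as hashtags.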


Normalization for
twitter-Specific Representation
• How to handle ‘coooooooollllll’
• Case 1: Make a normalization dictionary using [Brody+
2011]
– Unsupervised normalization like coooollll → cool
– It can’t handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times,
shrink it to 2
– No language written in the Latin alphabet has 3 or more
consecutive identical letters in its orthography.
• In Japanese, there are “かたたたき”, “かわいいいぬ”, “あわてて
て” and so on.
• Acronyms (like WWW, СССР) are not useful for language detection
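
Case 2 can be sketched with a single backreference. Restricting the pattern to Latin letter ranges is an assumption of this example, chosen so that digits, punctuation and Japanese text are left untouched.

```python
import re

# A Latin letter followed by two or more copies of itself.
# The ranges (ASCII, U+00C0-024F, U+1E00-1EFF) are an assumption of this sketch.
REPEAT_RE = re.compile(r"([A-Za-z\u00c0-\u024f\u1e00-\u1eff])\1{2,}")


def shrink_repeats(text):
    """Shrink runs of 3 or more identical Latin letters down to 2."""
    return REPEAT_RE.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"), shrink_repeats("LOOOOOOOOOOOOOOOOL"))
```

Shrinking to two letters rather than one keeps legitimate doubles such as the "oo" in "cool" intact.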


Laugh Normalization
• Each language has its own variety of laughs
– HOW MUCH DO YOU LOVE COACH BEISTE???
HHAHAHAHAHAH
– Hihihihi. :) Habe ich regulär 2x die Woche!
– Tafil con eso...!!! Jajajajajajaja
– Malo?? Jejejeje XP
– kekeke chỗ đó làm áo được ko em?
• Shrink them to two repetitions
– hahahha ⇒ haha
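
A rough sketch of laugh shrinking, assuming the text is already lowercased and that a laugh is a two-letter unit repeated three or more times. Irregular laughs such as ‘hahahha’ would need extra rules that are not shown here; the pattern is an illustrative assumption, not ldig's actual rule.

```python
import re

# A two-letter unit followed by two or more copies of itself.
LAUGH_RE = re.compile(r"([a-z]{2})\1{2,}")


def shrink_laugh(text):
    """Shrink a repeated two-letter unit (3 or more repetitions) to 2 repetitions."""
    return LAUGH_RE.sub(r"\1\1", text)


for sample in ("jajajajajajaja", "hihihihi. :)", "kekeke"):
    print(shrink_laugh(sample))
```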

Implementation and Evaluation

Language Detection with Infinity-Gram
(ldig)

• Tweet language detection for Latin-alphabet
languages
– https://github.com/shuyo/ldig
• MIT license
• The trained model is also distributed here
– ∞-gram LR(maximal substring) [Okanohara+ 09]
– L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
– Double Array

Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus]
-x [maximal substring extractor]
--ff=[lower limit of frequency]
– Extract features from the corpus and initialize the
model
– -m : model directory
– -x : path of the maximal substring extractor
(executed as an external process)
– --ff : ignore features whose frequency is below the specified value

Maximal Substring Extractor
• maxsubst [input file] [output file]
– Input is multi-line text
• TABs are replaced with “ ” and line feeds with U+0001
– Output format is “[feature]\t[frequency]”
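
The two formats above can be sketched as follows. This is one reading of the slide: whether maxsubst joins lines exactly this way is an assumption, and both helper names are hypothetical.

```python
def to_extractor_text(lines):
    """Build one string: TAB -> space, line boundaries -> U+0001."""
    return "\u0001".join(line.replace("\t", " ") for line in lines)


def parse_extractor_output(output):
    """Parse '[feature]<TAB>[frequency]' rows into a dict."""
    features = {}
    for row in output.split("\n"):
        if not row:
            continue
        # Split on the last TAB; the feature itself contains no TABs.
        feature, freq = row.rsplit("\t", 1)
        features[feature] = int(freq)
    return features


print(repr(to_extractor_text(["a\tb", "c"])))
print(parse_extractor_output("a\t5\nabra\t2\n"))
```

Replacing line feeds with U+0001 gives a separator that cannot occur in normal text, so no extracted feature silently spans two tweets.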


Usage (2) Learn
• ldig.py -m [model] --learning [corpus]
-e [learning rate] -r [regularizer]
--wr=[whole regularization]
– Train the model on the corpus for 1 cycle of SGD
– -e : learning rate of SGD
– -r : coefficient of L1 regularization
– --wr : how often to regularize all parameters
• There are too many parameters to regularize all of them
at every step

Usage (3) Shrink Model
• ldig.py -m [model] --shrink
– Remove ineffective features (those whose
parameters are all 0) from the
model

Usage (4) Detect Language
• ldig.py -m [model] [test data]
– Detect languages of test data and output
its result and summary


Data Format
• Training and test data
– [correct label]\t[meta data]\t[text]

en u should just enjoy ur vacation sadly
en :D i'm online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca [status ID] [datetime] [userID] [language of UI]
@xxx xDDD no m'extranya... Tal volta haguera segut
millor per a la humanitat que no l'haguera vist... you know..
xDD
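
A minimal sketch of reading one line in this format. It assumes the first TAB-separated field is the label, the last field is the text, and everything in between is metadata (possibly several fields, as in the Catalan example). That split is an assumption of this example, and `parse_line` is a hypothetical helper rather than ldig's loader.

```python
def parse_line(line):
    """Split '[label]<TAB>[meta data...]<TAB>[text]' into its parts."""
    fields = line.rstrip("\n").split("\t")
    label, metadata, text = fields[0], fields[1:-1], fields[-1]
    return label, metadata, text


print(parse_line("en\t\tu should just enjoy ur vacation sadly\n"))
print(parse_line("ca\t123\t2012-05-14\t42\tes\t@xxx xDDD no m'extranya..."))
```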


Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
– Open http://localhost:[port] after it is executed
– Shows the language probabilities, the contained
features and their parameters for text entered
in the text area

Evaluation
language size detect correct precision recall LD53 LDsm
ca Catalan 5,093 4,923 4,857 98.66 95.37 95.3 97.0
cs Czech 7,681 7,668 7,663 99.93 99.77 96.3 99.7
da Danish 5,516 5,472 5,310 97.04 96.27 94.5 92.4
de German 10,060 10,069 10,006 99.37 99.46 86.6 93.8
en English 10,162 10,133 10,029 98.97 98.69 88.3 95.0
es Spanish 10,244 10,284 10,120 98.41 98.79 91.5 96.0
fi Finnish 7,051 7,038 7,024 99.80 99.62 98.9 99.6
fr French 10,074 10,134 10,051 99.18 99.77 95.0 98.1
hu Hungarian 4,904 4,892 4,858 99.30 99.06 85.8 95.5
id Indonesian 10,178 10,225 10,160 99.36 99.82 89.7 98.9
it Italian 10,143 10,205 10,103 99.00 99.61 96.2 98.0
nl Dutch 10,005 9,916 9,858 99.42 98.53 69.5 97.4
no Norwegian 8,504 8,432 8,201 97.26 96.44 96.0 96.3
pl Polish 10,151 10,149 10,130 99.81 99.79 98.0 99.7
pt Portuguese 10,212 10,201 10,119 99.20 99.09 88.0 96.9
ro Romanian 5,913 5,867 5,850 99.71 98.93 92.8 97.4
sv Swedish 10,025 10,093 9,942 98.50 99.17 96.0 97.9
tr Turkish 10,308 10,317 10,298 99.82 99.90 97.6 99.5
vi Vietnamese 10,487 10,480 10,474 99.94 99.88 98.7 99.2
total 166,711 165,053 99.01 92.2 97.4
LD53 = langdetect + standard bundled profiles, LDsm = langdetect + profiles based on twitter corpus
Since a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size
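
The precision and recall columns follow from the size, detect and correct counts, as the sketch below shows for the Catalan row. The 0.6 threshold comes from the note above; both function names are hypothetical.

```python
def precision_recall(size, detect, correct):
    """precision = correct / detect, recall = correct / size (both in %)."""
    return 100.0 * correct / detect, 100.0 * correct / size


def detect_with_threshold(probs, threshold=0.6):
    """Return the most probable language, or None if its probability is below the threshold."""
    lang, prob = max(probs.items(), key=lambda item: item[1])
    return lang if prob >= threshold else None


precision, recall = precision_recall(size=5093, detect=4923, correct=4857)
print(round(precision, 2), round(recall, 2))
print(detect_with_threshold({"da": 0.55, "no": 0.45}))
```

Texts rejected by the threshold count against recall but not against precision, which is why recall can sit well below precision for close pairs such as da and no.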

Evaluation on LIGA dataset
• Evaluated using the LIGA [Tromp+ 11] dataset
with 9066 tweets for 6 languages
– http://www.win.tue.nl/~mpechen/projects/smm/

Language size detect correct precision recall
de German 1479 1476 1469 99.5 99.3
en English 1505 1502 1490 99.2 99.0
es Spanish 1562 1548 1541 99.6 98.7
fr French 1551 1549 1540 99.4 99.3
it Italian 1539 1531 1528 99.8 99.3
nl Dutch 1430 1429 1424 99.7 99.6
total 9066 8992 99.2
※ Using the 19-language model

Evaluation on Europarl Dataset
                       ldig             langdetect       CLD
language size          correct rate     correct rate     correct rate
bg Bulgarian 1000      -     -          988  98.8%       991  99.1%
cs Czech 1000          1000  100.0%     994  99.4%       995  99.5%
da Danish 1000         976   97.6%      968  96.8%       932  93.2%
de German 1000         999   99.9%      998  99.8%       1000 100.0%
el Greek 1000          -     -          1000 100.0%      1000 100.0%
en English 1000        999   99.9%      996  99.6%       1000 100.0%
es Spanish 1000        1000  100.0%     996  99.6%       989  98.9%
et Estonian 1000       -     -          996  99.6%       998  99.8%
fi Finnish 1000        997   99.7%      998  99.8%       1000 100.0%
fr French 1000         999   99.9%      999  99.9%       992  99.2%
hu Hungarian 1000      1000  100.0%     999  99.9%       999  99.9%
it Italian 1000        999   99.9%      999  99.9%       996  99.6%
lt Lithuanian 1000     -     -          997  99.7%       999  99.9%
lv Latvian 1000        -     -          999  99.9%       998  99.8%
nl Dutch 1000          1000  100.0%     974  97.4%       995  99.5%
pl Polish 1000         998   99.8%      999  99.9%       997  99.7%
pt Portuguese 1000     995   99.5%      996  99.6%       989  98.9%
ro Romanian 1000       1000  100.0%     999  99.9%       998  99.8%
sk Slovak 1000         -     -          988  98.8%       990  99.0%
sl Slovene 1000        -     -          976  97.6%       963  96.3%
sv Swedish 1000        995   99.5%      991  99.1%       993  99.3%
total 21000            13957 99.7%      20850 99.3%      20814 99.1%
※ The ldig total covers only the languages ldig supports ("-" = not supported)

Conclusions
• Language detector using maximal substring model
– Detects 19 languages with over 99% accuracy.
– Even langdetect reaches 97% accuracy with the tweet corpus.
• If the corpus is maintained further, precision will improve further.
– There are still many mistakes (in particular da and no)
• If metadata is added to the features, precision will
improve further.
– How to add and train metadata at low cost?
• We want to shrink the model without loss of precision.
– Too large for applications (>100MB)

References
• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (極大部分文字列を使った twitter 言語判定)
• [Okanohara+ 09] Text Categorization with All Substring
Features
• [Brody+ 11] Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!
Using Word Lengthening to Detect Sentiment in
Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training
for L1-regularized Log-linear Models with Cumulative
Penalty
