Short Text Language Detection
      with Infinity-Gram

   2012/05/14 NAIST Seminar
Nakatani Shuyo @ Cybozu Labs Inc
Agenda
• Language Detection
• Proposed Method
  – Maximal Substring
• Corpus
• Implementation and Evaluation
• Conclusions

            Short Text Language Detection with Infinity-Gram
                                                               4
                           (NAIST Seminar)
Language Detection

In What Language?
• Ik kan er nooit tegen als mensen me negeren.
• Aha ich seh angeblich süß aus
• Czy mógłbym zasnąć w przedmieściach Twoich myśli?
• Ah. Tak. Så skal jeg bare finde ud af *hvordan*!
• Det er ikke så digg nei å vi som har finale til helga....Skrekk og
  gru! Takk :)
• tack kompis! Hade faktiskt tänkt maila dig på fb och fråga vart
  du tog vägen!
• Çok doğru. En büyük hatayı yaptım.
• Încântat de cunoștință.
• Một người dân bị thương và bốn người mất tích sau khi một
  ngọn núi lửa ở miền trung...
Hints
• Dutch if there is 'ik'
• German if there is 'ich' or a letter 'ß'
• Polish if there is 'czy' or letters 'Ł', 'ń', 'ś' or 'ź'
• Scandinavian if there is a letter 'å'
    – Danish if there is 'af'; 'Tak' means 'thanks'.
    – Norwegian if there is 'nei'; 'Takk' means 'thanks'.
    – Swedish if there is 'och'; 'Tack' means 'thanks'.
• Turkish if there is a letter 'ı' (dotless 'i') or 'ğ'
• Romanian if there is a letter 'ă' or 'ș' or 'ț'
    – Although 'ă' is also used in Vietnamese, it is easy to distinguish them.
    – Although 'ş' is also used in Turkish, it is easy to distinguish them.
• Vietnamese if there are many unreadable letters on WinXP :P
In What Language? (Solution)
• Ik kan er nooit tegen als mensen me negeren. → Dutch
• Aha ich seh angeblich süß aus → German
• Czy mógłbym zasnąć w przedmieściach Twoich myśli? → Polish
• Ah. Tak. Så skal jeg bare finde ud af *hvordan*! → Danish
• Det er ikke så digg nei å vi som har finale til helga....Skrekk og
  gru! Takk :) → Norwegian
• tack kompis! Hade faktiskt tänkt maila dig på fb och fråga vart
  du tog vägen! → Swedish
• Çok doğru. En büyük hatayı yaptım. → Turkish
• Încântat de cunoștință. → Romanian
• Một người dân bị thương và bốn người mất tích sau khi một
  ngọn núi lửa ở miền trung... → Vietnamese
What Is Language Detection?
• Detect what language the input text is written in
  – Time flies like an arrow → English
  – Buona sera!            → Italian
• It is a prerequisite for many language processing tasks
  – Language models are built separately for each language
  – Text search, classification, extraction, translation, ...
• For long enough, noise-free text, detection is possible
  with more than 99% accuracy [Cavnar+ 94]
  – Many methods use a 3-gram model

SPAM or not?




• It is necessary to know that it is written in Polish.
Document Categorization
       with Naive Bayes Classifier
• Categorize a document $X = (X_i)$ into a category $C_k$
   – A document $X$ is represented as a collection of words
     $X_i$ (bag-of-words)
• Word probabilities are assumed conditionally independent
  given the category
   – $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$   (from the independence assumption)
   – where $p(X_i \mid C_k)$ : relative frequency of word $X_i$ in category $C_k$
• Estimate the category $C_k$ that maximizes the posterior
   – $p(C_k \mid X) = \frac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$
   – where $p(C_k)$ : prior for the category
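
The posterior above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular library: the function names `train` and `classify`, the toy documents, and the add-one smoothing constant `alpha` are assumptions introduced here (the slide does not say how zero counts are handled).

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (category, list of features). Returns the raw counts."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for label, feats in docs:
        class_counts[label] += 1
        feature_counts[label].update(feats)
        vocab.update(feats)
    return class_counts, feature_counts, vocab

def classify(model, feats, alpha=1.0):
    """Return argmax_k of log p(C_k) + sum_i log p(X_i | C_k)."""
    class_counts, feature_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        # alpha is an assumed smoothing term so unseen features keep p > 0
        denom = sum(feature_counts[label].values()) + alpha * len(vocab)
        for f in feats:
            score += math.log((feature_counts[label][f] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

toy_docs = [
    ("en", "the cat sat on the mat".split()),
    ("it", "buona sera a tutti".split()),
]
toy_model = train(toy_docs)
print(classify(toy_model, ["buona", "sera"]))
```

Working in log space avoids underflow when the product over features gets long.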


Language Detection
    with Naive Bayes Classifier
• Document categorization with language
  labels
  – Categorize documents into 'English', 'Japanese'
    and so on
• Use character n-gram as features
  – "Unicode code point n-gram", strictly speaking
  – Assume the character encoding of the document is
    already known
     • Most applications know the encoding of their internal text data

Why Use n-Gram to Detect Language

 • Each language has its own characteristic characters and spelling rules
     – “é” is often used in Spanish, Italian and so on, but in principle
       not in English
     – Many words start with “Z” in German, but not
       in English
     – Many words start with “C” in English, but not in
       German
     – The spelling “Th” is often used in English, but rarely in other
       languages

 n-grams of “This” (□ marks a word boundary)
   1-gram:  T    h    i    s
   2-gram:  □T   Th   hi   is   s□
   3-gram:  □Th  Thi  his  is□

              □C     □L     □Z     Th
   English   0.75   0.47   0.02   0.74
   German    0.10   0.37   0.53   0.03
   French    0.38   0.69   0.01   0.01
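
The n-gram extraction for “This” can be reproduced with a short sketch. The helper name `char_ngrams` and the choice to leave 1-grams unpadded (as in the figure) are assumptions made here for illustration. Python strings are indexed by Unicode code point, which matches the "Unicode code point n-gram" reading.

```python
def char_ngrams(word, n, boundary="\u25a1"):
    """Character n-grams of one word; boundary (□) marks the word edges."""
    if n == 1:
        return list(word)  # 1-grams carry no boundary marker in the figure
    padded = boundary + word + boundary
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

for n in (1, 2, 3):
    print(n, char_ngrams("This", n))
```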
language-detection(langdetect)
                       (Nakatani 2010)

• Language detection library for Java
  – http://code.google.com/p/language-detection/
  – Apache License 2.0
  – Character 3-gram + Bayesian filter
  – Various normalizations + Feature sampling
• Over 99% precision for 53 languages
  – Trained on Wikipedia abstracts
  – Broad language support, including Asian languages
  – Adopted by Apache Solr

Evaluation with News Text
             Language     size     accuracy                  Language              size      accuracy
       af    Afrikaans      200   199 (99.50%)         mr    Marathi                 200    200 (100.00%)
       ar    Arabic         200   200 (100.00%)        ne    Nepali                  200    200 (100.00%)
       bg    Bulgarian      200   200 (100.00%)        nl    Dutch                   200    200 (100.00%)
       bn    Bengali        200   200 (100.00%)        no    Norwegian               200    199 (99.50%)
       cs    Czech          200   200 (100.00%)        pa    Punjabi                 200    200 (100.00%)
       da    Danish         200   179 (89.50%)         pl    Polish                  200    200 (100.00%)
       de    German         200   200 (100.00%)        pt    Portuguese              200    200 (100.00%)
       el    Greek          200   200 (100.00%)        ro    Romanian                200    200 (100.00%)
       en    English        200   200 (100.00%)        ru    Russian                 200    200 (100.00%)
       es    Spanish        200   200 (100.00%)        sk    Slovak                  200    200 (100.00%)
       fa    Persian        200   200 (100.00%)        so    Somali                  200    200 (100.00%)
        fi   Finnish        200   200 (100.00%)        sq    Albanian                200    200 (100.00%)
       fr    French         200   200 (100.00%)        sv    Swedish                 200    200 (100.00%)
       gu    Gujarati       200   200 (100.00%)        sw    Swahili                 200    200 (100.00%)
       he    Hebrew         200   200 (100.00%)        ta    Tamil                   200    200 (100.00%)
       hi    Hindi          200   200 (100.00%)        te    Telugu                  200    200 (100.00%)
       hr    Croatian       200   200 (100.00%)        th    Thai                    200    200 (100.00%)
       hu    Hungarian      200   200 (100.00%)         tl   Tagalog                 200    200 (100.00%)
       id    Indonesian     200   200 (100.00%)        tr    Turkish                 200    200 (100.00%)
        it   Italian        200   200 (100.00%)        uk    Ukrainian               200    200 (100.00%)
       ja    Japanese       200   200 (100.00%)        ur    Urdu                    200    200 (100.00%)
       kn    Kannada        200   200 (100.00%)        vi    Vietnamese              200    200 (100.00%)
       ko    Korean         200   200 (100.00%)      zh-cn   Simplified Chinese      200    200 (100.00%)
       mk    Macedonian     200   200 (100.00%)      zh-tw   Traditional Chinese     200    200 (100.00%)
       ml    Malayalam      200   200 (100.00%)                total                9800   9777 (99.77%)


•   Tested on news text crawled from the web in 49 languages
Evaluation with Europarl datasets
• Tested on 1000 samples for each language from the Europarl Parallel Corpus
  – from the proceedings of the European Parliament
  – http://www.statmt.org/europarl/
• http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip

     language         size   correct   accuracy
bg   Bulgarian        1000      988     98.8%
cs   Czech            1000      994     99.4%
da   Danish           1000      968     96.8%
de   German           1000      998     99.8%
el   Greek            1000     1000    100.0%
en   English          1000      996     99.6%
es   Spanish          1000      996     99.6%
et   Estonian         1000      996     99.6%
fi   Finnish          1000      998     99.8%
fr   French           1000      999     99.9%
hu   Hungarian        1000      999     99.9%
it   Italian          1000      999     99.9%
lt   Lithuanian       1000      997     99.7%
lv   Latvian          1000      999     99.9%
nl   Dutch            1000      974     97.4%
pl   Polish           1000      999     99.9%
pt   Portuguese       1000      996     99.6%
ro   Romanian         1000      999     99.9%
sk   Slovak           1000      988     98.8%
sl   Slovene          1000      976     97.6%
sv   Swedish          1000      991     99.1%
     total           21000    20850     99.3%


Language Detection is solved,
            isn't it?



We still have an ENEMY to beat!




Twitter Language Detection
               with the Existing Methods
• Only 90-95% accuracy on the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
  – http://code.google.com/p/chromium-compact-language-detector/
  – ms (Malay) is regarded as id (Indonesian)
• Tika = Apache Tika
  – http://tika.apache.org/
  – Evaluated on the 15 languages in our tweet corpus that Tika supports

     language          LD     CLD     Tika
ca   Catalan         95.3    93.0     83.8
cs   Czech           96.3    96.6     ----
da   Danish          94.5    90.7     58.7
de   German          86.6    96.8     73.1
en   English         88.3    97.4     54.7
es   Spanish         91.5    90.5     44.4
fi   Finnish         98.9    99.4     94.8
fr   French          95.0    94.5     67.4
hu   Hungarian       85.8    89.0     76.2
id   Indonesian      89.7    92.8     ----
it   Italian         96.2    93.8     87.1
nl   Dutch           69.5    93.2     65.0
no   Norwegian       96.0    74.9     68.6
pl   Polish          98.0    97.8     88.8
pt   Portuguese      88.0    88.6     47.4
ro   Romanian        92.8    96.1     82.6
sv   Swedish         96.0    96.4     75.6
tr   Turkish         97.6    97.4     ----
vi   Vietnamese      98.7    98.9     ----
     total           92.2    93.8     70.0

Chromium Compact Language Detection
             (CLD)

• A port of the language detector from
  Google Chromium
  – http://code.google.com/p/chromium-compact-language-detector/

  – Implementation in C++, Python binding
  – # of supported languages : CLD = 76,
    langdetect = 53
  – Accuracy : CLD = 98.82%, langdetect =
    99.22%
      • for 17 languages on Europarl datasets
      •   http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
Is Twitter Language Detection difficult? (1)

 • Tweets are too short to yield enough 3-gram features
   – At most 140 characters on Twitter
   – URLs, mentions and hashtags are not useful for
     detection
 • LIGA [Tromp+ 11]
   – Graph-features based on 3-gram
      • Add long distance features
      • 95~98% accuracy for twitter Language Detection
      • 6 languages (de, en, es, fr, it, nl)
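
Since URLs, mentions and hashtags carry little language signal, one plausible preprocessing step is to drop them before feature extraction. The slides do not say how (or whether) this is done, so the regular expression and the function name `strip_noise` below are illustrative assumptions only.

```python
import re

# Assumed patterns: http(s) URLs, @mentions and #hashtags.
NOISE = re.compile(r"https?://\S+|[@#]\w+")

def strip_noise(tweet):
    """Remove URLs, mentions and hashtags, then collapse whitespace."""
    return " ".join(NOISE.sub(" ", tweet).split())

print(strip_noise("@bob tack kompis! http://t.co/abc #fb"))
```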

Is Twitter Language Detection difficult? (2)

 • Tweets are too noisy
    – Spellings that violate the language's orthography often
      appear
    – Acronyms, abbreviations, lengthened words (like 'Cooooolll')
 • The likelihood of a tweet tends to be lower under a
   language model built from standard text
      OMG    Oh My God                     u       you
      LOL    Laughing Out Loud             ur      your
      LMAO   Laughing My Ass Off           4       for
      F4F    Follow for Follow             i0u     I love you
      MDR    Mort de Rire (French)         k       che (Italian)
      TKT    Ne t'inquiète pas (French)    anke    anche (Italian)
      ※ The letter 'k' isn't used in Italian
Motivation to Detect Short Text Language

• There are many kinds of short text besides tweets
  – Schedules, search queries, bulletin boards and so on
  – There are many questions about short text detection
    on the issue tracker of the langdetect project
     • http://code.google.com/p/language-detection/issues/detail?id=10

• Detection for text that mixes multiple languages
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text



Our Goal
• Over 99% accuracy
  – However, a "one-word sentence" is too
    difficult to detect...
  – Our goal is 99%+ accurate detection for
    "sentences with more than 3 words"




We need
• A model that can extract rich features from
  short text,
  – Maximal substring model
    (∞-gram Logistic Regression)
• and a Twitter-specific language model,
      or a corpus to construct it.
  – a corpus of about 700K tweets with language
    labels

Proposed Method

How to increase features from 3-grams
• The larger n is, the more features there are
• Maximum at n=∞, that is, all substrings
  – But their number is of order O(T^2)

      n              # of n-grams
             freq≧1     freq≧2    freq≧10
      1          79         72         57
      2        1896       1533        902
      3       15970      10369       4525
      4       64966      33941      10534
      5      167543      69719      15538
      6      323749     107861      18970
      7      524634     142954      21093
      8      760719     171995      22159
      9      921361     193995      22696
      :           :          :          :
※ cumulative distribution of feature length for 5090 normalized English tweets (300KB)
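
A table of this shape can be produced by counting distinct substrings of length up to n that reach a frequency threshold. The sketch below is an assumption about how such counts might be computed (the slides do not give the procedure); the function name `cumulative_ngram_counts` and the toy input are hypothetical.

```python
from collections import Counter

def cumulative_ngram_counts(texts, max_n, min_freqs=(1, 2, 10)):
    """For each n, count distinct substrings of length <= n whose total
    frequency over all texts is at least each threshold in min_freqs."""
    freq = Counter()
    rows = []
    for n in range(1, max_n + 1):
        for t in texts:
            freq.update(t[i:i + n] for i in range(len(t) - n + 1))
        rows.append((n, [sum(1 for c in freq.values() if c >= m)
                         for m in min_freqs]))
    return rows

for n, counts in cumulative_ngram_counts(["abab"], 2):
    print(n, counts)
```

Because every substring is stored explicitly, this brute-force count needs memory quadratic in the text length, which is the cost the O(T^2) remark refers to.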
Text Categorization with All Substring Features
                    [Okanohara+ 09]


• Multiclass Logistic Regression using all
  substrings as features
  – Using maximal substrings yields an equivalent
    model that can be constructed in linear
    time


  – Features are stored in a trie for fast prediction
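
How a trie speeds up prediction can be sketched as follows: every start position of the input walks the trie once and adds the weight of each stored feature it passes. This is a simplified illustration with a single scalar weight per feature (a multiclass model would keep one weight per class); the names `build_trie` and `score` and the toy weights are assumptions, not code from [Okanohara+ 09].

```python
def build_trie(weights):
    """weights: dict mapping feature string -> weight. Nested-dict trie;
    the key None at a node stores the weight of the feature ending there."""
    root = {}
    for feat, w in weights.items():
        node = root
        for ch in feat:
            node = node.setdefault(ch, {})
        node[None] = w
    return root

def score(trie, text):
    """Sum the weights of all feature occurrences found in text."""
    total = 0.0
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            node = node.get(ch)
            if node is None:
                break  # no stored feature continues with this character
            total += node.get(None, 0.0)
    return total

toy_trie = build_trie({"a": 1.0, "abra": 2.0, "z": 5.0})
print(score(toy_trie, "abra"))
```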

Maximal Substring (1)
• Define a containment relation (a partial order)
  among non-empty substrings

           abracadabra
  – “ra” ⊂ “bra” ⇔ every occurrence of “ra” is
                    part of an occurrence of “bra”
  – “a” ⊄ “ra”   ⇔ “a” occurs not only in “ra”
                    but also in “ca”
                                ※ Strictly, the definition also takes the position within the substring into account.
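
The two examples can be checked mechanically. The sketch below is one reading of the containment relation, including the position condition from the footnote; the helper names `occurrences` and `is_contained` are hypothetical.

```python
def occurrences(text, sub):
    """Start positions of every (possibly overlapping) occurrence of sub."""
    return [i for i in range(len(text) - len(sub) + 1)
            if text.startswith(sub, i)]

def is_contained(text, small, big):
    """True if every occurrence of small in text lies inside an occurrence
    of big, at an offset where small actually appears within big."""
    offsets = occurrences(big, small)
    big_starts = set(occurrences(text, big))
    return bool(offsets) and all(
        any(pos - off in big_starts for off in offsets)
        for pos in occurrences(text, small)
    )

print(is_contained("abracadabra", "ra", "bra"))
print(is_contained("abracadabra", "a", "ra"))
```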
Maximal Substring (2)




                                  via http://d.hatena.ne.jp/nokuno/20120203/1328237067

• Each equivalence class formed by the containment
  relation has a unique maximal element, which is
  called a "maximal substring".
• The maximal substrings of "abracadabra" are "a", "abra"
  and "abracadabra".
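
For a small string the maximal substrings can be found by brute force: a substring is maximal when its occurrences cannot all be extended by the same character, either to the left or to the right. This quadratic sketch is only for illustration (the function name `maximal_substrings` is an assumption); it is not the linear-time construction.

```python
def maximal_substrings(text):
    """Brute-force maximal substrings via left/right context sets."""
    n = len(text)
    left, right = {}, {}
    for i in range(n):
        for j in range(i + 1, n + 1):
            s = text[i:j]
            # None stands for the text boundary (no neighbouring character)
            left.setdefault(s, set()).add(text[i - 1] if i > 0 else None)
            right.setdefault(s, set()).add(text[j] if j < n else None)

    def extendable(contexts):
        # All occurrences share one real neighbouring character
        return len(contexts) == 1 and None not in contexts

    return sorted(s for s in left
                  if not extendable(left[s]) and not extendable(right[s]))

print(maximal_substrings("abracadabra"))
```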

Maximal Substring and Infinity-Gram

• Frequencies of substrings that are in a
  containment relationship are always equal.

• In a model with a linear combination of
  features, the common feature values can be
  factored out.

• Logistic regression with maximal substrings is
  equivalent to the one with infinity-grams.
                                                  ※ Although the equivalence breaks down on the test set,
                          we assume that it holds approximately given a sufficiently large training set.
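
The factoring step can be written out explicitly. As a sketch, suppose substrings $s_1, \dots, s_m$ belong to one equivalence class with maximal substring $s^*$, and write $\phi_s(x)$ for the feature value (for example the frequency) of $s$ in a training document $x$ and $w_s$ for its weight. These symbols are notation introduced here, not taken from the slides.

```latex
\phi_{s_1}(x) = \dots = \phi_{s_m}(x) = \phi_{s^*}(x)
\;\Longrightarrow\;
\sum_{j=1}^{m} w_{s_j}\,\phi_{s_j}(x)
  = \Bigl(\sum_{j=1}^{m} w_{s_j}\Bigr)\,\phi_{s^*}(x)
  = w'_{s^*}\,\phi_{s^*}(x),
\qquad w'_{s^*} := \sum_{j=1}^{m} w_{s_j}
```

So one weight per maximal substring reproduces the linear score of the model that has one weight per substring.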
Extended Suffix Array
• An Extended Suffix Array consists of
   – SA = suffix array,
   – L = longest common prefixes,
   – B = Burrows-Wheeler transformed text.
• A maximal substring that occurs more than once corresponds
  to an internal node of the suffix tree, which is equivalent to a
  suffix with L>0 whose BWT has more than one character type.
   – They can be calculated in linear time.


• esaxx : Okanohara's implementation of ESA
   – http://code.google.com/p/esaxx/
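
The three components can be illustrated on "abracadabra" with a naive construction. This sketch sorts suffixes directly, so it is far from linear time and is not how esaxx works; the function names and the "$" end marker are assumptions made for the example.

```python
def suffix_array(text):
    """Start positions of all suffixes in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp_array(text, sa):
    """lcp[i] = length of the common prefix of suffixes sa[i-1] and sa[i]."""
    def lcp(a, b):
        k = 0
        while a + k < len(text) and b + k < len(text) \
                and text[a + k] == text[b + k]:
            k += 1
        return k
    return [0] + [lcp(sa[i - 1], sa[i]) for i in range(1, len(sa))]

def bwt(text, sa):
    """Character preceding each sorted suffix (cyclic for position 0)."""
    return "".join(text[i - 1] for i in sa)

example = "abracadabra$"  # "$" sorts before the letters and ends the text
sa = suffix_array(example)
print(sa)
print(lcp_array(example, sa))
print(bwt(example, sa))
```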



                                                                       via [Okanohara+ 09]
Corpus and Normalization

Target Languages
• Limit the character type (script) to detect
  – In short text detection, mixed text can be
    split by character type
• Latin alphabet languages
  – The most difficult character type to detect
  – More than 25 of these languages have over
    5 million speakers.


What's Latin Alphabet?
• Latin alphabet ≠ ASCII alphabet
  – å, ą, æ, ð, Ħ, ŋ and so on...
• They are assigned to 9 code blocks in Unicode

     Range         Name                           Note
   U+0000-007F   Basic Latin                    ASCII
   U+0080-00FF   Latin-1 Supplement             Most languages are covered by this block
   U+0100-017F   Latin Extended-A               and Latin Extended-A
   U+0180-024F   Latin Extended-B               Romanian
   U+0250-02AF   IPA Extensions
   U+0300-036F   Combining Diacritical Marks    for tone symbol composition
   U+1E00-1EFF   Latin Extended Additional      Vietnamese
   U+2C60-2C7F   Latin Extended-C               Used by almost no present-day
   U+A720-A7FF   Latin Extended-D               language (both Extended-C and -D)
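
The block table translates directly into a range lookup. In this sketch the names `LATIN_BLOCKS` and `latin_block` are hypothetical; note that membership in a block says nothing about whether a code point is a letter (Basic Latin also holds digits and punctuation).

```python
# The nine code blocks listed in the table, as (first, last, name).
LATIN_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0x2C60, 0x2C7F, "Latin Extended-C"),
    (0xA720, 0xA7FF, "Latin Extended-D"),
]

def latin_block(ch):
    """Name of the listed block containing ch, or None if outside all nine."""
    cp = ord(ch)
    for first, last, name in LATIN_BLOCKS:
        if first <= cp <= last:
            return name
    return None

for ch in "a\u00e5\u0105\u0219\u1eb5\u044f":
    print("U+%04X" % ord(ch), latin_block(ch))
```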
Latin Alphabets
     in Unicode Codepoint Chart
[Figure: Unicode code chart of Latin letters, marked as used often, used sometimes, or used for Vietnamese only]




How to Create Corpus
• Collect tweets with the 'sample' method of the
  Twitter Streaming API
  – It samples 1% of all tweets (about 2
    million tweets).
  – Tweets in Latin alphabet languages
    account for 60% of them.
• All that remains is to annotate these
  tweets with language labels
Language Label Annotation
• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris
    timezone alone
• Annotate tweets with tentative labels using
  langdetect
  – Remove non-French tweets from those labeled ‘fr’
  – Recover French tweets from those not labeled ‘fr’

                           (※ 20% of all tweets have no timezone)
How to annotate




   Swedish, Norwegian, Danish, Vietnamese, Lithuanian,
Czech, Hungarian, Catalan, Romanian and Polish guides in turn
Created Corpus
• Noise-free tweets as training data
• Noisy tweets with more than 3 words as test data
• Joint work with Raúl Velaz and Hiroshi Manabe on the Catalan
  corpus creation

     language         training        test
ca   Catalan             9,089       5,082
cs   Czech               9,082       7,682
da   Danish              7,388       5,524
de   German             44,448      10,065
en   English            44,520      10,168
es   Spanish            44,118      10,265
fi   Finnish             8,087       7,050
fr   French             44,339      10,098
hu   Hungarian          10,030       4,904
id   Indonesian         44,722      10,181
it   Italian            43,366      10,152
nl   Dutch              44,682      10,007
no   Norwegian          10,124       8,496
pl   Polish             16,771      10,152
pt   Portuguese         44,215      10,208
ro   Romanian           10,021       5,911
sv   Swedish            44,054      10,032
tr   Turkish            44,703      10,308
vi   Vietnamese         15,030      10,488
     total             538,789     166,773
Simple Language Detection
• A language detector can be constructed
  from the maximal substring model and the
  Twitter corpus
  – It still reaches at most 98% accuracy.
• We suspect that bias has to be reduced.
  – data size bias
  – language-specific bias
  – twitter-specific bias

Bias by Data Size
• The number of tweets per language is heavily skewed.
• Level them out by sampling with replacement
  from each language up to the size of the largest one
  – In practice this is approximated by copying the data an
    integer number of times and sampling the rest without replacement
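
The approximation in the last bullet can be sketched as follows. The function name `level_out`, the fixed seed and the toy corpus are assumptions for illustration; every language is assumed to have at least one tweet.

```python
import random

def level_out(corpus, seed=0):
    """corpus: dict language -> list of tweets (each list non-empty).
    Grow every language to the size of the largest one by copying it
    a whole number of times and sampling the remainder without replacement."""
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    balanced = {}
    for lang, tweets in corpus.items():
        copies, rest = divmod(target, len(tweets))
        balanced[lang] = tweets * copies + rng.sample(tweets, rest)
    return balanced

toy_corpus = {
    "en": ["e%d" % i for i in range(10)],
    "da": ["d0", "d1", "d2"],
}
balanced = level_out(toy_corpus)
print({lang: len(tweets) for lang, tweets in balanced.items()})
```

Compared with pure sampling with replacement, this keeps every original tweet in the result at least once.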
[Figure legend: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
Convert to Lowercase
       on Multiple Languages
• Converting to lower case reduces the corpus size needed and
  compresses the model.
• But the lower case of I (U+0049) in Turkish
  differs from other languages.
• Convert to lower case, excluding ‘I’

                          Upper case        Lower case
     Turkish,             I (U+0049)        ı (U+0131)
     Azerbaijani          İ (U+0130)        i (U+0069)
     Others               I (U+0049)        i (U+0069)
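
The rule "convert to lower case, excluding ‘I’" can be sketched as below. The function name is hypothetical, and the explicit handling of İ (U+0130) is an added assumption: Python's `str.lower()` turns it into "i" plus a combining dot (U+0307), so the sketch maps it to a plain "i" as in the table.

```python
def lowercase_except_capital_i(text):
    """Lowercase text but leave 'I' (U+0049) untouched, because its
    lower case depends on the language (i vs. dotless ı)."""
    out = []
    for ch in text:
        if ch == "I":
            out.append(ch)
        elif ch == "\u0130":
            out.append("i")  # assumption: İ -> i, following the table
        else:
            out.append(ch.lower())
    return "".join(out)

print(lowercase_except_capital_i("ISTANBUL"))
```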
Normalization for Romanian
• Romanian uses â, ă, î, ș, ț in addition to a-z
• There are 2 kinds of s/t with a “beard”
   – U+015E-F, U+0162-3 : s/t with cedilla
   – U+0218-B : s/t with comma below
      • ‘s/t with cedilla’ is more popular in news, on Twitter and on Wikipedia.
• The 2 kinds of code points have the same design in some fonts...
   – Indistinguishable!!

     ș (U+0219)   ş (U+015F)        ț (U+021B)   ţ (U+0163)
Romanian Character Situation on PCs
• Although Romanian orthography requires that ‘s/t
  with comma’ be used, they were not available
  on PCs until recently.
  – 1989 Democratization in Romania
  – 2001 ‘s/t with comma’ was provided by ISO8859-16(Latin-10) and Unicode
  – 2007 Romania joined the EU
  – 2007 Windows Vista supported ‘s/t with comma’ (available for everyone!)




                                                              ‘s/t with cedilla’ is used
                                                              on an advertisement board
                                                              in Bucharest
Normalization
        for Substitute Characters
• ‘s/t with cedilla’ are substitute characters
  – But they are more popular than the correct ones
  – with cedilla : with comma = 2 : 1
  – The “Romanian IME” outputs the substitutes too :D
• Regard ‘s/t with comma’ as ‘s/t with cedilla’
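
Treating ‘s/t with comma’ as ‘s/t with cedilla’ amounts to a four-entry code point mapping. A minimal sketch, with hypothetical names `COMMA_TO_CEDILLA` and `normalize_romanian`:

```python
# s/t with comma below (U+0218-U+021B) -> s/t with cedilla
COMMA_TO_CEDILLA = {
    0x0218: 0x015E,  # Ș -> Ş
    0x0219: 0x015F,  # ș -> ş
    0x021A: 0x0162,  # Ț -> Ţ
    0x021B: 0x0163,  # ț -> ţ
}

def normalize_romanian(text):
    """Replace comma-below s/t by their cedilla counterparts."""
    return text.translate(COMMA_TO_CEDILLA)

print(normalize_romanian("cuno\u0219tin\u021b\u0103"))
```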


    ț (U+021B)      ţ (U+0163)

                                      I reckon it is similar to
                                         the relationship of
                                    Japanese character ‘SA’!!   ささ
Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem.
• The character ‘yeh’ in Farsi corresponds to 2 code points.
   – Wikipedia uses only ی (U+06CC, Farsi yeh)
   – News sites use only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
   – The popular Arabic charset CP-1256 has no character
     mapped to U+06CC
   – As ‘yeh’ is very frequent in both languages, detection
     fails for almost all Persian text
• Regard U+06CC as U+064A
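A minimal sketch of the last bullet, assuming the text is already a Unicode string. The constant and function names are hypothetical and are not taken from the language-detection library.

```python
FARSI_YEH = "\u06cc"   # ی, used by Persian Wikipedia
ARABIC_YEH = "\u064a"  # ي, used by Persian news text


def normalize_yeh(text: str) -> str:
    """Fold Farsi yeh into Arabic yeh so both spellings share one code point."""
    return text.replace(FARSI_YEH, ARABIC_YEH)


word = "\u0627\u06cc\u0631\u0627\u0646"  # a Persian word spelled with Farsi yeh
print([hex(ord(c)) for c in normalize_yeh(word)])
```

After this step the detector has to tell Persian from Arabic by the remaining letters and their sequences, since ‘yeh’ no longer separates the two sources.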


Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
  – a, ă, â, e, ê, i, y, o, ô, ơ, u, ư
• Vietnamese has 6 tones
  – a, ả, à, ã, á, ạ
  – These tone marks are also used in
    general documents such as news.
• A tone mark can be attached to
  any vowel
  – 12 * 6 = 72
Normalization for Vietnamese (2)
     • Two representations of vowels with
       tones
       1. Use the precomposed characters U+1EA0 - U+1EF9
         • ẵ = U+1EB5
       2. Combine a base vowel with combining diacritical marks
         • ẵ = U+0103 U+0303
       – News and tweets use the two about half and half
     • Normalize 2 into 1
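One way to "normalize 2 into 1" is Unicode NFC composition from Python's standard `unicodedata` module. This is a swapped-in technique for illustration: the slides do not say how ldig performs the step, and NFC also composes non-Vietnamese sequences. The function name is hypothetical.

```python
import unicodedata


def normalize_vietnamese(text: str) -> str:
    """Compose base vowel + combining tone mark into one precomposed character."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"            # ă followed by a combining tilde
composed = normalize_vietnamese(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))  # 2 1 0x1eb5
```

Composing rather than decomposing keeps each toned vowel as a single symbol, so substring features do not split a vowel from its tone mark.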
CJK-Kanji Normalization (1) (on language-detection)

• CJK Kanji have too many characters (more than 20K)
   – Other scripts have only 30-50 characters.
• The character space is very sparse.
   – Characters that don’t occur in the training corpus have no
     probability.
      • e.g. "谢谢", Kanji used in personal names
   – Common, frequent characters are too strong.
      • e.g. a text containing ”的” tends to be detected as Traditional
        Chinese
      • Since Japanese also uses Kana, the probabilities of Kanji in
        Japanese are lower than those in Chinese.



CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a
  representative character
   – (1) K-means clustering
       • Use tf-idf on Wikipedia and Google News
       • K=50 (size of the ASCII alphabet = 52)
   – (2) “Commonly Used Kanji” lists published for Japanese and Chinese
       • Simplified Chinese : 现代汉语常用字表 (3500)
       • Traditional Chinese : 常用国字標準字体表 (4808)
                              ⊂ Big5 level 1 (5401)
       • Japanese : 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
           – 常用漢字 contains few Kanji for personal and place names

• Generate 130 clusters from the product of (1) and (2)
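The sketch below shows one possible reading of "product of (1) and (2)": two Kanji share a cluster only if they have the same k-means label and the same membership pattern across the commonly-used lists. All data here is toy data made up for illustration, and the function names and the choice of the first character as representative are assumptions, not the library's actual procedure.

```python
from collections import defaultdict


def build_representatives(kmeans_label, common_lists):
    """kmeans_label: char -> cluster id from (1); common_lists: name -> set of chars from (2)."""
    names = sorted(common_lists)
    groups = defaultdict(list)
    for ch in sorted(kmeans_label):
        membership = tuple(ch in common_lists[name] for name in names)
        groups[(kmeans_label[ch], membership)].append(ch)
    # The first character of each group stands in for the whole group.
    return {ch: members[0] for members in groups.values() for ch in members}


def normalize_kanji(text, representatives):
    return "".join(representatives.get(ch, ch) for ch in text)


# Toy inputs (hypothetical labels and list memberships).
toy_labels = {"的": 0, "是": 0, "谢": 1, "語": 1, "语": 1}
toy_lists = {
    "ja": {"的", "是", "語"},
    "zh-cn": {"的", "是", "谢", "语"},
    "zh-tw": {"的", "是", "語"},
}
reps = build_representatives(toy_labels, toy_lists)
print(normalize_kanji("谢谢的", reps))
```

With 50 k-means labels and three lists, only the combinations that actually occur become clusters, which is how a product can end up with a number such as 130 rather than 50 × 8.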


Normalization for Twitter
• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons made of letters, like XD, :p
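The removal list above can be sketched with a few regular expressions. The patterns are illustrative assumptions (for example, the emoticon pattern covers only a handful of letter-based faces) and are not ldig's actual expressions; `strip_twitter_noise` is a hypothetical name.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
# Letter-based emoticons such as XD, xDDD, :p, :D, standing on their own.
EMOTICON = re.compile(r"(?<!\w)(?:[xX]D+|[:;=][-^]?[pPdDoO])(?!\w)")
SPACES = re.compile(r"\s+")


def strip_twitter_noise(text: str) -> str:
    """Remove URLs, mentions, hashtags, RT markers and letter emoticons."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, EMOTICON):
        text = pattern.sub(" ", text)
    return SPACES.sub(" ", text).strip()


print(strip_twitter_noise("RT @xxx xDDD check http://t.co/abc #fun :p ok"))
```

URLs are removed first so that the `:` in `http://` is never mistaken for the start of an emoticon.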


Normalization for Twitter-Specific Representation
• How to handle words like ‘coooooooollllll’?
• Case 1: Build a normalization dictionary using [Brody+
  2011]
   – Unsupervised normalization like coooollll → cool
   – It can’t handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times,
  shrink the run to 2
   – No language has 3 or more consecutive occurrences of the
     same Latin letter in its orthography.
      • Japanese, by contrast, has “かたたたき”, “かわいいいぬ”,
        “あわててて” and so on.
      • Acronyms (like WWW, СССР) are not useful for language detection
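Case 2 fits in a single substitution with a backreference. The character class below (ASCII letters plus U+00C0 to U+024F) is an assumed approximation of "Latin alphabet", and `shrink_repeats` is a hypothetical name.

```python
import re

# A Latin letter followed by at least two more copies of itself.
LATIN_RUN = re.compile(r"([A-Za-z\u00c0-\u024f])\1{2,}")


def shrink_repeats(text: str) -> str:
    """Shrink runs of 3 or more identical Latin letters to exactly 2."""
    return LATIN_RUN.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))  # cooll
print(shrink_repeats("かたたたき"))        # unchanged: not a Latin letter
```

Unlike the dictionary approach, this does not recover "cool" itself, but every spelling of the stretched word collapses to the same string "cooll", which is enough for feature extraction.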

Laugh Normalization
• Each language has its own ways of writing laughter
  – HOW MUCH DO YOU LOVE COACH BEISTE???
    HHAHAHAHAHAH
  – Hihihihi. :) Habe ich regulär 2x die Woche!
  – Tafil con eso...!!! Jajajajajajaja
  – Malo?? Jejejeje XP
  – kekeke chỗ đó làm áo được ko em?
• Shrink them to two repetitions
  – hahahha ⇒ haha
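A minimal sketch of laugh shrinking, assuming the text is already lowercased and that laughs are built from one of the consonants h, j, k plus one vowel, as in the examples above. The pattern and the name `shrink_laugh` are illustrative, not taken from ldig.

```python
import re

# consonant(s) + vowel(s), followed by one or more sloppy repeats of the same pair
LAUGH = re.compile(r"([hjk])+([aeiou])+(?:\1+\2+)+")


def shrink_laugh(text: str) -> str:
    """Rewrite a laugh such as 'hahahha' or 'jejejeje' as two syllables."""
    return LAUGH.sub(r"\1\2\1\2", text)


for sample in ("hahahha", "jejejeje xp", "kekeke", "hihihihi."):
    print(shrink_laugh(sample))
```

Allowing `\1+` and `\2+` inside the repeated part is what lets a mistyped laugh like "hahahha" match as a whole.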
Implementation and Evaluation

Language Detection with Infinity-Gram (ldig)

• Language detection for tweets written in the Latin
  alphabet
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram logistic regression (maximal substrings) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array
Usage (1) Model Initialization
• ldig.py -m [model] --init [corpus]
           -x [maximal substring extractor]
           --ff=[lower limit of frequency]
  – Extracts features from the corpus and initializes the
    model
  – -m : model directory
  – -x : path to the maximal substring extractor
    (executed as an external process)
  – --ff : ignore features whose frequency is below the specified value

Maximal Substring Extractor
• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs are replaced with “ ” and line feeds with U+0001
  – Output lines have the form “[feature]\t[frequency]”
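To make the output format concrete, here is a naive sketch of maximal substring extraction. It counts every substring in quadratic time, so it only suits toy inputs; the real `maxsubst` tool is an external program and is not reproduced here. A substring is treated as maximal when no one-character extension to the left or right occurs equally often. The function name and the `min_freq` parameter are hypothetical (the latter plays a role similar to `--ff`).

```python
from collections import Counter


def maximal_substrings(text: str, min_freq: int = 2) -> dict:
    """Return {substring: frequency} for maximal substrings with frequency >= min_freq."""
    n = len(text)
    counts = Counter(text[i:j] for i in range(n) for j in range(i + 1, n + 1))
    covered = set()  # substrings that a longer string matches in frequency
    for longer, freq in counts.items():
        if len(longer) > 1:
            for shorter in (longer[1:], longer[:-1]):
                if counts[shorter] == freq:
                    covered.add(shorter)
    return {s: c for s, c in counts.items() if c >= min_freq and s not in covered}


for feature, freq in sorted(maximal_substrings("abracadabra").items()):
    print(f"{feature}\t{freq}")
```

For "abracadabra" this keeps only "a" (5 occurrences) and "abra" (2): strings such as "br" or "bra" always occur inside "abra", so they carry no extra information.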




Usage (2) Learn
• ldig.py -m [model] --learning [corpus]
            -e [learning rate] -r [regularizer]
            --wr=[whole regularization]
   – Trains the model on the corpus with one pass of SGD
   – -e : learning rate of SGD
   – -r : coefficient of L1 regularization
   – --wr : number of times to regularize all parameters
      • There are too many parameters to regularize all of
        them at every step
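The following is a simplified sketch of L1-regularized SGD with the cumulative penalty of [Tsuruoka+ 09], written for binary logistic regression with binary features. It is not ldig's code: ldig is multi-class, and the names `eta` and `reg` are assumptions that roughly correspond to `-e` and `-r`. The point to notice is that only the weights of features present in the current example are touched, which is why the penalty bookkeeping (`u`, `q`) is needed.

```python
import math


def train_l1_sgd(data, n_features, eta=0.1, reg=0.01, epochs=5):
    """data: list of (active_feature_indices, label) with label in {0, 1}."""
    w = [0.0] * n_features
    q = [0.0] * n_features  # penalty actually applied to each weight so far
    u = 0.0                 # penalty every weight could have received so far
    for _ in range(epochs):
        for feats, y in data:
            score = sum(w[i] for i in feats)
            p = 1.0 / (1.0 + math.exp(-score))
            u += eta * reg
            for i in feats:
                w[i] += eta * (y - p)          # gradient step
                before = w[i]
                if w[i] > 0:                   # clip towards zero, never across it
                    w[i] = max(0.0, w[i] - (u + q[i]))
                elif w[i] < 0:
                    w[i] = min(0.0, w[i] + (u - q[i]))
                q[i] += w[i] - before
    return w


toy = [([0], 1), ([1], 0)]  # feature 0 marks class 1, feature 1 marks class 0
weights = train_l1_sgd(toy, n_features=3)
print(weights)
```

Feature 2 never occurs, so its weight stays exactly 0; features like this are the ones a later shrink step can drop from the model.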




Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Removes ineffective features (those whose
    parameters are all 0) from the
    model




Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detects the languages of the test data and outputs
    the results and a summary




Data Format
• Training and test data
     – [correct label]\t[metadata]\t[text]

en     u should just enjoy ur vacation sadly
en     :D i'm online but you arent RT that much
en     im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca     [status ID]    [datetime]    [userID]    [language of UI]    @xxx xDDD no m'extranya... Tal volta haguera segut millor per a la humanitat que no l'haguera vist... you know.. xDD
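A small parser for this format might look as follows. It assumes, based on the examples above, that the label is the first tab-separated field, the text is the last, and anything in between is metadata (possibly several fields, possibly none). The `Example` type and `parse_line` function are hypothetical names.

```python
from typing import List, NamedTuple


class Example(NamedTuple):
    label: str
    metadata: List[str]
    text: str


def parse_line(line: str) -> Example:
    """Split one corpus line into label, metadata fields and text."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    return Example(label=fields[0], metadata=fields[1:-1], text=fields[-1])


print(parse_line("en\tu should just enjoy ur vacation sadly\n"))
print(parse_line("ca\t[status ID]\t[datetime]\t[userID]\t[language of UI]\t@xxx xDDD no m'extranya..."))
```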

Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after starting it
  – Shows the language probabilities, the matched
    features and their parameters for text entered
    in the text area




Evaluation
code  language       size   detect  correct  precision  recall  LD53  LDsm
ca    Catalan       5,093    4,923    4,857      98.66   95.37  95.3  97.0
cs    Czech         7,681    7,668    7,663      99.93   99.77  96.3  99.7
da    Danish        5,516    5,472    5,310      97.04   96.27  94.5  92.4
de    German       10,060   10,069   10,006      99.37   99.46  86.6  93.8
en    English      10,162   10,133   10,029      98.97   98.69  88.3  95.0
es    Spanish      10,244   10,284   10,120      98.41   98.79  91.5  96.0
fi    Finnish       7,051    7,038    7,024      99.80   99.62  98.9  99.6
fr    French       10,074   10,134   10,051      99.18   99.77  95.0  98.1
hu    Hungarian     4,904    4,892    4,858      99.30   99.06  85.8  95.5
id    Indonesian   10,178   10,225   10,160      99.36   99.82  89.7  98.9
it    Italian      10,143   10,205   10,103      99.00   99.61  96.2  98.0
nl    Dutch        10,005    9,916    9,858      99.42   98.53  69.5  97.4
no    Norwegian     8,504    8,432    8,201      97.26   96.44  96.0  96.3
pl    Polish       10,151   10,149   10,130      99.81   99.79  98.0  99.7
pt    Portuguese   10,212   10,201   10,119      99.20   99.09  88.0  96.9
ro    Romanian      5,913    5,867    5,850      99.71   98.93  92.8  97.4
sv    Swedish      10,025   10,093    9,942      98.50   99.17  96.0  97.9
tr    Turkish      10,308   10,317   10,298      99.82   99.90  97.6  99.5
vi    Vietnamese   10,487   10,480   10,474      99.94   99.88  98.7  99.2
      total       166,711           165,053              99.01  92.2  97.4
precision, recall, LD53 and LDsm are percentages.
LD53 = langdetect + standard bundled profiles, LDsm = langdetect + profiles based on the Twitter corpus
Texts whose maximum probability is < 0.6 are treated as undetectable, so the sum of detect is less than the sum of size
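The precision and recall columns follow from the three count columns: precision = correct / detect and recall = correct / size. A short check against the Catalan row (the function name is hypothetical):

```python
def precision_recall(size: int, detect: int, correct: int):
    """Return (precision, recall) in percent from the table's count columns."""
    return 100.0 * correct / detect, 100.0 * correct / size


p, r = precision_recall(size=5093, detect=4923, correct=4857)
print(f"{p:.2f} {r:.2f}")  # 98.66 95.37
```

Because undetectable texts count towards size but not towards detect, recall is the column that is penalized by the 0.6 probability threshold.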
Evaluation on the LIGA Dataset
• Evaluated on the LIGA [Tromp+ 11] dataset
  with 9066 tweets in 6 languages
      – http://www.win.tue.nl/~mpechen/projects/smm/

code  language   size  detect  correct  precision  recall
de    German     1479    1476     1469       99.5    99.3
en    English    1505    1502     1490       99.2    99.0
es    Spanish    1562    1548     1541       99.6    98.7
fr    French     1551    1549     1540       99.4    99.3
it    Italian    1539    1531     1528       99.8    99.3
nl    Dutch      1430    1429     1424       99.7    99.6
      total      9066             8992               99.2
※ Using the 19-language model

Evaluation on the Europarl Dataset
                               ldig          langdetect           CLD
code  language     size  correct    rate  correct    rate  correct    rate
bg    Bulgarian    1000        -       -      988   98.8%      991   99.1%
cs    Czech        1000     1000  100.0%      994   99.4%      995   99.5%
da    Danish       1000      976   97.6%      968   96.8%      932   93.2%
de    German       1000      999   99.9%      998   99.8%     1000  100.0%
el    Greek        1000        -       -     1000  100.0%     1000  100.0%
en    English      1000      999   99.9%      996   99.6%     1000  100.0%
es    Spanish      1000     1000  100.0%      996   99.6%      989   98.9%
et    Estonian     1000        -       -      996   99.6%      998   99.8%
fi    Finnish      1000      997   99.7%      998   99.8%     1000  100.0%
fr    French       1000      999   99.9%      999   99.9%      992   99.2%
hu    Hungarian    1000     1000  100.0%      999   99.9%      999   99.9%
it    Italian      1000      999   99.9%      999   99.9%      996   99.6%
lt    Lithuanian   1000        -       -      997   99.7%      999   99.9%
lv    Latvian      1000        -       -      999   99.9%      998   99.8%
nl    Dutch        1000     1000  100.0%      974   97.4%      995   99.5%
pl    Polish       1000      998   99.8%      999   99.9%      997   99.7%
pt    Portuguese   1000      995   99.5%      996   99.6%      989   98.9%
ro    Romanian     1000     1000  100.0%      999   99.9%      998   99.8%
sk    Slovak       1000        -       -      988   98.8%      990   99.0%
sl    Slovene      1000        -       -      976   97.6%      963   96.3%
sv    Swedish      1000      995   99.5%      991   99.1%      993   99.3%
      total       21000    13957   99.7%    20850   99.3%    20814   99.1%
※ The ldig figures cover only the languages ldig supports ("-" = not supported)
Conclusions
• Language detector using a maximal substring model
   – Detects 19 languages with over 99% accuracy.
   – Even langdetect reaches 97% accuracy when trained on the tweet corpus.
• With further corpus maintenance, precision should improve further.
   – There are still many mistakes (in particular for da and no)
• Adding metadata to the features should also improve
  precision.
   – How to add and train metadata at low cost?
• The model should be shrunk without loss of precision.
   – Too large for applications (>100MB)


References
• [Nakatani NLP12] 極大部分文字列を使った twitter 言語判定
  (Twitter Language Detection Using Maximal Substrings, in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring
  Features
• [Brody+ 11] Cooooooooooooooollllllllllllll!!!!!!!!!!!!!!
  Using Word Lengthening to Detect Sentiment in
  Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training
  for L1-regularized Log-linear Models with Cumulative
  Penalty



 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 

Short Text Language Detection with Infinity-Gram

  • 1. Short Text Language Detection with Infinity-Gram 2012/05/14 NAIST Seminar Nakatani Shuyo @ Cybozu Labs Inc
  • 2. Agenda • Language Detection • Proposed Method – Maximal Substring • Corpus • Implementation and Evaluation • Conclusions
  • 3. Language Detection
  • 4. In What Language? • Ik kan er nooit tegen als mensen me negeren. • Aha ich seh angeblich süß aus • Czy mógłbym zasnąć w przedmieściach Twoich myśli? • Ah. Tak. Så skal jeg bare finde ud af *hvordan*! • Det er ikke så digg nei å vi som har finale til helga....Skrekk og gru! Takk :) • tack kompis! Hade faktiskt tänkt maila dig på fb och fråga vart du tog vägen! • Çok doğru. En büyük hatayı yaptım. • Încântat de cunoștință. • Một người dân bị thương và bốn người mất tích sau khi một ngọn núi lửa ở miền trung...
  • 5. Hints • Dutch if there is 'ik' • German if there is 'ich' or the letter 'ß' • Polish if there is 'czy' or the letters 'Ł', 'ń', 'ś' or 'ź' • Scandinavian if there is the letter 'å' – Danish if there is 'af'; 'Tak' means 'thanks' – Norwegian if there is 'nei'; 'Takk' means 'thanks' – Swedish if there is 'och'; 'Tack' means 'thanks' • Turkish if there is the letter 'ı' (dotless 'i') or 'ğ' • Romanian if there is the letter 'ă', 'ș' or 'ț' – Although 'ă' is also used in Vietnamese, the two are easy to distinguish. – Although 'ş' is also used in Turkish, the two are easy to distinguish. • Vietnamese if there are many unreadable letters on WinXP :P
  • 6. In What Language? (Solution) • Ik kan er nooit tegen als mensen me negeren. → Dutch • Aha ich seh angeblich süß aus → German • Czy mógłbym zasnąć w przedmieściach Twoich myśli? → Polish • Ah. Tak. Så skal jeg bare finde ud af *hvordan*! → Danish • Det er ikke så digg nei å vi som har finale til helga....Skrekk og gru! Takk :) → Norwegian • tack kompis! Hade faktiskt tänkt maila dig på fb och fråga vart du tog vägen! → Swedish • Çok doğru. En büyük hatayı yaptım. → Turkish • Încântat de cunoștință. → Romanian • Một người dân bị thương và bốn người mất tích sau khi một ngọn núi lửa ở miền trung... → Vietnamese
  • 7. What Is Language Detection? • To detect which language the input text is written in – Time flies like an arrow → English – Buona sera! → Italian • It is a prerequisite for many language processing tasks – A language model is built for each language – Text search, classification, extraction, translation, ... • For sufficiently long, noise-free text, detection is possible with more than 99% accuracy [Cavnar+ 94] – A 3-gram model is used in many methods
  • 8. SPAM or not? • To decide, it is necessary to know that the text is written in Polish.
  • 9. Document Categorization with Naive Bayes Classifier • Categorize a document $X = (X_i)$ into a category $C_k$ – A document $X$ is represented as a collection of words $X_i$ (bag-of-words) • Words are assumed to be conditionally independent given the category – $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$ (from the independence assumption) – where $p(X_i \mid C_k)$ : relative frequency of word $X_i$ in category $C_k$ • Estimate the category $C_k$ that maximizes the posterior – $p(C_k \mid X) = \frac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$ – where $p(C_k)$ : prior for the category
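The posterior above can be sketched in a few lines of Python. This is a minimal illustration, not the langdetect implementation: the function names `train` and `classify` are made up for this example, and the add-one smoothing is an assumption (the slide does not say how unseen words are handled). Log probabilities are summed to avoid underflow.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (category, list_of_words) pairs."""
    word_counts = defaultdict(Counter)  # per-category word frequencies
    cat_counts = Counter()              # number of documents per category
    vocab = set()
    for cat, words in docs:
        cat_counts[cat] += 1
        word_counts[cat].update(words)
        vocab.update(words)
    return word_counts, cat_counts, vocab

def classify(words, model):
    """Return argmax_k of log p(C_k) + sum_i log p(X_i | C_k)."""
    word_counts, cat_counts, vocab = model
    total_docs = sum(cat_counts.values())
    best, best_score = None, -math.inf
    for cat in cat_counts:
        total = sum(word_counts[cat].values())
        score = math.log(cat_counts[cat] / total_docs)  # prior p(C_k)
        for w in words:
            # add-one smoothing (an assumption, see lead-in)
            score += math.log((word_counts[cat][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = cat, score
    return best

model = train([("en", ["the", "cat"]), ("de", ["die", "katze"])])
print(classify(["the", "dog"], model))
```

For language detection the "words" would be character n-grams rather than whitespace-separated tokens.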
  • 10. Language Detection with Naive Bayes Classifier • Document categorization with language labels – Categorize documents into 'English', 'Japanese' and so on • Use character n-grams as features – "Unicode code point n-grams", strictly speaking – Assume the character encoding of the document is already known • Most applications know the encoding of their internal text data
  • 11. Why Use n-Grams to Detect Language • Each language has its own characters and spelling rules – "é" is often used in Spanish, Italian and so on, but in principle not in English – Many German words start with "Z", but few English words do – Many English words start with "C", but few German words do – The spelling "Th" is common in English, but not in the other languages • n-grams of "□This□" (□ = word boundary) – 1-gram: T, h, i, s – 2-gram: □T, Th, hi, is, s□ – 3-gram: □Th, Thi, his, is□

|         | □C   | □L   | □Z   | Th   |
|---------|------|------|------|------|
| English | 0.75 | 0.47 | 0.02 | 0.74 |
| German  | 0.10 | 0.37 | 0.53 | 0.03 |
| French  | 0.38 | 0.69 | 0.01 | 0.01 |
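The n-gram extraction shown for "This" can be reproduced with a short sketch. The function name `char_ngrams` and the choice of "□" as the padding symbol are assumptions taken from the slide's notation, not from the langdetect source; as on the slide, 1-grams are taken without padding.

```python
def char_ngrams(word, n, pad="□"):
    """Return the character n-grams of `word`, padded with a boundary mark."""
    if n == 1:
        return list(word)          # 1-grams: the letters themselves
    padded = pad + word + pad      # mark the word boundaries
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

for n in (1, 2, 3):
    print(n, char_ngrams("This", n))
```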
  • 12. language-detection (langdetect) (Nakatani 2010) • Language detection library for Java – http://code.google.com/p/language-detection/ – Apache License 2.0 – Character 3-gram + Bayesian filter – Various normalizations + feature sampling • Over 99% precision for 53 languages – Trained on Wikipedia abstracts – Broad language support, including Asian languages – Adopted by Apache Solr
  • 13. Evaluation with News Text • Test on news text crawled from the web in 49 languages

| Code  | Language            | Size | Correct | Accuracy |
|-------|---------------------|------|---------|----------|
| af    | Afrikaans           | 200  | 199     | 99.50%   |
| ar    | Arabic              | 200  | 200     | 100.00%  |
| bg    | Bulgarian           | 200  | 200     | 100.00%  |
| bn    | Bengali             | 200  | 200     | 100.00%  |
| cs    | Czech               | 200  | 200     | 100.00%  |
| da    | Danish              | 200  | 179     | 89.50%   |
| de    | German              | 200  | 200     | 100.00%  |
| el    | Greek               | 200  | 200     | 100.00%  |
| en    | English             | 200  | 200     | 100.00%  |
| es    | Spanish             | 200  | 200     | 100.00%  |
| fa    | Persian             | 200  | 200     | 100.00%  |
| fi    | Finnish             | 200  | 200     | 100.00%  |
| fr    | French              | 200  | 200     | 100.00%  |
| gu    | Gujarati            | 200  | 200     | 100.00%  |
| he    | Hebrew              | 200  | 200     | 100.00%  |
| hi    | Hindi               | 200  | 200     | 100.00%  |
| hr    | Croatian            | 200  | 200     | 100.00%  |
| hu    | Hungarian           | 200  | 200     | 100.00%  |
| id    | Indonesian          | 200  | 200     | 100.00%  |
| it    | Italian             | 200  | 200     | 100.00%  |
| ja    | Japanese            | 200  | 200     | 100.00%  |
| kn    | Kannada             | 200  | 200     | 100.00%  |
| ko    | Korean              | 200  | 200     | 100.00%  |
| mk    | Macedonian          | 200  | 200     | 100.00%  |
| ml    | Malayalam           | 200  | 200     | 100.00%  |
| mr    | Marathi             | 200  | 200     | 100.00%  |
| ne    | Nepali              | 200  | 200     | 100.00%  |
| nl    | Dutch               | 200  | 200     | 100.00%  |
| no    | Norwegian           | 200  | 199     | 99.50%   |
| pa    | Punjabi             | 200  | 200     | 100.00%  |
| pl    | Polish              | 200  | 200     | 100.00%  |
| pt    | Portuguese          | 200  | 200     | 100.00%  |
| ro    | Romanian            | 200  | 200     | 100.00%  |
| ru    | Russian             | 200  | 200     | 100.00%  |
| sk    | Slovak              | 200  | 200     | 100.00%  |
| so    | Somali              | 200  | 200     | 100.00%  |
| sq    | Albanian            | 200  | 200     | 100.00%  |
| sv    | Swedish             | 200  | 200     | 100.00%  |
| sw    | Swahili             | 200  | 200     | 100.00%  |
| ta    | Tamil               | 200  | 200     | 100.00%  |
| te    | Telugu              | 200  | 200     | 100.00%  |
| th    | Thai                | 200  | 200     | 100.00%  |
| tl    | Tagalog             | 200  | 200     | 100.00%  |
| tr    | Turkish             | 200  | 200     | 100.00%  |
| uk    | Ukrainian           | 200  | 200     | 100.00%  |
| ur    | Urdu                | 200  | 200     | 100.00%  |
| vi    | Vietnamese          | 200  | 200     | 100.00%  |
| zh-cn | Simplified Chinese  | 200  | 200     | 100.00%  |
| zh-tw | Traditional Chinese | 200  | 200     | 100.00%  |
| total |                     | 9800 | 9777    | 99.77%   |

  • 14. Evaluation with Europarl Datasets • Test on 1000 samples per language from the Europarl Parallel Corpus – from the proceedings of the European Parliament – http://www.statmt.org/europarl/ • http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip

| Code  | Language   | Size  | Correct | Accuracy |
|-------|------------|-------|---------|----------|
| bg    | Bulgarian  | 1000  | 988     | 98.8%    |
| cs    | Czech      | 1000  | 994     | 99.4%    |
| da    | Danish     | 1000  | 968     | 96.8%    |
| de    | German     | 1000  | 998     | 99.8%    |
| el    | Greek      | 1000  | 1000    | 100.0%   |
| en    | English    | 1000  | 996     | 99.6%    |
| es    | Spanish    | 1000  | 996     | 99.6%    |
| et    | Estonian   | 1000  | 996     | 99.6%    |
| fi    | Finnish    | 1000  | 998     | 99.8%    |
| fr    | French     | 1000  | 999     | 99.9%    |
| hu    | Hungarian  | 1000  | 999     | 99.9%    |
| it    | Italian    | 1000  | 999     | 99.9%    |
| lt    | Lithuanian | 1000  | 997     | 99.7%    |
| lv    | Latvian    | 1000  | 999     | 99.9%    |
| nl    | Dutch      | 1000  | 974     | 97.4%    |
| pl    | Polish     | 1000  | 999     | 99.9%    |
| pt    | Portuguese | 1000  | 996     | 99.6%    |
| ro    | Romanian   | 1000  | 999     | 99.9%    |
| sk    | Slovak     | 1000  | 988     | 98.8%    |
| sl    | Slovene    | 1000  | 976     | 97.6%    |
| sv    | Swedish    | 1000  | 991     | 99.1%    |
| total |            | 21000 | 20850   | 99.3%    |
  • 15. So language detection is a solved problem, isn't it?
  • 16. We still have an ENEMY to beat!
  • 17. Twitter Language Detection with the Existing Methods • Only 90-95% accuracy on the tweet corpus • LD = language-detection • CLD = Chromium Compact Language Detection – http://code.google.com/p/chromium-compact-language-detector/ – ms (Malay) is regarded as id (Indonesian) • Tika = Apache Tika – http://tika.apache.org/ – Evaluated on the 15 languages in our tweet corpus that Tika supports

| Code  | Language   | LD   | CLD  | Tika |
|-------|------------|------|------|------|
| ca    | Catalan    | 95.3 | 93.0 | 83.8 |
| cs    | Czech      | 96.3 | 96.6 | ---- |
| da    | Danish     | 94.5 | 90.7 | 58.7 |
| de    | German     | 86.6 | 96.8 | 73.1 |
| en    | English    | 88.3 | 97.4 | 54.7 |
| es    | Spanish    | 91.5 | 90.5 | 44.4 |
| fi    | Finnish    | 98.9 | 99.4 | 94.8 |
| fr    | French     | 95.0 | 94.5 | 67.4 |
| hu    | Hungarian  | 85.8 | 89.0 | 76.2 |
| id    | Indonesian | 89.7 | 92.8 | ---- |
| it    | Italian    | 96.2 | 93.8 | 87.1 |
| nl    | Dutch      | 69.5 | 93.2 | 65.0 |
| no    | Norwegian  | 96.0 | 74.9 | 68.6 |
| pl    | Polish     | 98.0 | 97.8 | 88.8 |
| pt    | Portuguese | 88.0 | 88.6 | 47.4 |
| ro    | Romanian   | 92.8 | 96.1 | 82.6 |
| sv    | Swedish    | 96.0 | 96.4 | 75.6 |
| tr    | Turkish    | 97.6 | 97.4 | ---- |
| vi    | Vietnamese | 98.7 | 98.9 | ---- |
| total |            | 92.2 | 93.8 | 70.0 |
  • 18. Chromium Compact Language Detection (CLD) • A port of the language detector from Google Chromium – http://code.google.com/p/chromium-compact-language-detector/ – Implemented in C++, with a Python binding – Number of supported languages: CLD = 76, langdetect = 53 – Accuracy: CLD = 98.82%, langdetect = 99.22% • for 17 languages on the Europarl datasets • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
  • 19. Is Twitter Language Detection Difficult? (1) • Tweets are too short to yield enough 3-gram features – At most 140 characters on Twitter – URLs, mentions and hashtags are not useful for detection • LIGA [Tromp+ 11] – Graph features based on 3-grams • Adds long-distance features • 95~98% accuracy for Twitter language detection • 6 languages (de, en, es, fr, it, nl)
  • 20. Is Twitter Language Detection Difficult? (2) • Tweets are too noisy – Spellings that violate the language's orthography often appear – Acronyms, abbreviations, lengthened words (like 'Cooooolll') • The likelihood of a tweet tends to be low under a language model built from standard text • Examples: OMG = Oh My God; LOL = Laughing Out Loud; LMAO = Laughing My Ass Off; F4F = Follow for Follow; MDR = Mort de Rire (French); TKT = Ne t'inquiète pas (French); u = you; ur = your; 4 = for; i0u = I love you; k = che (Italian); anke = anche (Italian) – The letter 'k' isn't used in Italian
  • 21. Motivation to Detect Short Text Language • There are many small chunks of text besides tweets – Schedule entries, search queries, bulletin board posts and so on – There are many questions about short text detection in the issue tracker of the langdetect project • http://code.google.com/p/language-detection/issues/detail?id=10 • Detection for text that mixes multiple languages – Split the target document into paragraphs or lines – Detect the language of each short piece
  • 22. Our Goal • Over 99% accuracy – However, it is too difficult to detect a "one word sentence"... – Our goal is 99%+ accurate detection for "sentences with more than 3 words"
  • 23. We Need • A model that can extract rich features from short text – Maximal substring model (∞-gram Logistic Regression) • A Twitter-specific language model, or a corpus to construct it from – A corpus of about 700K tweets with language labels
  • 24. Proposed Method
  • 25. How to Increase Features Beyond 3-Grams • The larger n is, the more features there are • The maximum is at n=∞, that is, all substrings – But the number of substrings is of order O(T^2)

Number of n-grams:

| n | freq ≥ 1 | freq ≥ 2 | freq ≥ 10 |
|---|----------|----------|-----------|
| 1 | 79       | 72       | 57        |
| 2 | 1896     | 1533     | 902       |
| 3 | 15970    | 10369    | 4525      |
| 4 | 64966    | 33941    | 10534     |
| 5 | 167543   | 69719    | 15538     |
| 6 | 323749   | 107861   | 18970     |
| 7 | 524634   | 142954   | 21093     |
| 8 | 760719   | 171995   | 22159     |
| 9 | 921361   | 193995   | 22696     |
| : | :        | :        | :         |

※ Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
  • 26. Text Categorization with All Substring Features [Okanohara+ 09] • Multiclass logistic regression using all substrings as features – Using maximal substrings yields an equivalent model that can be constructed in linear time – Features are stored in a trie for fast prediction
  • 27. Maximal Substring (1) • Define a containment relation (a partial order) among the non-empty substrings of a text, e.g. "abracadabra" – "ra" ⊂ "bra" ⇔ every occurrence of "ra" is part of an occurrence of "bra" – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca" ※ Strictly, the definition also takes into account the position within the containing substring.
  • 28. Maximal Substring (2) via http://d.hatena.ne.jp/nokuno/20120203/1328237067 • Each equivalence class formed by the containment relation has a unique maximal element, called a "maximal substring". • The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra".
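The "abracadabra" example can be checked with a brute-force sketch. This is only an illustration of the definition: it enumerates every substring and is far slower than the linear-time construction the talk relies on. It uses the characterization that a substring is maximal when extending it by one character on either side always lowers its occurrence count; the function names are made up for this example.

```python
from collections import Counter

def substring_counts(text):
    """Count occurrences (by start position) of every non-empty substring."""
    counts = Counter()
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            counts[text[i:j]] += 1
    return counts

def maximal_substrings(text):
    """Substrings whose every one-character extension occurs less often."""
    counts = substring_counts(text)
    alphabet = set(text)
    result = []
    for s, c in counts.items():
        same_count_extension = any(
            counts.get(ch + s, 0) == c or counts.get(s + ch, 0) == c
            for ch in alphabet
        )
        if not same_count_extension:
            result.append(s)
    return sorted(result, key=lambda s: (len(s), s))

print(maximal_substrings("abracadabra"))
```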
  • 29. Maximal Substring and Infinity-Gram • Substrings in a containment relationship always have equal frequencies. • In a model that is a linear combination of features, features that always share the same value can be merged. • Logistic regression with maximal substrings is therefore equivalent to logistic regression with infinity-grams. ※ Although the equivalence does not hold exactly on the test set, we assume that it is well approximated given a sufficiently large training set.
  • 30. Extended Suffix Array • An extended suffix array (ESA) consists of – SA = suffix array, – L = longest common prefixes, – B = Burrows-Wheeler transformed (BWT) text. • A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L>0 whose BWT has more than one character type. – These can be computed in linear time. • esaxx: Okanohara's implementation of ESA – http://code.google.com/p/esaxx/ (via [Okanohara+ 09])
  • 31. Corpus and Normalization
  • 32. Target Languages • Limit the character type to detect – In short text detection, mixed text can be split by character type • Latin-alphabet languages – The most difficult alphabet type to detect – More than 25 of these languages have over 5 million speakers
  • 33. What's the Latin Alphabet? • Latin alphabet ≠ ASCII alphabet – å, ą, æ, ð, Ħ, ŋ and so on... • The letters are assigned to 9 code blocks in Unicode

| Range       | Name                        | Note                                          |
|-------------|-----------------------------|-----------------------------------------------|
| U+0000-007F | Basic Latin                 | ASCII                                         |
| U+0080-00FF | Latin-1 Supplement          | Most languages are covered with these         |
| U+0100-017F | Latin Extended-A            | Most languages are covered with these         |
| U+0180-024F | Latin Extended-B            | Romanian                                      |
| U+0250-02AF | IPA Extensions              |                                               |
| U+0300-036F | Combining Diacritical Marks | For tone symbol composition                   |
| U+1E00-1EFF | Latin Extended Additional   | Vietnamese                                    |
| U+2C60-2C7F | Latin Extended-C            | Used by almost no present-day languages       |
| U+A720-A7FF | Latin Extended-D            | Used by almost no present-day languages       |
  • 34. Latin Alphabets in Unicode [Figure: code point chart; legend: used often, used sometimes, used for Vietnamese only]
  • 35. How to Create the Corpus • Collect tweets with the 'sample' method of the Twitter Streaming API – Samples 1% of all tweets (about 2 million tweets). – Tweets in Latin-alphabet languages account for 60% of them. • All that remains is to annotate these tweets with language labels
  • 36. Language Label Annotation • Group tweets by their timezone – French tweets account for about 1% of all tweets – But they account for 50% of the tweets in the Paris timezone alone • Annotate tweets with tentative labels using langdetect – Remove non-French tweets from those labeled 'fr' – Recover French tweets from those not labeled 'fr' (※ 20% of all tweets have no timezone)
  • 37. How to Annotate • Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides in turn
  • 38. Created Corpus • Noise-free tweets as training data • Noisy tweets with more than 3 words as test data • Work with Raúl Velaz and Hiroshi Manabe on Catalan corpus creation

| Code  | Language   | Training | Test    |
|-------|------------|----------|---------|
| ca    | Catalan    | 9,089    | 5,082   |
| cs    | Czech      | 9,082    | 7,682   |
| da    | Danish     | 7,388    | 5,524   |
| de    | German     | 44,448   | 10,065  |
| en    | English    | 44,520   | 10,168  |
| es    | Spanish    | 44,118   | 10,265  |
| fi    | Finnish    | 8,087    | 7,050   |
| fr    | French     | 44,339   | 10,098  |
| hu    | Hungarian  | 10,030   | 4,904   |
| id    | Indonesian | 44,722   | 10,181  |
| it    | Italian    | 43,366   | 10,152  |
| nl    | Dutch      | 44,682   | 10,007  |
| no    | Norwegian  | 10,124   | 8,496   |
| pl    | Polish     | 16,771   | 10,152  |
| pt    | Portuguese | 44,215   | 10,208  |
| ro    | Romanian   | 10,021   | 5,911   |
| sv    | Swedish    | 44,054   | 10,032  |
| tr    | Turkish    | 44,703   | 10,308  |
| vi    | Vietnamese | 15,030   | 10,488  |
| total |            | 538,789  | 166,773 |
  • 39. Simple Language Detection • A language detector can be constructed from the maximal substring model and the Twitter corpus – It still reaches at most 98% accuracy. • We guess it is necessary to reduce bias. – data size bias – language-specific bias – Twitter-specific bias
  • 40. Bias by Data Size • The number of tweets differs hugely between languages. • Level them out by sampling with replacement from each language up to the size of the largest one – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement [Figure: share of tweets by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
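The levelling step can be sketched as follows. This is a minimal illustration of the "integer multiple plus remainder" approximation described above, not the code used to build the corpus; the function name `level_out` and the fixed seed are assumptions for the example, and every language is assumed to have at least one sample.

```python
import random

def level_out(samples_by_lang, seed=0):
    """Oversample every language up to the size of the largest one."""
    rng = random.Random(seed)
    target = max(len(samples) for samples in samples_by_lang.values())
    leveled = {}
    for lang, samples in samples_by_lang.items():
        copies, remainder = divmod(target, len(samples))
        # whole copies of the data, plus a remainder drawn without replacement
        leveled[lang] = samples * copies + rng.sample(samples, remainder)
    return leveled

data = {"en": ["e%d" % i for i in range(10)], "da": ["d0", "d1", "d2"]}
leveled = level_out(data)
print({lang: len(samples) for lang, samples in leveled.items()})
```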
  • 41. Convert to Lowercase on Multiple Languages • Converting to lower case reduces the corpus needed and compresses the model. • But the lower case of I (U+0049) in Turkish differs from that in other languages. • Convert to lower case, excluding 'I'

|                      | Upper case | Lower case |
|----------------------|------------|------------|
| Turkish, Azerbaijani | I (U+0049) | ı (U+0131) |
| Turkish, Azerbaijani | İ (U+0130) | i (U+0069) |
| Others               | I (U+0049) | i (U+0069) |
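A minimal sketch of "lower case, excluding 'I'" is shown below. The function name is made up for this example, and how ldig treats other special cases such as İ (U+0130) is not stated on the slide, so the sketch simply applies Python's per-character `str.lower()` to everything except U+0049.

```python
def lower_except_capital_i(text):
    """Lowercase every character except 'I' (U+0049), which is language-dependent."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)

print(lower_except_capital_i("Istanbul IS Çok DOĞRU"))
```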
  • 42. Normalization for Romanian • Romanian uses â, ă, î, ș, ț in addition to a-z • There are 2 kinds of s/t with a "beard" – U+015E-F, U+0162-3 : s/t with cedilla – U+0218-B : s/t with comma below • 's/t with cedilla' is more popular in news, on Twitter and on Wikipedia. • The 2 codes have the same design in some fonts... – Indistinguishable!! ș (U+0219), ş (U+015F), ț (U+021B), ţ (U+0163)
  • 43. Romanian Characters on PCs • Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently. – 1989 Democratization in Romania – 2001 's/t with comma' was provided by ISO8859-16 (Latin-10) and Unicode – 2007 Romania joined the EU – 2007 Windows Vista supported 's/t with comma' (available for everyone!) • [Photo: 's/t with cedilla' used on an advertisement board in Bucharest]
  • 44. Normalization for Substitute Characters • 's/t with cedilla' are substitute characters – But they are more popular than the correct ones – with cedilla : with comma = 2 : 1 – The "Romanian IME" outputs the substitutes too :D • Regard 's/t with comma' as 's/t with cedilla': ț (U+021B) → ţ (U+0163) • I reckon it is similar to the relationship between the variants of the Japanese character 'SA' (さ)!!
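The mapping from 's/t with comma' to 's/t with cedilla' can be sketched with `str.translate`. The table below covers the four code points listed on the slides (U+0218-B); the names `COMMA_TO_CEDILLA` and `normalize_romanian` are made up for this example and are not taken from ldig.

```python
# Map s/t with comma below (U+0218-021B) to s/t with cedilla.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015E",  # Ș -> Ş
    "\u0219": "\u015F",  # ș -> ş
    "\u021A": "\u0162",  # Ț -> Ţ
    "\u021B": "\u0163",  # ț -> ţ
})

def normalize_romanian(text):
    """Replace comma-below s/t with the more common cedilla forms."""
    return text.translate(COMMA_TO_CEDILLA)

print(normalize_romanian("\u00CEnc\u00E2ntat de cuno\u0219tin\u021B\u0103."))
```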
  • 45. Arabic Character Normalization (on language-detection) • Arabic and Persian have a similar problem. • The character 'yeh' in Farsi corresponds to 2 code points. – Wikipedia uses only ی (U+06CC, Farsi yeh) – News uses only ي (U+064A, Arabic yeh) • U+064A is a substitute in Farsi – The popular Arabic charset CP-1256 has no character mapped to U+06CC – As 'yeh' is used very often in both languages, detection fails for nearly all Persian text • Regard U+06CC as U+064A
  • 46. Normalization for Vietnamese (1) • Vietnamese has 12 vowels – a, ă, â, e, ê, i, y, o, ô, ơ, u, ư • Vietnamese has 6 tones – a, ả, à, ã, á, ạ – These tone symbols are also used in general documents such as news. • The tone symbols can be attached to all vowels – 12 * 6 = 72
  • 47. Normalization for Vietnamese (2) • There are 2 representations of vowels with tones – 1. Use the precomposed characters U+1EA0 - U+1EF9 • ẵ = U+1EB5 – 2. Combine with Diacritical Marks • ẵ = U+0103 U+0303 – The two occur about half and half in news and tweets • Normalize 2 into 1
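One way to turn representation 2 into representation 1 is Unicode NFC normalization, sketched below with Python's standard `unicodedata` module. Whether ldig uses NFC or its own conversion table is not stated on the slide, so treat this as one possible approach rather than the author's implementation.

```python
import unicodedata

def compose_vietnamese(text):
    """Compose base vowel + combining tone mark into a precomposed character."""
    return unicodedata.normalize("NFC", text)

decomposed = "\u0103\u0303"          # ă + combining tilde
composed = compose_vietnamese(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))
```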
  • 48. CJK-Kanji Normalization (1) (on language-detection) • CJK-Kanji has too many characters (more than 20K) – Other character types have only 30-50 characters. • The character space is very sparse. – Characters that don't occur in the training corpus have no probabilities. • e.g. "谢谢", Kanji for person names – Common frequent characters are too strong. • e.g. a text that contains "的" tends to be detected as Traditional Chinese • Because Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese.
  • 49. CJK-Kanji Normalization (2) (on language-detection) • Group Kanji by frequency and normalize each group to a representative character – (1) K-means clustering • Use tf-idf on Wikipedia and Google News • K=50 (size of the ASCII alphabet = 52) – (2) "Commonly Used Kanji" lists provided for Japanese and Chinese • Simplified Chinese : 现代汉语常用字表 (3500) • Traditional Chinese : 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401) • Japanese : 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998 – 常用漢字 contains few Kanji for person names and place names • Generate 130 clusters from the product of (1) and (2)
  • 50. Normalization for Twitter • Simply remove – URLs – mentions – hashtags – RT – emoticons made of letters, such as XD and :p
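The removal steps listed above can be sketched with regular expressions. The patterns below are rough guesses written for this example (in particular, the emoticon pattern only covers forms like XD, xDDD, :p and :D), not the expressions used in ldig.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RT = re.compile(r"\bRT\b")
# letter-based emoticons such as XD, xDDD, :p, :D (a deliberately small set)
FACE = re.compile(r"(?<!\w)(?:[xX]D+|[:;=][dDpP])(?!\w)")

def clean_tweet(text):
    """Strip URLs, mentions, hashtags, RT markers and letter emoticons."""
    for pattern in (URL, MENTION, HASHTAG, RT, FACE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse the leftover whitespace

print(clean_tweet("RT @xxx xDDD no m'extranya... http://t.co/abc #cat :p"))
```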
  • 51. Normalization for Twitter-Specific Representation • How to handle lengthened words like 'coooooooollllll' • Case 1: Make a normalization dictionary using [Brody+ 2011] – Unsupervised normalization like coooollll → cool – It can't handle words that are not in the dictionary • Case 2: If the same character is repeated 3 or more times, shrink the run to 2 – No language's orthography has a run of 3 or more identical Latin letters. • Japanese, by contrast, has "かたたたき", "かわいいいぬ", "あわててて" and so on. • Acronyms (like WWW, СССР) are not useful for language detection
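Case 2 fits in a single regular expression. The sketch below reads the rule as "a run of three or more identical letters becomes two", which is an interpretation of the slide; the function name is made up for this example.

```python
import re

# A letter (no digits, underscore or punctuation) followed by 2+ copies of itself.
REPEATED_LETTER = re.compile(r"([^\W\d_])\1{2,}")

def shrink_repeats(text):
    """Shrink any run of 3 or more identical letters down to 2."""
    return REPEATED_LETTER.sub(r"\1\1", text)

print(shrink_repeats("coooooooollllll"))
```

Note that this does not restore the dictionary form: 'coooooooollllll' becomes 'cooll', not 'cool'.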
  • 52. Laugh Normalization • Each language has its own ways of writing laughter – HOW MUCH DO YOU LOVE COACH BEISTE??? HHAHAHAHAHAH – Hihihihi. :) Habe ich regulär 2x die Woche! – Tafil con eso...!!! Jajajajajajaja – Malo?? Jejejeje XP – kekeke chỗ đó làm áo được ko em? • Shrink them to two repetitions – hahahha ⇒ haha
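A rough heuristic for laugh shrinking is sketched below. It is an assumption-laden illustration, not ldig's rule: it expects already lowercased input, only knows the consonants h, j and k, and treats a word as a laugh when it alternates one consonant and one vowel at least three times.

```python
import re

# One consonant (h/j/k) and one vowel alternating at least 3 times, as a whole word.
LAUGH = re.compile(r"\b([hjk])\1*([aeiou])\2*(?:\1+\2+){2,}\1*\b")

def shrink_laugh(text):
    """Replace laughs such as 'hahahha' or 'jajajaja' with two syllables."""
    return LAUGH.sub(lambda m: (m.group(1) + m.group(2)) * 2, text)

print(shrink_laugh("tafil con eso...!!! jajajajajajaja"))
```

A pattern this simple can misfire on real words that happen to alternate the same two letters, so a real system would need a more careful rule.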
  • 53. Implementation and Evaluation
  • 54. Language Detection with Infinity-Gram (ldig) • Tweet language detection for Latin-alphabet languages – https://github.com/shuyo/ldig • MIT license • The trained model is also distributed there – ∞-gram LR (maximal substring) [Okanohara+ 09] – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09] – Double Array
  • 55. Usage (1) Model Initialization • ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency] – Extract features from the corpus and initialize the model – -m : model directory – -x : path of the maximal substring extractor (executed as an external process) – --ff : ignore features whose frequency is less than the specified value
  • 56. Maximal String Extractor • maxsubst [input file] [output file] – Input is multi-line text • Replace TABs with " " and line feeds with U+0001 in it – Output is "[feature]\t[frequency]"
  • 57. Usage (2) Learn • ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization] – Train the model on the corpus for 1 cycle of SGD – -e : learning rate of SGD – -r : regularizer of L1 regularization – --wr : how many times to regularize all parameters • There are too many parameters to regularize all of them at every step
  • 58. Usage (3) Shrink Model • ldig.py -m [model] --shrink – Remove ineffective features (those whose parameters are all 0) from the model
  • 59. Usage (4) Detect Language • ldig.py -m [model] [test data] – Detect the languages of the test data and output the results and a summary
  • 60. Data Format • Training and test data – [correct label]\t[metadata]\t[text]
    en  u should just enjoy ur vacation sadly
    en  :D i'm online but you arent RT that much
    en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
    ca  [status ID] [datetime] [userID] [language of UI]  @xxx xDDD no m'extranya... Tal volta haguera segut millor per a la humanitat que no l'haguera vist... you know.. xDD
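A record in this format can be split as sketched below. The helper name `parse_record` is hypothetical, and the sketch assumes that the label is the first TAB-separated field, the text is the last, and any fields in between are metadata (possibly none, as the English examples on the slide suggest).

```python
from typing import List, Tuple

def parse_record(line: str) -> Tuple[str, List[str], str]:
    """Split "[label]\\t[metadata...]\\t[text]" into (label, metadata, text)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text")
    return fields[0], fields[1:-1], fields[-1]

label, meta, text = parse_record("en\tim gettin attacked for a tweet\n")
print(label, meta, text)  # en [] im gettin attacked for a tweet
```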
  • 61. Usage (5) Estimation Tool • server.py -m [model] -p [port number] – Open http://localhost:[port] after starting it – For a text entered in the text area, it outputs the language probabilities, the features the text contains, and their parameters
  • 62. Estimation
    language          size   detect  correct  precision  recall  LD53  LDsm
    ca Catalan       5,093    4,923    4,857      98.66   95.37  95.3  97.0
    cs Czech         7,681    7,668    7,663      99.93   99.77  96.3  99.7
    da Danish        5,516    5,472    5,310      97.04   96.27  94.5  92.4
    de German       10,060   10,069   10,006      99.37   99.46  86.6  93.8
    en English      10,162   10,133   10,029      98.97   98.69  88.3  95.0
    es Spanish      10,244   10,284   10,120      98.41   98.79  91.5  96.0
    fi Finnish       7,051    7,038    7,024      99.80   99.62  98.9  99.6
    fr French       10,074   10,134   10,051      99.18   99.77  95.0  98.1
    hu Hungarian     4,904    4,892    4,858      99.30   99.06  85.8  95.5
    id Indonesian   10,178   10,225   10,160      99.36   99.82  89.7  98.9
    it Italian      10,143   10,205   10,103      99.00   99.61  96.2  98.0
    nl Dutch        10,005    9,916    9,858      99.42   98.53  69.5  97.4
    no Norwegian     8,504    8,432    8,201      97.26   96.44  96.0  96.3
    pl Polish       10,151   10,149   10,130      99.81   99.79  98.0  99.7
    pt Portuguese   10,212   10,201   10,119      99.20   99.09  88.0  96.9
    ro Romanian      5,913    5,867    5,850      99.71   98.93  92.8  97.4
    sv Swedish      10,025   10,093    9,942      98.50   99.17  96.0  97.9
    tr Turkish      10,308   10,317   10,298      99.82   99.90  97.6  99.5
    vi Vietnamese   10,487   10,480   10,474      99.94   99.88  98.7  99.2
    total          166,711        -  165,053          -   99.01  92.2  97.4
    LD53 = langdetect + standard bundled profiles, LDsm = langdetect + profiles based on the twitter corpus. The total rate of 99.01 is correct / size. A text whose maximum probability is below 0.6 is treated as undetectable, so the sum of detect is less than the sum of size.
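The precision and recall columns follow from the three counts: precision is correct / detect and recall is correct / size. The sketch below recomputes the Catalan row and also illustrates the 0.6 rejection rule from the note. The function names are hypothetical, and the probability dictionary passed to `decide` is made-up input for illustration.

```python
from typing import Dict, Optional, Tuple

def precision_recall(size: int, detect: int, correct: int) -> Tuple[float, float]:
    """Return (precision, recall) in percent from the table's three counts."""
    return 100.0 * correct / detect, 100.0 * correct / size

def decide(probs: Dict[str, float], threshold: float = 0.6) -> Optional[str]:
    """Return the most probable language, or None if it is below the threshold."""
    lang = max(probs, key=probs.get)
    return lang if probs[lang] >= threshold else None

p, r = precision_recall(size=5093, detect=4923, correct=4857)
print(round(p, 2), round(r, 2))  # 98.66 95.37
```

A text such as one scored `{"da": 0.55, "no": 0.45}` would be rejected by `decide`, which is why detect can fall short of size for a language.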
  • 63. Estimation for LIGA dataset • Estimated on the LIGA [Tromp+ 11] dataset: 9066 tweets in 6 languages – http://www.win.tue.nl/~mpechen/projects/smm/
    language        size  detect  correct  precision  recall
    de German       1479    1476     1469       99.5    99.3
    en English      1505    1502     1490       99.2    99.0
    es Spanish      1562    1548     1541       99.6    98.7
    fr French       1551    1549     1540       99.4    99.3
    it Italian      1539    1531     1528       99.8    99.3
    nl Dutch        1430    1429     1424       99.7    99.6
    total           9066       -     8992          -    99.2
    ※ Uses the 19-language model
  • 64. Estimation for Europarl Dataset
                              ldig        langdetect          CLD
    language        size  correct   rate  correct   rate  correct   rate
    bg Bulgarian    1000        -      -      988  98.8%      991  99.1%
    cs Czech        1000     1000 100.0%      994  99.4%      995  99.5%
    da Danish       1000      976  97.6%      968  96.8%      932  93.2%
    de German       1000      999  99.9%      998  99.8%     1000 100.0%
    el Greek        1000        -      -     1000 100.0%     1000 100.0%
    en English      1000      999  99.9%      996  99.6%     1000 100.0%
    es Spanish      1000     1000 100.0%      996  99.6%      989  98.9%
    et Estonian     1000        -      -      996  99.6%      998  99.8%
    fi Finnish      1000      997  99.7%      998  99.8%     1000 100.0%
    fr French       1000      999  99.9%      999  99.9%      992  99.2%
    hu Hungarian    1000     1000 100.0%      999  99.9%      999  99.9%
    it Italian      1000      999  99.9%      999  99.9%      996  99.6%
    lt Lithuanian   1000        -      -      997  99.7%      999  99.9%
    lv Latvian      1000        -      -      999  99.9%      998  99.8%
    nl Dutch        1000     1000 100.0%      974  97.4%      995  99.5%
    pl Polish       1000      998  99.8%      999  99.9%      997  99.7%
    pt Portuguese   1000      995  99.5%      996  99.6%      989  98.9%
    ro Romanian     1000     1000 100.0%      999  99.9%      998  99.8%
    sk Slovak       1000        -      -      988  98.8%      990  99.0%
    sl Slovene      1000        -      -      976  97.6%      963  96.3%
    sv Swedish      1000      995  99.5%      991  99.1%      993  99.3%
    total          21000    13957  99.7%    20850  99.3%    20814  99.1%
    ※ The ldig column and its total cover only the languages ldig supports
  • 65. Conclusions • Language detector using a maximal substring model – Detects 19 languages with over 99% accuracy. – Even langdetect reaches 97% accuracy with the tweet corpus. • If the corpus is better maintained, precision will rise further. – There are still many mistakes (in particular for da and no) • If metadata is added to the features, precision will rise further. – How can metadata be added and trained at low cost? • We want to shrink the model without loss of precision. – It is too large for applications (>100MB)
  • 66. References • [Nakatani NLP12] Twitter language detection using maximal substrings (極大部分文字列を使った twitter 言語判定, in Japanese) • [Okanohara+ 09] Text Categorization with All Substring Features • [Brody+ 11] Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs • [Cavnar+ 94] N-Gram-Based Text Categorization • [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty