Japanese linguistics
in Apache Lucene™ and Apache Solr™

             May 9th, 2012

             Christian Moen
          christian@atilika.com
About me
•   MSc. in computer science, University of Oslo, Norway
•   Worked with search at FAST (now Microsoft) for 10 years
     •   5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway
     •   5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan
•   Founded アティリカ株式会社 in 2009
     •   We help companies innovate using search technologies and good ideas
     •   We know information retrieval, natural language processing and big data
     •   We are based in Tokyo, but we have clients everywhere
•   Newbie Lucene & Solr Committer
     •   Mostly been working on Japanese language support (Kuromoji) so far
•   Please write to me at christian@atilika.com or cm@apache.org
Today’s topics

•   Japanese 101 - ordering beer and toasting


•   Japanese language processing


•   Japanese features in Lucene/Solr
Japanese 101
ビールください
 bi-ru kudasai

A beer, please
ありがとうございます!
 arigatō gozaimasu!

Thank you very much!
乾杯!
kanpai!

Cheers!
JR新宿駅の近くにビールを飲みに行こうか?
JR Shinjuku eki no chikaku ni bi-ru o nomi ni ikō ka?

  Shall we go for a beer near JR Shinjuku station?
JR新宿駅の近くにビールを飲みに行こうか?

Romaji - ローマ字
・Latin characters (26+)
・Used for proper nouns, etc.

Katakana - カタカナ
・Phonetic script (~50)
・Typically used for loan words

Kanji - 漢字
・Chinese characters (50,000+)
・Used for stems & proper nouns

Hiragana - ひらがな
・Phonetic script (~50)
・Used for inflections & particles
JR新宿駅の近くにビールを飲みに行こうか?
? What are the words in this sentence?
! Words are implicit in Japanese - there
  is no white space that separates them
JR新宿駅の近くにビールを飲みに行こうか?
? How do we index this for search, then?
! We need to segment text into tokens first
! Two major approaches for segmentation

          1. n-gramming
          2. morphological analysis
            (statistical approach)
n-gramming (n=2)
JR新宿駅の近くにビールを飲みに行こうか?
Shall we go for a beer near JR Shinjuku station?

 n=2
JR R新 新宿 宿駅 駅の の近 近く ...
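The sliding window shown on this slide can be sketched in a few lines of Python. This is only an illustration of character bigrams; the function name `char_ngrams` is made up here and is not how any Lucene tokenizer is implemented.

```python
def char_ngrams(text, n=2):
    """Return every run of n consecutive characters in text, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sentence = "JR新宿駅の近くにビールを飲みに行こうか?"
bigrams = char_ngrams(sentence)

# Print the first seven bigrams, as on the slide.
print(" ".join(bigrams[:7]))
```

A text of 21 characters yields 20 bigrams, which hints at why n-gram indexes grow quickly.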
Problems with n-gramming
  JR ●   R新 ×   新宿 ●   宿駅 ×   駅の ×   の近 ×   近く ●   ...

  宿駅: change of semantics!
        means ‘post town’, ‘relay station’ or ‘stage’

•   Does not preserve meaning well and often changes semantics
     •   Impacts on ranking - search precision (many false positives)
•   Also generates many terms per document or query
     •   Impacts on index size and performance
•   Still sometimes appropriate for certain search applications
     •   Compliance, e-commerce with special product names, ...
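The false-positive problem can be reproduced with a toy bigram matcher. The sketch below assumes a very simplified notion of matching (a query matches when all of its bigrams occur in the document) and uses hypothetical helper names; a real index would also consider term positions.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


document = "JR新宿駅の近くにビールを飲みに行こうか?"
index_terms = char_ngrams(document)


def matches(query):
    """A query matches when every one of its bigrams occurs in the document."""
    return char_ngrams(query) <= index_terms


print(matches("新宿"))  # Shinjuku: a word that really is in the sentence
print(matches("宿駅"))  # 'post town': not a word in the sentence, yet it matches
print(matches("空港"))  # airport: does not occur at all
```

The second query is the change of semantics flagged on the slide: 宿駅 only exists because the bigram window straddles 新宿 and 駅.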
Morphological analysis
JR新宿駅の近くにビールを飲みに行こうか?
Shall we go for a beer near JR Shinjuku station?

JR 新宿 駅 の 近く に ビール を 飲み に 行こ う か ?
 ●  ●  ● ● ●  ●  ●   ●  ●  ● ●  ● ● ●

  •   Tokens reflect what a Japanese speaker considers to be words
  •   Machine-learned statistical approach
       •   Conditional Random Fields (CRFs) decoded using Viterbi
       •   Also does part-of-speech tagging, extracts readings for kanji, etc.
  •   Several statistical models available with high accuracy (F > 0.97)
       •   Models/dictionaries are available as IPADIC, UniDic, ...
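To give a feel for dictionary-based segmentation, here is a deliberately simplified sketch: a dynamic-programming search for the cheapest path through a word lattice using word costs only. Real analysers such as the one described here also score transitions between adjacent tokens with CRF-trained costs, which is what full Viterbi decoding handles. The lexicon, the costs and the names `LEXICON` and `segment` are all invented for illustration and do not come from IPADIC.

```python
# Toy lexicon: lower cost means a more plausible word. All values are made up.
LEXICON = {
    "JR": 1, "新宿": 1, "駅": 1, "宿駅": 4, "新": 3, "宿": 3,
    "の": 1, "近く": 1, "近": 3, "く": 3, "に": 1, "ビール": 1,
    "を": 1, "飲み": 1, "飲": 3, "み": 3, "行こ": 1, "行": 3,
    "こ": 3, "う": 1, "か": 1, "?": 1,
}


def segment(text, lexicon, unknown_cost=10, max_len=8):
    """Return the lowest-cost split of text into lexicon words.

    Characters not covered by the lexicon fall back to single-character
    tokens with a high cost, so a path always exists.
    """
    n = len(text)
    best_cost = [0.0] + [float("inf")] * n  # cheapest cost to cover text[:i]
    back = [0] * (n + 1)                    # start of the last token on that path

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in lexicon:
                cost = lexicon[word]
            elif end - start == 1:
                cost = unknown_cost
            else:
                continue
            total = best_cost[start] + cost
            if total < best_cost[end]:
                best_cost[end] = total
                back[end] = start

    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    end = n
    while end > 0:
        start = back[end]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]


print(" ".join(segment("JR新宿駅の近くにビールを飲みに行こうか?", LEXICON)))
```

With these toy costs the path through 新宿 + 駅 is cheaper than 新 + 宿駅, so the segmentation agrees with the one shown on the slide.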
How does this actually work?
Demo
Japanese support in
  Lucene and Solr
Japanese in Lucene/Solr
! New feature in Lucene/Solr 3.6

! Available out-of-the-box

! Easy to use with reasonable defaults

! Provides sophisticated Japanese linguistics

! Customisable
How do we use it?

      ! Use JapaneseAnalyzer



      ! Use field type “text_ja”
        in example schema.xml
Demo
Feature summary / text_ja analyzer chain

JapaneseTokenizer
    Segments Japanese text into tokens with very high accuracy
    •   Token attributes for part-of-speech, base form, readings, etc.
    •   Compound segmentation with compound synonyms
    •   Segmentation is customisable using user dictionaries

JapaneseBaseFormFilter
    Adjective and verb lemmatisation (by reduction)

JapanesePartOfSpeechStopFilter
    Stop-words removal based on part-of-speech tags
    See example/solr/conf/lang/stoptags_ja.txt

CJKWidthFilter
    Character width normalisation (fast Unicode NFKC subset)

StopFilter
    Stop-words removal
    See example/solr/conf/lang/stopwords_ja.txt

JapaneseKatakanaStemFilter
    Normalises common katakana spelling variations

LowerCaseFilter
    Lowercases
Feature details
Compound nouns
? How do we deal with compound nouns?
       Japanese                  English
    関西国際空港              Kansai International Airport
シニアソフトウェアエンジニア           Senior Software Engineer


! These are one word in Japanese, so
  searching for 空港 (airport) doesn’t match

! We need to segment the compounds, too
Compound segmentation

関西国際空港 (Kansai International Airport)
    関西 (Kansai)   国際 (International)   空港 (Airport)

シニアソフトウェアエンジニア (Senior Software Engineer)
    シニア (Senior)   ソフトウェア (Software)   エンジニア (Engineer)

 ! We are using a heuristic to implement this
Compound synonym tokens
            Position 1              Position 2                Position 3
                関西                      国際                      空港
          関西国際空港

•   Segment the compounds into their parts
    •   Good for recall - we can also search and match 空港 (airport)
•   We keep the compound itself as a synonym
    •   Good for precision with an exact hit because of IDF
•   Approach benefits both precision and recall for overall good ranking
    •   JapaneseTokenizer actually returns a graph of tokens
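The position table above can be modelled as a small token graph. The sketch below is an assumption-laden illustration: the `Token` class and `with_compound_synonym` function are hypothetical and are not the Lucene API, although Lucene expresses the same idea through position and position-length information on tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    term: str
    position: int          # position of the first part the token covers
    position_length: int   # number of positions the token spans


def with_compound_synonym(parts):
    """Emit each part at its own position plus the whole compound spanning all of them."""
    tokens = [Token(part, i + 1, 1) for i, part in enumerate(parts)]
    if len(parts) > 1:
        tokens.insert(0, Token("".join(parts), 1, len(parts)))
    return tokens


tokens = with_compound_synonym(["関西", "国際", "空港"])
for token in tokens:
    print(token.term, token.position, token.position_length)
```

A query for 空港 finds the part at position 3 (recall), while a query for the full compound hits the rarer, higher-IDF term at position 1 (precision).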
Character width normalisation
? How do we deal with character widths?
         Half-width・半角   Full-width・全角
            Lucene        Ｌｕｃｅｎｅ
            ｶﾀｶﾅ          カタカナ
            123           １２３

! Use CJKWidthFilter to normalise them
  (Unicode NFKC subset)

    Input text        Ｌｕｃｅｎｅ     ｶﾀｶﾅ          １２３
    CJKWidthFilter    Lucene        カタカナ      123
                      half-width    full-width    half-width
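The effect on these three examples can be reproduced with Python's standard `unicodedata` module. Note the assumption: this applies full NFKC, which also rewrites other compatibility characters (ligatures, circled digits and so on), whereas the slide describes CJKWidthFilter as a fast subset limited to width folding. The function name `normalise_width` is made up for this sketch.

```python
import unicodedata


def normalise_width(text):
    """Fold full-width Latin/digits to half-width and half-width katakana to full-width."""
    return unicodedata.normalize("NFKC", text)


for raw in ["Ｌｕｃｅｎｅ", "ｶﾀｶﾅ", "１２３"]:
    print(raw, "->", normalise_width(raw))
```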
Katakana end-vowel stemming
? A common spelling variation in
  katakana is an end long-vowel sound
    English     Japanese spelling variations
    manager     マネージャー    マネージャ    マネジャー

! We use JapaneseKatakanaStemFilter to
  normalise/stem the end-vowel for long terms

    Input text                    コピー    マネージャー    マネージャ    マネジャー
    JapaneseKatakanaStemFilter    コピー    マネージャ      マネージャ    マネジャ
                                  copy      manager         manager       “manager”
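The stemming rule in the table can be approximated as follows. This is a sketch, not the filter's source: the `minimum_length=4` threshold is an assumption chosen so that the short term コピー is left untouched, as on the slide, and the helper names are hypothetical.

```python
PROLONGED_SOUND_MARK = "ー"  # U+30FC, the katakana long-vowel mark


def is_katakana(term):
    """True when every character lies in the Unicode Katakana block."""
    return all("\u30a0" <= ch <= "\u30ff" for ch in term)


def stem_katakana(term, minimum_length=4):
    """Drop a trailing long-vowel mark from sufficiently long katakana terms."""
    if (len(term) >= minimum_length
            and is_katakana(term)
            and term.endswith(PROLONGED_SOUND_MARK)):
        return term[:-1]
    return term


for term in ["コピー", "マネージャー", "マネージャ", "マネジャー"]:
    print(term, "->", stem_katakana(term))
```

As the table shows, マネジャー still ends up as マネジャ rather than マネージャ: only the final long vowel is normalised, not the one in the middle of the word.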
Lemmatisation
? Japanese adjectives and verbs are highly
  inflected, how do we deal with that?

Dictionary form
    買う (kau, to buy)

Inflected forms (not exhaustive)
    買いなさい          買いませんでしたら    買える              買わせられる
    買いなさるな        買いませんでしたり    買おう              買わせる
    買いましたら        買いませんなら        買った              買わない
    買いましたり        買うだろう            買ったら            買わないだろう
    買いまして          買うでしょう          買ったり            買わないで
    買いましょう        買うな                買って              買わないでしょう
    買います            買うまい              買わせない          買わなかった
    買いますまい        買え                  買わせます          買わなかったら
    買いませば          買えない              買わせません        買わなかったり
    買いません          買えば                買わせられない      買わなければ
    買いませんで        買えます              買わせられます      買われない
    買いませんでした    買えません            買わせられません    買われます

 ! Use JapaneseBaseFormFilter to normalise
   inflected adjectives and verbs to dictionary form
   (lemmatisation by reduction)
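Lemmatisation by reduction relies on the tokenizer having already attached a base form to each inflected token; the filter then only swaps the surface form for it. A minimal sketch of that idea, with a hypothetical `Token` class and base forms assigned by hand for illustration (a real analyser looks them up in its dictionary):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    surface: str
    base_form: Optional[str] = None  # None when the token has no separate base form


def to_base_forms(tokens):
    """Replace each token's surface form with its base form when one is known."""
    return [token.base_form or token.surface for token in tokens]


# Hand-built tokens for ビールを買った ("bought a beer"); 買っ is an inflected stem of 買う.
tokens = [Token("ビール"), Token("を"), Token("買っ", "買う"), Token("た")]
print(to_base_forms(tokens))
```

Indexing and querying through the same reduction means 買った, 買います and 買わない can all be found by a search for 買う.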
User dictionaries
•   Own dictionaries can be used for ad hoc
    segmentation, i.e. to override default model
•   File format is simple and there’s no need to
    assign weights, etc. before using them
•   Example custom dictionary:
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞

# Custom reading and POS former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
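Reading the example above, each entry appears to have four comma-separated fields: the surface form, its space-separated segmentation, the matching space-separated readings, and a part-of-speech label. The parser below is a sketch built on that reading of the example; the names `UserEntry` and `parse_user_dictionary` are hypothetical and this is not Lucene's own loader.

```python
import csv
from dataclasses import dataclass


@dataclass(frozen=True)
class UserEntry:
    surface: str
    segments: tuple
    readings: tuple
    part_of_speech: str


def parse_user_dictionary(text):
    """Parse user-dictionary lines, skipping blank lines and # comments."""
    lines = [line for line in text.splitlines()
             if line.strip() and not line.lstrip().startswith("#")]
    entries = []
    for surface, segmentation, readings, pos in csv.reader(lines):
        entries.append(UserEntry(surface,
                                 tuple(segmentation.split()),
                                 tuple(readings.split()),
                                 pos))
    return entries


SAMPLE = """\
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞

# Custom reading and POS former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
"""

for entry in parse_user_dictionary(SAMPLE):
    print(entry.surface, entry.segments, entry.readings, entry.part_of_speech)
```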
Japanese focus in 4.0
•   Improvements in JapaneseTokenizer
     •   Improved search mode for katakana compounds
     •   Improved unknown word segmentation
     •   Some performance improvements
•   CharFilters for various character normalisations
     •   Dates and numbers
     •   Repetition marks (odoriji)
•   Japanese spell-checker
     •   Robert and Koji almost got this into 3.6, but it got
         postponed because of API changes being necessary
Acknowledgements
Robert Muir
Thanks for the heavy lifting integrating Kuromoji into Lucene,
for always reviewing my patches quickly, and for the friendly help
Michael McCandless
Thanks for streaming Viterbi and synonym compounds!
Uwe Schindler
Thanks for performance improvements + being the policeman
Simon Willnauer
Thanks for doing the Kuromoji code donation process so well
Gaute Lambertsen & Gerry Hocks
Thanks for presentation feedback and being great colleagues
Q&A
ありがとうございました!
 arigatō gozaimashita!

Thank you very much!

State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

Mehr von lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Kürzlich hochgeladen

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 

Kürzlich hochgeladen (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Japanese Linguistics in Lucene and Solr

  • 1. Japanese linguistics in Apache Lucene™ and Apache Solr™ May 9th, 2012 Christian Moen christian@atilika.com
  • 2. About me • MSc. in computer science, University of Oslo, Norway • Worked with search at FAST (now Microsoft) for 10 years • 5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway • 5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan • Founded アティリカ株式会社 in 2009 • We help companies innovate using search technologies and good ideas • We know information retrieval, natural language processing and big data • We are based in Tokyo, but we have clients everywhere • Newbie Lucene & Solr Committer • Mostly been working on Japanese language support (Kuromoji) so far • Please write me on christian@atilika.com or cm@apache.org
  • 4. Today’s topics • Japanese 101 - ordering beer and toasting • Japanese language processing • Japanese features in Lucene/Solr
  • 15. JR新宿駅の近くにビールを飲みに行こうか? JR Shinjuku eki no chikaku ni bi-ru o nomi ni ikō ka? Shall we go for a beer near JR Shinjuku station?
  • 21. JR新宿駅の近くにビールを飲みに行こうか?
      • Romaji - ローマ字: Latin characters (26+); used for proper nouns, etc.
      • Katakana - カタカナ: phonetic script (~50); typically used for loan words
      • Kanji - 漢字: Chinese characters (50,000+); used for stems & proper nouns
      • Hiragana - ひらがな: phonetic script (~50); used for inflections & particles
  • 24. JR新宿駅の近くにビールを飲みに行こうか? What are the words in this sentence? Words are implicit in Japanese: there is no white space that separates them.
  • 26. JR新宿駅の近くにビールを飲みに行こうか? How do we index this for search, then? We need to segment text into tokens first.
  • 27. Two major approaches for segmentation: 1. n-gramming 2. morphological analysis (statistical approach)
  • 35. n-gramming (n=2) JR新宿駅の近くにビールを飲みに行こうか? Shall we go for a beer near JR Shinjuku station? A two-character window slides over the text one character at a time and emits the bigrams JR R新 新宿 宿駅 駅の の近 近く ...
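The sliding window on the n-gramming slides can be sketched in a few lines of Python. This is only an illustration of character n-grams in general, not the code Lucene uses; the function name `char_ngrams` is made up for this example.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return every overlapping run of n characters in text, left to right."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A window of width n starts at each index where it still fits in the text.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# The first eight characters of the example sentence from the slides.
print(char_ngrams("JR新宿駅の近く"))
# ['JR', 'R新', '新宿', '宿駅', '駅の', 'の近', '近く']
```

A text of k characters produces k - 1 bigrams, which is why the approach generates many terms per document.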
  • 47. Problems with n-gramming
      • Each bigram is marked ● or ×: JR ● R新 × 新宿 ● 宿駅 × 駅の × の近 × 近く ● ...
      • 宿駅 is a change of semantics: it means ‘post town’, ‘relay station’ or ‘stage’
      • Does not preserve meaning well and often changes semantics
          • Hurts ranking and search precision (many false positives)
      • Also generates many terms per document or query
          • Increases index size and hurts performance
      • Still sometimes appropriate for certain search applications
          • Compliance, e-commerce with special product names, ...
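The false-positive problem can be reproduced with a toy bigram index. The sketch below is a hypothetical in-memory index, not Lucene, and the second document is invented for the illustration: a query for 宿駅 (post town) also matches the text about 新宿駅 (Shinjuku station), because the bigram 宿駅 straddles two unrelated words there.

```python
from collections import defaultdict


def bigrams(text):
    """Overlapping character bigrams of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_index(docs):
    """Map each bigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in bigrams(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """AND query: ids of documents that contain every bigram of the query."""
    terms = bigrams(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    "station": "JR新宿駅の近く",  # near JR Shinjuku station
    "history": "江戸時代の宿駅",  # invented text: a post town of the Edo period
}
index = build_index(docs)

print(sorted(search(index, "宿駅")))  # ['history', 'station'], 'station' is a false positive
print(sorted(search(index, "新宿")))  # ['station']
```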
  • 53. Morphological analysis
      • JR新宿駅の近くにビールを飲みに行こうか? Shall we go for a beer near JR Shinjuku station?
      • Segmented into 14 tokens, all marked ●: JR 新宿 駅 の 近く に ビール を 飲み に 行こ う か ?
      • Tokens reflect what a Japanese speaker considers to be words
      • Machine-learned statistical approach
          • Conditional Random Fields (CRFs) decoded using Viterbi
      • Also does part-of-speech tagging, extracts readings for kanji, etc.
      • Several statistical models available with high accuracy (F > 0.97)
      • Models/dictionaries are available as IPADIC, UniDic, ...
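To give a feel for Viterbi decoding over a word lattice, here is a deliberately simplified sketch. It assumes a tiny hand-made dictionary with invented costs (`WORD_COSTS`) and scores a path only by summing per-word costs. Real models such as those built on IPADIC also use costs for the connection between adjacent tokens, learned with CRFs, which this toy version leaves out.

```python
# Hypothetical dictionary: lower cost means a more plausible word.
WORD_COSTS = {
    "新宿": 2, "駅": 2, "宿駅": 4, "新": 5, "宿": 5,
    "の": 1, "近く": 2, "近": 5, "く": 4,
}


def segment(text, word_costs, unknown_cost=10, max_len=8):
    """Return the cheapest segmentation of text as a list of tokens."""
    n = len(text)
    best = [float("inf")] * (n + 1)  # best[i]: lowest cost of segmenting text[:i]
    back = [0] * (n + 1)             # back[i]: start offset of the last token on that path
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_costs:
                cost = word_costs[word]
            elif end - start == 1:
                cost = unknown_cost  # fall back to a single unknown character
            else:
                continue             # not a dictionary word, no lattice edge
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Follow the back pointers from the end of the text to recover the tokens.
    tokens = []
    pos = n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1]


print(segment("新宿駅の近く", WORD_COSTS))
# ['新宿', '駅', 'の', '近く']
```

With these costs the path 新宿 / 駅 / の / 近く totals 7, while the competing path 新 / 宿駅 / の / 近く totals 12, so the dictionary word 宿駅 is present in the lattice but loses.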
  • 54. How does this actually work?
  • 55. Demo
  • 56. Japanese support in Lucene and Solr
  • 62. Japanese in Lucene/Solr
      • New feature in Lucene/Solr 3.6
      • Available out-of-the-box
      • Easy to use with reasonable defaults
      • Provides sophisticated Japanese linguistics
      • Customisable
  • 65. How do we use it?
      • Use JapaneseAnalyzer
      • Use field type “text_ja” in example schema.xml
  • 66. Demo
  • 73. Feature summary / text_ja analyzer chain
      • JapaneseTokenizer: segments Japanese text into tokens with very high accuracy
          • Token attributes for part-of-speech, base form, readings, etc.
          • Compound segmentation with compound synonyms
          • Segmentation is customisable using user dictionaries
      • JapaneseBaseFormFilter: adjective and verb lemmatisation (by reduction)
      • JapanesePartOfSpeechStopFilter: stop-words removal based on part-of-speech tags (see example/solr/conf/lang/stoptags_ja.txt)
      • CJKWidthFilter: character width normalisation (fast Unicode NFKC subset)
      • StopFilter: stop-words removal (see example/solr/conf/lang/stopwords_ja.txt)
      • JapaneseKatakanaStemFilter: normalises common katakana spelling variations
      • LowerCaseFilter: lowercases
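The last four stages of the chain can be imitated in plain Python to show what each one does to a token. This is a rough sketch under several assumptions, not the Lucene implementation: it starts from already segmented tokens, it uses full NFKC normalisation as a stand-in for CJKWidthFilter (which the slide describes as an NFKC subset), the stop-word set is invented, and the katakana rule (drop a trailing ー from katakana tokens of at least four characters) is my reading of how JapaneseKatakanaStemFilter behaves by default.

```python
import unicodedata

PROLONGED_SOUND_MARK = "\u30fc"  # ー


def normalize_width(token):
    """Fold full-width Latin and half-width katakana to their standard forms."""
    return unicodedata.normalize("NFKC", token)


def is_katakana(token):
    """True if every character is in the Unicode Katakana block."""
    return bool(token) and all("\u30a0" <= ch <= "\u30ff" for ch in token)


def stem_katakana(token, minimum_length=4):
    """Drop a trailing prolonged sound mark from long katakana tokens."""
    if (len(token) >= minimum_length and is_katakana(token)
            and token.endswith(PROLONGED_SOUND_MARK)):
        return token[:-1]
    return token


def analyze(tokens, stopwords):
    """Width normalisation, stop-word removal, katakana stemming, lowercasing."""
    out = []
    for token in tokens:
        token = normalize_width(token)
        if token in stopwords:
            continue
        out.append(stem_katakana(token).lower())
    return out


tokens = ["ＪＲ", "新宿", "駅", "の", "ｻｰﾊﾞｰ", "サーバー", "コピー"]
print(analyze(tokens, stopwords={"の"}))
# ['jr', '新宿', '駅', 'サーバ', 'サーバ', 'コピー']
```

Both spellings of "server" end up as the same term サーバ, while the three-character コピー (copy) is left alone because it is shorter than the minimum length.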
  • 78. Compound nouns
      • How do we deal with compound nouns?
          • 関西国際空港: Kansai International Airport
          • シニアソフトウェアエンジニア: Senior Software Engineer
      • These are one word in Japanese, so searching for 空港 (airport) doesn't match
      • We need to segment the compounds, too
  • 82. Compound segmentation
      • 関西国際空港 (Kansai International Airport) → 関西 (Kansai) + 国際 (International) + 空港 (Airport)
      • シニアソフトウェアエンジニア (Senior Software Engineer) → シニア (Senior) + ソフトウェア (Software) + エンジニア (Engineer)
      • We are using a heuristic to implement this
  • 86. Compound synonym tokens
      • Token positions: 関西 (position 1), 国際 (position 2), 空港 (position 3), with the compound 関西国際空港 spanning all three
      • Segment the compound into its parts
          • Good for recall: we can also search and match 空港 (airport)
      • We keep the compound itself as a synonym
          • Good for precision with an exact hit because of IDF
      • Approach benefits both precision and recall for overall good ranking
      • JapaneseTokenizer actually returns a graph of tokens
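The token layout on this slide can be sketched in Python as follows. This is only an illustration of the idea of emitting the parts plus the whole compound as an overlapping token; the `Token` fields and the `compound_tokens` helper are hypothetical, the segmentation is supplied by hand instead of by the tokenizer's heuristic, and the order in which Lucene actually emits the tokens is not modelled.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    term: str
    position: int         # 1-based position where the token starts
    position_length: int  # number of positions the token spans


def compound_tokens(compound: str, parts: List[str]) -> List[Token]:
    """Return one token per part, plus the compound itself as a synonym."""
    if "".join(parts) != compound:
        raise ValueError("parts must concatenate to the compound")
    tokens = [Token(part, i, 1) for i, part in enumerate(parts, start=1)]
    # The compound starts at the first position and spans all of its parts.
    tokens.append(Token(compound, 1, len(parts)))
    return tokens


tokens = compound_tokens("関西国際空港", ["関西", "国際", "空港"])
indexed_terms = {t.term for t in tokens}

for token in tokens:
    print(token)
```

Because both 空港 and 関西国際空港 end up in `indexed_terms`, a query for either one finds the document; the rarer full compound is the term that IDF rewards on an exact hit.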
  • 88. Character width normalisation
      • How do we deal with character widths?
          • Half-width (半角): Lucene, ｶﾀｶﾅ, 123
          • Full-width (全角): Ｌｕｃｅｎｅ, カタカナ, １２３
      • Use CJKWidthFilter to normalise them (Unicode NFKC subset)
          • Input text: Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３
          • After CJKWidthFilter: Lucene (half-width) カタカナ (full-width) 123 (half-width)
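The effect of width normalisation can be tried out with Python's standard `unicodedata` module. Note the assumption: this sketch applies full Unicode NFKC, which is a superset of what the slide describes for CJKWidthFilter, so it may change characters that the filter would leave alone. The function name `normalise_width` is hypothetical.

```python
import unicodedata


def normalise_width(text: str) -> str:
    """Fold width variants using full NFKC (broader than a width-only filter)."""
    return unicodedata.normalize("NFKC", text)


# Full-width Latin letters, half-width katakana, full-width digits.
mixed = "Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３"
print(normalise_width(mixed))
```

Full-width Latin letters and digits fold to their ASCII forms, while half-width katakana fold to the regular full-width katakana, so each script ends up in its conventional width.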
  • 90. Katakana end-vowel stemming
      • A common spelling variation in katakana is an end long-vowel sound
          • manager: マネージャー, マネージャ, マネジャー
      • We use JapaneseKatakanaStemFilter to normalise/stem the end vowel of long terms
          • Input text: コピー (copy), マネージャー (manager), マネージャ (manager), マネジャー ("manager")
          • After JapaneseKatakanaStemFilter: コピー, マネージャ, マネージャ, マネジャ
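The stemming rule shown on this slide can be sketched in a few lines of Python. This is a sketch, not the filter's actual code: the function names are hypothetical, and the minimum length of four characters is an assumption chosen so that the short term コピー is left untouched, as in the slide's example.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # the katakana long-vowel mark "ー"
MINIMUM_LENGTH = 4               # assumed threshold for a "long" term


def is_katakana(term: str) -> bool:
    """True if every character is in the Unicode Katakana block."""
    return all("\u30a0" <= ch <= "\u30ff" for ch in term)


def stem_katakana(term: str, minimum_length: int = MINIMUM_LENGTH) -> str:
    """Drop one trailing long-vowel mark from sufficiently long katakana terms."""
    if (
        len(term) >= minimum_length
        and is_katakana(term)
        and term.endswith(PROLONGED_SOUND_MARK)
    ):
        return term[:-1]
    return term


for word in ["コピー", "マネージャー", "マネージャ", "マネジャー"]:
    print(word, "->", stem_katakana(word))
```

The length threshold protects short words, where the final long vowel often distinguishes one word from another. The rule only touches the end of the term, so マネジャ and マネージャ still remain distinct after stemming.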
  • 94. Lemmatisation
      • Japanese adjectives and verbs are highly inflected, how do we deal with that?
          • Dictionary form: 買う (kau, to buy)
          • Inflected forms (not exhaustive): 買いなさい 買いませんでしたら 買える 買わせられる 買いなさるな 買いましたら 買いませんでしたり 買いませんなら 買おう 買った 買わせる 買わない 買いましたり 買うだろう 買ったら 買わないだろう 買いまして 買いましょう 買うでしょう 買うな 買ったり 買って 買わないで 買わないでしょう 買わせない 買います 買うまい 買わなかった 買いますまい 買え 買わせます 買わなかったら 買いませば 買えない 買わせません 買わなかったり 買いません 買えば 買わせられない 買わなければ 買いませんで 買えます 買わせられます 買われない 買いませんでした 買えません 買わせられません 買われます
      • Use JapaneseBaseFormFilter to normalise inflected adjectives and verbs to dictionary form (lemmatisation by reduction)
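Lemmatisation by reduction can be pictured as replacing each token's surface text with the base form that the tokenizer has already attached to it. The Python sketch below assumes exactly that: the `Token` type and `base_form_filter` function are hypothetical, and the base forms on the sample tokens are written by hand instead of coming from a dictionary lookup.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Token(NamedTuple):
    surface: str              # inflected text as found in the input
    base_form: Optional[str]  # dictionary form, or None if not inflected


def base_form_filter(tokens: Iterable[Token]) -> Iterator[str]:
    """Emit the base form when one is attached, otherwise the surface text."""
    for token in tokens:
        yield token.base_form if token.base_form else token.surface


# Hand-annotated sample (assumed attributes, not tokenizer output).
sample = [Token("ビール", None), Token("買っ", "買う")]
terms = list(base_form_filter(sample))
print(terms)
```

Since the filter only reads an attribute that is already present, it needs no stemming rules of its own; all of the linguistic knowledge stays in the tokenizer's dictionary.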
  • 95. User dictionaries
      • Your own dictionaries can be used for ad hoc segmentation, i.e. to override the default model
      • The file format is simple and there's no need to assign weights, etc. before using them
      • Example custom dictionary:
          # Custom segmentation and POS entry for long entries
          関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
          # Custom reading and POS for former sumo wrestler Asashoryu
          朝青龍,朝青龍,アサショウリュウ,カスタム人名
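To make the four-column layout of the example dictionary concrete, here is a small Python sketch that parses it. The field names (`surface`, `segments`, `readings`, `part_of_speech`) are my own labels inferred from the example and its comments, not names taken from Lucene, and the parser is deliberately simple: it does not handle quoting or escaped commas.

```python
from typing import List, NamedTuple


class UserEntry(NamedTuple):
    surface: str          # text to match in the input
    segments: List[str]   # how the surface should be segmented
    readings: List[str]   # one reading per segment
    part_of_speech: str   # custom part-of-speech tag


def parse_user_dictionary(text: str) -> List[UserEntry]:
    """Parse comma-separated entries, skipping blank lines and # comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        surface, segmentation, readings, pos = line.split(",")
        segments = segmentation.split()
        reading_list = readings.split()
        if len(segments) != len(reading_list):
            raise ValueError(f"segment/reading count mismatch: {line}")
        entries.append(UserEntry(surface, segments, reading_list, pos))
    return entries


SAMPLE = """\
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
"""

entries = parse_user_dictionary(SAMPLE)
for entry in entries:
    print(entry.surface, entry.segments, entry.readings, entry.part_of_speech)
```

The second entry shows the other use of the format: the segmentation column repeats the surface form unchanged, so the entry only supplies a reading and a tag.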
  • 96. Japanese focus in 4.0
      • Improvements in JapaneseTokenizer
          • Improved search mode for katakana compounds
          • Improved unknown word segmentation
          • Some performance improvements
      • CharFilters for various character normalisations
          • Dates and numbers
          • Repetition marks (odoriji)
      • Japanese spell-checker
          • Robert and Koji almost got this into 3.6, but it was postponed because it required API changes
  • 97. Acknowledgements
      • Robert Muir: thanks for the heavy lifting integrating Kuromoji into Lucene, for always reviewing my patches quickly, and for the friendly help
      • Michael McCandless: thanks for streaming Viterbi and synonym compounds!
      • Uwe Schindler: thanks for performance improvements and for being the policeman
      • Simon Willnauer: thanks for doing the Kuromoji code donation process so well
      • Gaute Lambertsen & Gerry Hocks: thanks for presentation feedback and being great colleagues
  • 98. Q&A