Japanese Linguistics in Lucene and Solr

Japanese linguistics
in Apache Lucene™ and Apache Solr™

May 9th, 2012

Christian Moen
christian@atilika.com

About me
• MSc. in computer science, University of Oslo, Norway
• Worked with search at FAST (now Microsoft) for 10 years
• 5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway
• 5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan
• Founded アティリカ株式会社 in 2009
• We help companies innovate using search technologies and good ideas
• We know information retrieval, natural language processing and big data
• We are based in Tokyo, but we have clients everywhere
• Newbie Lucene & Solr Committer
• Mostly been working on Japanese language support (Kuromoji) so far
• Please write to me at christian@atilika.com or cm@apache.org

Today’s topics

• Japanese 101 - ordering beer and toasting

• Japanese language processing

• Japanese features in Lucene/Solr

ビールください
bi-ru kudasai

A beer, please

ありがとうございます！
arigatō gozaimasu!

Thank you very much!

ＪＲ新宿駅の近くにビールを飲みに行こうか？
JR Shinjuku eki no chikaku ni bi-ru o nomi ni ikō ka?

Shall we go for a beer near JR Shinjuku station?

Romaji - ローマ字
・Latin characters (26+)
・Used for proper nouns, etc.


Katakana - カタカナ
・Phonetic script (~50)
・Typically used for loan words



Kanji - 漢字
・Chinese characters (50,000+)
・Used for stems & proper nouns


Hiragana - ひらがな
・Phonetic script (~50)
・Used for inflections & particles

? What are the words in this sentence?
! Words are implicit in Japanese - there is no white space that separates them

? How do we index this for search, then?
! We need to segment text into tokens first

! Two major approaches for segmentation

1. n-gramming
2. morphological analysis (statistical approach)

n-gramming (n=2)
ＪＲ新宿駅の近くにビールを飲みに行こうか？
Shall we go for a beer near JR Shinjuku station?

ＪＲ Ｒ新 新宿 宿駅 駅の の近 近く ...
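
The sliding window above can be sketched in a few lines of Python. This is only an illustration of character n-gramming in general, not the code Lucene uses; the function name `char_ngrams` is made up for this example.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return all overlapping character n-grams of `text`, left to right."""
    if n <= 0:
        raise ValueError("n must be positive")
    # One n-gram starts at every index that still leaves n characters.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# The first part of the example sentence, as on the slide.
print(char_ngrams("ＪＲ新宿駅の近く", n=2))
```

Note that every character except the first and last ends up in two terms, which is where the extra index size comes from.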

Problems with n-gramming
ＪＲ ●
Ｒ新 ×
新宿 ●
宿駅 × change of semantics! (宿駅 means ‘post town’, ‘relay station’ or ‘stage’)
駅の ×
の近 ×
近く ●

• Does not preserve meaning well and often changes semantics
• Impacts on ranking - search precision (many false positives)
• Also generates many terms per document or query
• Impacts on index size and performance
• Still sometimes appropriate for certain search applications
• Compliance, e-commerce with special product names, ...

Morphological analysis
• Tokens reflect what a Japanese speaker considers as words
• Machine-learned statistical approach
• Conditional Random Fields (CRFs) decoded using Viterbi
• Also does part-of-speech tagging, readings for kanji, etc.
• Several statistical models available with high accuracy (F > 0.97)
• Models/dictionaries are available as IPADIC, UniDic, ...
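
To give a feel for the decoding step, here is a minimal sketch of a minimum-cost path search over a word lattice, which is the dynamic programming idea behind Viterbi decoding. It is heavily simplified: the lexicon and all costs are invented for this example, there are no part-of-speech connection costs, and nothing here is learned from data as a real CRF model would be.

```python
import math

# Hypothetical toy lexicon: surface form -> cost (lower is more likely).
# Real models such as IPADIC learn word and connection costs from data.
LEXICON = {
    "ＪＲ": 2.0, "新宿": 2.0, "新": 4.0, "宿": 5.0, "宿駅": 6.0,
    "駅": 2.0, "の": 1.0, "近く": 2.0, "近": 4.0, "く": 4.0,
}
UNKNOWN_CHAR_COST = 10.0  # fallback cost for a single unknown character


def segment(text: str) -> list[str]:
    """Return the minimum-cost segmentation of `text` under LEXICON."""
    n = len(text)
    best_cost = [math.inf] * (n + 1)  # best_cost[i]: cheapest path ending at i
    back = [0] * (n + 1)              # back[i]: start index of the last token
    best_cost[0] = 0.0
    max_len = max(len(word) for word in LEXICON)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in LEXICON:
                cost = LEXICON[word]
            elif end - start == 1:
                cost = UNKNOWN_CHAR_COST
            else:
                continue  # multi-character strings must be in the lexicon
            if best_cost[start] + cost < best_cost[end]:
                best_cost[end] = best_cost[start] + cost
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens, pos = [], n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1]


print(segment("ＪＲ新宿駅の近く"))
```

With these made-up costs the cheapest path picks 新宿 and 駅 rather than 新 and 宿駅, because the whole path is scored and not each token in isolation.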

Japanese support in
Lucene and Solr

Japanese in Lucene/Solr
! New feature in Lucene/Solr 3.6
! Available out-of-the-box
! Easy to use with reasonable defaults
! Provides sophisticated Japanese linguistics
! Customisable

How do we use it?
! Use JapaneseAnalyzer
! Use field type “text_ja” in example schema.xml

Feature summary / text_ja analyzer chain
JapaneseTokenizer
  Segments Japanese text into tokens with very high accuracy
  • Token attributes for part-of-speech, base form, readings, etc.
  • Compound segmentation with compound synonyms
  • Segmentation is customisable using user dictionaries
JapaneseBaseFormFilter
  Adjective and verb lemmatisation (by reduction)
JapanesePartOfSpeechStopFilter
  Stop-words removal based on part-of-speech tags
  See example/solr/conf/lang/stoptags_ja.txt
CJKWidthFilter
  Character width normalisation (fast Unicode NFKC subset)
StopFilter
  Stop-words removal
  See example/solr/conf/lang/stopwords_ja.txt
JapaneseKatakanaStemFilter
  Normalises common katakana spelling variations
LowerCaseFilter
  Lowercases

Compound nouns
? How do we deal with compound nouns?
Japanese                          English
関西国際空港                      Kansai International Airport
シニアソフトウェアエンジニア      Senior Software Engineer

! These are one word in Japanese, so searching for 空港 (airport) doesn’t match
! We need to segment the compounds, too

Compound segmentation
関西国際空港 → 関西 国際 空港
Kansai International Airport → Kansai / International / Airport
シニアソフトウェアエンジニア → シニア ソフトウェア エンジニア
Senior Software Engineer → Senior / Software / Engineer

! We are using a heuristic to implement this

Compound synonym tokens
Position 1    Position 2    Position 3
関西          国際          空港
関西国際空港 (spans positions 1 to 3)

• Segment the compounds into its parts
• Good for recall - we can also search and match 空港 (airport)
• We keep the compound itself as a synonym
• Good for precision with an exact hit because of IDF
• Approach benefits both precision and recall for overall good ranking
• JapaneseTokenizer actually returns a graph of tokens
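
One way to picture the token graph is as a list of tokens that each carry a start position and a span length. The sketch below is a hypothetical data model written for this illustration (the names `Token` and `with_compound_synonym` are not Lucene APIs); it only shows how the parts and the full compound can coexist in one stream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    term: str
    position: int         # 1-based start position, as in the table above
    position_length: int  # how many positions the token spans


def with_compound_synonym(compound: str, parts: list[str]) -> list[Token]:
    """Emit each part at its own position, plus the whole compound as a
    synonym that starts at position 1 and spans all parts."""
    if "".join(parts) != compound:
        raise ValueError("parts must concatenate to the compound")
    tokens = [Token(part, i, 1) for i, part in enumerate(parts, start=1)]
    # Place the compound right after the first part so tokens stay
    # ordered by start position.
    tokens.insert(1, Token(compound, 1, len(parts)))
    return tokens


for token in with_compound_synonym("関西国際空港", ["関西", "国際", "空港"]):
    print(token)
```

A query for 空港 matches the part at position 3, while a query for the full compound matches the rarer synonym term, which is what lets IDF reward the exact hit.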

Character width normalisation
? How do we deal with character widths?
Half-width・半角    Full-width・全角
Lucene              Ｌｕｃｅｎｅ
ｶﾀｶﾅ                カタカナ
123                 １２３

! Use CJKWidthFilter to normalise them (Unicode NFKC subset)

Input text        Ｌｕｃｅｎｅ    ｶﾀｶﾅ          １２３
CJKWidthFilter    Lucene          カタカナ      123
                  half-width      full-width    half-width
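
The effect can be approximated in Python with full Unicode NFKC normalisation from the standard library. This is an approximation only: NFKC covers more than the width-folding subset the slide attributes to CJKWidthFilter, so it may change characters that the filter would leave alone.

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Fold full-width Latin letters and digits to half-width and
    half-width katakana to full-width, using full NFKC normalisation."""
    return unicodedata.normalize("NFKC", text)


print(normalize_width("Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３"))
```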

Katakana end-vowel stemming
? A common spelling variation in katakana is an end long-vowel sound
English    Japanese spelling variations
manager    マネージャー  マネージャ  マネジャー

! We use JapaneseKatakanaStemFilter to normalise/stem end-vowel for long terms

Input text                    コピー  マネージャー  マネージャ  マネジャー
JapaneseKatakanaStemFilter    コピー  マネージャ    マネージャ  マネジャ
                              copy    manager       manager     “manager”
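
The rule can be sketched as: if a term is all katakana, is long enough, and ends in the prolonged sound mark ー, drop that final mark. The minimum length of 4 used below is an assumption chosen so that the short word コピー stays untouched, as in the example; check the filter's documentation for its actual default.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー
MIN_LENGTH = 4  # assumed threshold so short words like コピー are kept


def is_katakana(term: str) -> bool:
    """True if every character is in the Katakana block (U+30A0..U+30FF)."""
    return bool(term) and all("\u30a0" <= ch <= "\u30ff" for ch in term)


def stem_katakana(term: str) -> str:
    """Drop one trailing ー from katakana terms of MIN_LENGTH or more."""
    if (len(term) >= MIN_LENGTH and is_katakana(term)
            and term.endswith(PROLONGED_SOUND_MARK)):
        return term[:-1]
    return term


for word in ["コピー", "マネージャー", "マネージャ", "マネジャー"]:
    print(word, "->", stem_katakana(word))
```

Only the final long vowel is normalised, so マネージャ and マネジャ remain distinct terms after stemming.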

Lemmatisation
? Japanese adjectives and verbs are highly inflected, how do we deal with that?

Dictionary form
買う (kau, to buy)

Inflected forms (not exhaustive)
買いなさい 買いなさるな 買いましたら 買いましたり 買いまして 買いましょう
買います 買いますまい 買いませば 買いません 買いませんで 買いませんでした
買いませんでしたら 買いませんでしたり 買いませんなら 買うだろう 買うでしょう 買うな
買うまい 買え 買えない 買えば 買えます 買えません
買える 買おう 買った 買ったら 買ったり 買って
買わせます 買わせません 買わせられない 買わせられます 買わせられません 買わせられる
買わせる 買わせない 買わない 買わないだろう 買わないで 買わないでしょう
買わなかった 買わなかったら 買わなかったり 買わなければ 買われない 買われます

! Use JapaneseBaseFormFilter to normalise inflected adjectives and verbs to dictionary form (lemmatisation by reduction)
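
Conceptually, the filter swaps each token's surface form for the base form that the tokenizer already looked up in its dictionary. The sketch below uses a made-up `AnalyzedToken` type and a hand-written analysis of 買わなかった ("did not buy"); the segmentation and base forms shown are assumptions for illustration, not output captured from Kuromoji.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnalyzedToken:
    surface: str
    base_form: Optional[str] = None  # None: surface is already the base form


def to_base_forms(tokens: list[AnalyzedToken]) -> list[str]:
    """Replace each token's surface with its base form when one is known."""
    return [t.base_form or t.surface for t in tokens]


# Hand-written analysis; a real tokenizer would supply these base forms.
tokens = [
    AnalyzedToken("買わ", "買う"),    # inflected verb stem -> dictionary form
    AnalyzedToken("なかっ", "ない"),  # inflected negative auxiliary
    AnalyzedToken("た"),              # past-tense auxiliary, unchanged
]
print(to_base_forms(tokens))
```

Because the base form comes from the dictionary entry, no suffix-stripping rules are needed, which is presumably what "by reduction" refers to.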

User dictionaries
• Own dictionaries can be used for ad hoc segmentation, i.e. to override the default model
• File format is simple and there’s no need to assign weights, etc. before using them
• Example custom dictionary:
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞

# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
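
To make the file format concrete, here is a small parser sketch. It assumes four comma-separated fields per entry (surface text, segmentation, readings, part-of-speech) with the segmentation and readings split on spaces, and lines starting with `#` treated as comments. The `UserEntry` type and `parse_user_dictionary` function are invented for this example and are not part of Lucene.

```python
import csv
from typing import NamedTuple


class UserEntry(NamedTuple):
    surface: str
    segments: list[str]
    readings: list[str]
    pos: str


def parse_user_dictionary(text: str) -> list[UserEntry]:
    """Parse entries, skipping blank lines and '#' comments."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    entries = []
    for row in csv.reader(lines):
        if len(row) != 4:
            raise ValueError(f"expected 4 fields, got {len(row)}: {row}")
        surface, segments, readings, pos = (field.strip() for field in row)
        # Segmentation and readings are assumed to be space-separated.
        entries.append(UserEntry(surface, segments.split(), readings.split(), pos))
    return entries


SAMPLE = """\
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞

# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
"""

for entry in parse_user_dictionary(SAMPLE):
    print(entry)
```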

Japanese focus in 4.0
• Improvements in JapaneseTokenizer
• Improved search mode for katakana compounds
• Improved unknown word segmentation
• Some performance improvements
• CharFilters for various character normalisations
• Dates and numbers
• Repetition marks (odoriji)
• Japanese spell-checker
• Robert and Koji almost got this into 3.6, but it got postponed because API changes were necessary

Acknowledgements
Robert Muir
Thanks for the heavy lifting integrating Kuromoji into Lucene, for always reviewing my patches quickly, and for the friendly help
Michael McCandless
Thanks for streaming Viterbi and synonym compounds!
Uwe Schindler
Thanks for performance improvements + being the policeman
Simon Willnauer
Thanks for doing the Kuromoji code donation process so well
Gaute Lambertsen & Gerry Hocks
Thanks for presentation feedback and being great colleagues

ありがとうございました！
arigatō gozaimashita!

Thank you very much!
