Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.

An Introduction to NLP4L

2.316 Aufrufe

Veröffentlicht am

NLP4L slides for Lucene/Solr meetup

Veröffentlicht in: Technologie
  • DOWNLOAD THAT BOOKS INTO AVAILABLE FORMAT (2019 Update) ......................................................................................................................... ......................................................................................................................... Download Full PDF EBOOK here { http://bit.ly/2m6jJ5M } ......................................................................................................................... Download Full EPUB Ebook here { http://bit.ly/2m6jJ5M } ......................................................................................................................... Download Full doc Ebook here { http://bit.ly/2m6jJ5M } ......................................................................................................................... Download PDF EBOOK here { http://bit.ly/2m6jJ5M } ......................................................................................................................... Download EPUB Ebook here { http://bit.ly/2m6jJ5M } ......................................................................................................................... Download doc Ebook here { http://bit.ly/2m6jJ5M } ......................................................................................................................... ......................................................................................................................... ................................................................................................................................... eBook is an electronic version of a traditional print book that can be read by using a personal computer or by using an eBook reader. (An eBook reader can be a software application for use on a computer such as Microsoft's free Reader application, or a book-sized computer that is used solely as a reading device such as Nuvomedia's Rocket eBook.) Users can purchase an eBook on diskette or CD, but the most popular method of getting an eBook is to purchase a downloadable file of the eBook (or other reading material) from a Web site (such as Barnes and Noble) to be read from the user's computer or reading device. Generally, an eBook can be downloaded in five minutes or less ......................................................................................................................... .............. Browse by Genre Available eBooks .............................................................................................................................. Art, Biography, Business, Chick Lit, Children's, Christian, Classics, Comics, Contemporary, Cookbooks, Manga, Memoir, Music, Mystery, Non Fiction, Paranormal, Philosophy, Poetry, Psychology, Religion, Romance, Science, Science Fiction, Self Help, Suspense, Spirituality, Sports, Thriller, Travel, Young Adult, Crime, Ebooks, Fantasy, Fiction, Graphic Novels, Historical Fiction, History, Horror, Humor And Comedy, ......................................................................................................................... ......................................................................................................................... .....BEST SELLER FOR EBOOK RECOMMEND............................................................. ......................................................................................................................... Blowout: Corrupted Democracy, Rogue State Russia, and the Richest, Most Destructive Industry on Earth,-- The Ride of a Lifetime: Lessons Learned from 15 Years as CEO of the Walt Disney Company,-- Call Sign Chaos: Learning to Lead,-- StrengthsFinder 2.0,-- Stillness Is the Key,-- She Said: Breaking the Sexual Harassment Story That Helped Ignite a Movement,-- Atomic Habits: An Easy & Proven Way to Build Good Habits & Break Bad Ones,-- Everything Is Figureoutable,-- What It Takes: Lessons in the Pursuit of Excellence,-- Rich Dad Poor Dad: What the Rich Teach Their Kids About Money That the Poor and Middle Class Do Not!,-- The Total Money Makeover: Classic Edition: A Proven Plan for Financial Fitness,-- Shut Up and Listen!: Hard Business Truths that Will Help You Succeed, ......................................................................................................................... .........................................................................................................................
       Antworten 
    Sind Sie sicher, dass Sie …  Ja  Nein
    Ihre Nachricht erscheint hier

An Introduction to NLP4L

  1. 1. An Introduction to NLP4L Natural Language Processing tool for Apache Lucene Koji Sekiguchi @kojisays Founder & CEO, RONDHUIT
  2. 2. My contributions • CharFilter framework & MappingCharFilter • FastVectorHighlighter 2
  3. 3. Agenda • What s NLP4L? • How NLP improves search experience • Calculate probabilities using ShingleFilter • Transliteration (Application for HMM) • NLP4L Framework (coming soon) 3
  4. 4. Agenda • What s NLP4L? • How NLP improves search experience • Calculate probabilities using ShingleFilter • Transliteration (Application for HMM) • NLP4L Framework (coming soon) 4
  5. 5. What s NLP4L? 5
  6. 6. What s NLP4L? • GOAL • Improve Lucene users search experience • FEATURES • Use of Lucene index as a Corpus Database • Lucene API Front-end written in Scala • NLP4L provides • Preprocessors for existing ML tools • Provision of ML algorithms and Applications (e.g. Transliteration) 6
  7. 7. What s NLP4L? • GOAL • Improve Lucene users search experience • FEATURES • Use of Lucene index as a Corpus Database • Lucene API Front-end written in Scala • NLP4L provides • Preprocessors for existing ML tools • Provision of ML algorithms and Applications (e.g. Transliteration) 7
  8. 8. What s NLP4L? • GOAL • Improve Lucene users search experience • FEATURES • Use of Lucene index as a Corpus Database • Lucene API Front-end written in Scala • NLP4L provides • Preprocessors for existing ML tools • Provision of ML algorithms and Applications (e.g. Transliteration) 8
  9. 9. Agenda • What s NLP4L? • How NLP improves search experience • Calculate probabilities using ShingleFilter • Transliteration (Application for HMM) • NLP4L Framework (coming soon) 9
  10. 10. Evaluation Measures targetresult tpfp fn tn precision = tp / (tp + fp) recall = tp / (tp + fn) 10
  11. 11. Recall ,Precision tpfp fn tn precision = tp / (tp + fp) recall = tp / (tp + fn) 11
  12. 12. targetresult tpfp fn tn precision = tp / (tp + fp) recall = tp / (tp + fn) Recall ,Precision 12
  13. 13. Solution n-gram, synonym dictionary, etc. facet (filter query) Ranking Tuning recall precision recall , precision 13
  14. 14. Solution n-gram, synonym dictionary, etc. facet (filter query) Ranking Tuning recall precision recall , precision 14
  15. 15. Solution n-gram, synonym dictionary, etc. e.g. Transliteration facet (filter query) recall precision recall , precision Ranking Tuning 15
  16. 16. Solution n-gram, synonym dictionary, etc. e.g. Transliteration facet (filter query) e.g. Named Entity Extraction recall precision recall , precision Ranking Tuning 16
  17. 17. gradual precision improvement q=watch targetresult 17
  18. 18. filter by Gender=Men s targetresult gradual precision improvement 18
  19. 19. targetresult filter by Gender=Men s filter by Price=100-150 gradual precision improvement 19
  20. 20. Structured Documents ID product price gender 1 CURREN New Men s Date Stainless Steel Military Sport Quartz Wrist Watch 8.92 Men s 2 Suiksilver The Gamer Watch 87.99 Men s 20
  21. 21. Unstructured Documents ID article 1 David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. 2 He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. 21
  22. 22. Make them Structured I D article person org loc 1 David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. David Cameron EU Bruss els 2 He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. EU UK Britai n NEE[1] extracts interesting words. [1] Named Entity Extraction 22
  23. 23. Manual Tagging using brat 23
  24. 24. Agenda • What s NLP4L? • How NLP improves search experience • Calculate probabilities using ShingleFilter • Transliteration (Application for HMM) • NLP4L Framework (coming soon) 24
  25. 25. Language Model • LM represents the fluency of language • N-gram model is the LM which is most widely used • Calculation example for 2-gram totalTermFreq(”word2g”,”an apple”) totalTermFreq(”word”,”an”) 25
  26. 26. Alice/NNP ate/VB an/AT apple/NNP ./. Mike/NNP likes/VB an/AT orange/NNP ./. An/AT apple/NNP is/VB red/JJ ./. NNP Proper noun, singular VB Verb AT Article JJ Adjective . period Part-of-Speech Tagging Our Corpus for training 26
  27. 27. Hidden Markov Model 27
  28. 28. Hidden Markov Model Series of Words 28
  29. 29. Hidden Markov Model Series of Part-of-Speech 29
  30. 30. Hidden Markov Model 30
  31. 31. Hidden Markov Model 31
  32. 32. HMM state diagram NNP 0.667 VB 0.0 . 0.0 JJ 0.0 AT 0.333 1.0 1.0 0.4 0.6 0.6670.333 alice 0.2 apple 0.4 mike 0.2 orange 0.2 ate 0.333 is 0.333 likes 0.333 an 1.0 red 1.0 . 1.0 32
  33. 33. Agenda • What s NLP4L? • How NLP improves search experience • Calculate probabilities using ShingleFilter • Transliteration (Application for HMM) • NLP4L Framework (coming soon) 33
  34. 34. Transliteration Transliteration is a process of transcribing letters or words from one alphabet to another one to facilitate comprehension and pronunciation for non-native speakers. computer コンピューター server サーバー internet インターネット mouse マウス information インフォメーション examples of transliteration from English to Japanese 34
  35. 35. It helps improve recall you search English mouse 35
  36. 36. It helps improve recall but you got マウス (=mouse) highlighted in Japanese 36
  37. 37. Training data in NLP4L アaカcaデdeミーmy アaクcセceンnトt アaクcセceスss アaクcシciデdeンnトt アaクcロroバッbaトt アaクcショtioンn アaダdaプpターter アaフfリriカca エaアirバbuスs アaラlaスsカka アaルlコーcohoルl アaレlleルrギーgy train_data/alpha_katakana.txt train_data/alpha_katakana_aligned.txt academy,アカデミー accent,アクセント access,アクセス accident,アクシデント acrobat,アクロバット action,アクション adapter,アダプター africa,アフリカ airbus,エアバス alaska,アラスカ alcohol,アルコール allergy,アレルギー 37
  38. 38. Demo: Transliteration Input Prediction Right Answer アルゴリズム algorism algorithm プログラム program (OK) ケミカル chaemmical chemical ダイニング dining (OK) コミッター committer (OK) エントリー entree entry nlp4l> :load examples/trans_katakana_alpha.scala 38
  39. 39. Gathering loan words ① crawl gathering Katakana-Alphabet string pairs アルゴリズム, algorithm Transliteration アルゴリズム algorism calculate edit distance synonyms.txt store pair of strings if edit distance is small enough ② ③ ④ ⑤ ⑥ 39
  40. 40. Gathering loan words ① crawl gathering Katakana-Alphabet string pairs アルゴリズム, algorithm Transliteration アルゴリズム algorism calculate edit distance synonyms.txt store pair of strings if edit distance is small enough ② ③ ④ ⑤ ⑥ Got 1,800+ records of synonym knowledge from jawiki 40
  41. 41. Agenda • What s NLP4L? • How NLP improves search experience • Calculate probabilities using ShingleFilter • Transliteration (Application for HMM) • NLP4L Framework (coming soon) 41
  42. 42. NLP4L Framework • A framework that improves search experience (for mainly Lucene-based search system). Pluggable. • Reference implementation of plug-ins and corpora provided. • Uses NLP/ML technologies to output models, dictionaries and indexes. • Since NLP/ML are not perfect, an interface that enables users to personally examine output dictionaries is provided as well. 42
  43. 43. NLP4L Framework • A framework that improves search experience (for mainly Lucene-based search system). Pluggable. • Reference implementation of plug-ins and corpora provided. • Uses NLP/ML technologies to output models, dictionaries and indexes. • Since NLP/ML are not perfect, an interface that enables users to personally examine output dictionaries is provided as well. 43
  44. 44. NLP4L Framework • A framework that improves search experience (for mainly Lucene-based search system). Pluggable. • Reference implementation of plug-ins and corpora provided. • Uses NLP/ML technologies to output models, dictionaries and indexes. • Since NLP/ML are not perfect, an interface that enables users to personally examine output dictionaries is provided as well. 44
  45. 45. NLP4L Framework • A framework that improves search experience (for mainly Lucene-based search system). Pluggable. • Reference implementation of plug-ins and corpora provided. • Uses NLP/ML technologies to output models, dictionaries and indexes. • Since NLP/ML are not perfect, an interface that enables users to personally examine output dictionaries is provided as well. 45
  46. 46. Solr ES Mahout Spark Data Source ・Corpus (Text data, Lucene index) ・Query Log ・Access Log Dictionaries ・Suggestion (auto complete) ・Did you mean? ・synonyms.txt ・userdic.txt ・keyword attachment maintenance Model files Tagged Corpus Document Vectors ・TermExtractor ・Transliteration ・NEE ・Classification ・Document Vectors ・Language Detection ・Learning to Rank ・Personalized Search 46
  47. 47. Keyword Attachment • Keyword attachment is a general format that enables the following functions. • Learning to Rank • Personalized Search • Named Entity Extraction • Document Classification Lucene doc Lucene doc keyword ↑ Increase boost 47
  48. 48. Before Learning to Rank targetresult 1 2 3 … 50 100 500 … 48
  49. 49. After Learning to Rank targetresult 1 2 3 … 50 100 500 … 49
  50. 50. Learning to Rank • Program learns, from access log and other sources, that the score of document d for a query q should be larger than the normal score(q,d) Lucene doc d q, q, … https://en.wikipedia.org/wiki/Learning_to_rank 50
  51. 51. Personalized Search targetresult 1 2 3 … 50 100 500 … q=apple computer … 51
  52. 52. Personalized Search target result 50 100 500 … 1 2 3 … q=applefruit … 52
  53. 53. Personalized Search • Program learns, from access log and other sources, that the score of document d for a query q by user u should be larger than the normal score(q,d) • Since you cannot specify score(q,d,u) as Lucene restricts doing so, you have to specify score(qu,d). • Limit the data to high-order queries or divide fields depending on a user as the number of q-u combinations can be enormous. Lucene doc d1 q1u1, q2u2 Lucene doc d2 q2u1, q1u2 53
  54. 54. Join and Code with Us! Contact us at koji at apache dot org for the details. 54
  55. 55. Demo or Q & A Thank you! 55

×