TechTalk #13 Grokking: Marrying Elasticsearch with NLP to solve real-world search problems

5,656 views


Marrying Elasticsearch with NLP to solve real-world search problems - Phu Le (Knorex)

Published in: Technology

  1. © 2016 Knorex. Marrying Elasticsearch with NLP to solve real-world search problems. Phu Le, Knorex @ Grokking TechTalk, 25 June 2016. Web: http://knorex.com Email: info@knorex.com
  2. Knorex Lumina Web Services™
  3. Knorex Lumina Web Services™
  4. Knorex Lumina Web Services™
  5. Knorex Lumina Web Services™
  6. Outline
      1. Architecture
      2. Ingredients
          • Data gathering
          • Content extraction
          • Preprocessing
          • Modelling: terms -> phrases, entities -> documents
      3. Elasticsearch
          • Basic analysis, faceting and filtering
          • Do you mean
          • Percolator
          • Recommendation
          • Deduplication
      4. Summary
  7. Architecture
  8. Ingredients
      1. Data gathering
          • Deep crawler
          • Lazy crawler
          • Visual scraper
          • Social media adapters
      2. Content extraction
          • Take a news article as an example
          • Title
          • Content
          • Published date
          • Author
          • Image
          • …
  9. Content extraction
  10. Content extraction
  11. Ingredients
      3. Preprocessing
          • Sentence splitting, tokenization
          • Stemming vs lemmatizing
              • Stemming: cries, crying, cried => cri
              • Lemmatizing: dogs => dog; is, are => be
  12. Ingredients
      4. Modelling
          • Goal: synthesize words and tokens into larger units and attach meaning to them
          • Key phrase extraction
          • Named entity recognition
              • Basic building block of knowledge
              • Basis for computing relatedness and extracting relations
          • Sentiment analysis
              • Social media snippet
              • General article, or towards concepts / named entities
              • Emotion
          • Document classification
              • Group search results into faceted categories
              • Recommend related articles by category
  13. Terms
  14. Phrases
  15. Entities
  16. Document classification
  17. Elasticsearch
      • First released Feb 2010, among the fastest-growing open-source projects, total funding $104M (3 rounds)
      • Based on Apache Lucene (same as Solr)
      • Written in Java, supports an HTTP interface and schema-free JSON documents (yay, no XML!)
      • Designed to be scalable, distributed in nature
  18. Analysis
      • "analyzer": "standard"
      • "analyzer": "whitespace"
      • "analyzer": "keyword"
  19. Analysis
      • Default analyzer as-is. Tokens shown: ["https", "www.facebook.com", "events", "194454270949757"]
      • Figure labels: "Successful!" and "No hits! WTH… it is not working!!!!"
      • url => not_analyzed / keyword analyzer
      • Use a match query instead of a term filter / term query: field analyzer awareness
      • Custom analyzer: e.g. keyword tokenizer + lowercase filter
  20. Analysis
      • Diagram labels: Search, Search analyzer, Elasticsearch index, Index analyzer, Index
      • Carefully decide which fields will be searched frequently
      • Determine which analyzer to use for each field (experimentally, based on application needs)
      • The search analyzer and the index analyzer might be different for the same field
      • Use a match query instead of a term filter / term query: field analyzer awareness
      • Exploit multi-fields
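
The "No hits" surprise on the Analysis slides can be reproduced without a cluster. The sketch below is a toy inverted index in plain Python, not Elasticsearch itself: `standard_like` is only a rough approximation of a standard-style analyzer (an assumption made so that it yields the tokens shown on the slide), and `term_query` / `match_query` are hypothetical helper names that mimic the difference between an unanalyzed lookup and an analyzer-aware one.

```python
import re


def standard_like(text):
    """Rough stand-in for a standard-style analyzer (assumption, not the real one):
    lowercase, keep runs of letters/digits, allow dots inside a run."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*", text)]


def keyword_lowercase(text):
    """Keyword tokenizer + lowercase filter: the whole input becomes one token."""
    return [text.lower()]


def build_index(docs, analyzer):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def term_query(index, value):
    """Term-style lookup: the value is NOT analyzed before the lookup."""
    return index.get(value, set())


def match_query(index, value, analyzer):
    """Match-style lookup: analyze the query text, then OR the token hits."""
    return set().union(*(index.get(t, set()) for t in analyzer(value)))


url = "https://www.facebook.com/events/194454270949757"
docs = {1: url}

analyzed_index = build_index(docs, standard_like)    # default-style analysis
exact_index = build_index(docs, keyword_lowercase)   # url kept as a single token

tokens = standard_like(url)
term_on_analyzed = term_query(analyzed_index, url)                  # no token equals the full URL
match_on_analyzed = match_query(analyzed_index, url, standard_like)
term_on_exact = term_query(exact_index, url)

print(tokens, term_on_analyzed, match_on_analyzed, term_on_exact)
```

The term lookup against the analyzed field comes back empty because the index only holds the individual tokens, while the same lookup succeeds once the field is indexed as a single lowercase token.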
  21. Faceting and filtering
  22. Do you mean
      • "grok" -> "grokking", "sear" -> "search"
      • Natural approach:
          • Compute a terms aggregation (facet) across all text fields
              • title
              • description
              • content
          • Use a regex to filter matched terms, sort DESC by frequency, take the most popular terms to suggest
      • DON'T!!!
  24. Do you mean
      • Limitations
          • Single terms only. Cannot suggest phrases
          • Terms occurring frequently might not be useful
      • Improvements
          • Build another field "phrases" in the document
              • Add the entire title
              • Use key phrase extraction and named entity recognition to populate meaningful phrases
          • Custom tokenizers: keyword, edgeNGram
              • edgeNGram example: "grokking" => "gro", "grok", "grokk"
              • Query: "burs mal" => matched: "bursa malaysia"
              • Memory explosion!!!
          • Custom scoring (importance, popularity score) instead of term frequency
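
The edgeNGram bullets above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the Elasticsearch tokenizer: the gram sizes 3 to 5 are an assumption chosen to reproduce the slide's "grokking" example, and `index_phrases` / `suggest` are hypothetical helper names. Grams are generated at index time only, and the query words are looked up as-is.

```python
def edge_ngrams(token, min_gram=3, max_gram=5):
    """Return the prefixes of token whose length is between min_gram and max_gram."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def index_phrases(phrases, min_gram=3, max_gram=5):
    """Index every word of every phrase under each of its edge n-grams."""
    index = {}
    for phrase in phrases:
        for word in phrase.lower().split():
            for gram in edge_ngrams(word, min_gram, max_gram):
                index.setdefault(gram, set()).add(phrase)
    return index


def suggest(index, query):
    """Return phrases in which every query word matches some indexed gram."""
    words = query.lower().split()
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))


phrases = ["bursa malaysia", "grokking techtalk", "burst mode"]
index = index_phrases(phrases)

grams = edge_ngrams("grokking")
suggestions = suggest(index, "burs mal")
print(grams, suggestions)
print(len(index), "index keys for", len(phrases), "phrases")
```

The last line hints at the "memory explosion" caveat: every word contributes several index keys, so the gram range should stay as narrow as the use case allows.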
  25. Do you mean
      • Elasticsearch built-in suggester
      • FST example. Source: https://www.elastic.co/blog/you-complete-me
      • Features:
          • Speed & scale: FST per segment, built in real time, scales horizontally
          • Analysis: synonym, fuzzy
          • Supports custom ordering and scoring
      • Limitations: can't find a word anywhere within a phrase
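
To make the prefix-only limitation concrete, here is a small weighted prefix lookup in plain Python. It is deliberately not an FST: a sorted list with binary search stands in for it, and the class name `PrefixSuggester`, the sample phrases and their weights are all assumptions made for the illustration.

```python
import bisect


class PrefixSuggester:
    """Weighted prefix completion over a sorted list (a stand-in for an FST)."""

    def __init__(self, weighted_phrases):
        # weighted_phrases: dict mapping phrase -> weight (higher ranks first)
        self._entries = sorted((p.lower(), w) for p, w in weighted_phrases.items())
        self._keys = [p for p, _ in self._entries]

    def suggest(self, prefix, size=5):
        prefix = prefix.lower()
        # All keys starting with prefix sit in one contiguous slice of the sorted list.
        lo = bisect.bisect_left(self._keys, prefix)
        hi = bisect.bisect_right(self._keys, prefix + "\uffff")
        candidates = sorted(self._entries[lo:hi], key=lambda e: (-e[1], e[0]))
        return [phrase for phrase, _ in candidates[:size]]


suggester = PrefixSuggester({
    "search": 10,
    "search engine": 7,
    "sentiment analysis": 5,
    "elasticsearch": 9,
})

by_prefix = suggester.suggest("sear")
inside_phrase = suggester.suggest("engine")  # "engine" only occurs inside a phrase
print(by_prefix, inside_phrase)
```

A lookup for "engine" returns nothing even though "search engine" is indexed, which is the limitation named on the slide. Indexing extra inputs per phrase is one common workaround, at the cost of a larger structure.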
  26. Do you mean
      • Speed test: 1 million articles, 2.7 GB index size, on a single laptop with SSD
          • Regex terms facet: 296.5 ms
          • Terms suggester: 13 ms
      • Cautions
          • Don't add all terms/phrases to the suggestions (only meaningful ones!)
          • Don't start suggesting immediately. How many words start with "c"?
          • Don't suggest terms that yield no search results
          • Apply the same filter conditions as the current query to the term suggestion query
  27. Percolator
      • percolate: match documents against queries
  28. Percolator
      • Sample use case: segmenting articles using keywords
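
The inversion behind percolation (queries are stored, documents are the input) can be shown with a toy in plain Python. This is not the Elasticsearch percolator API: `KeywordPercolator`, the segment names and the single-word keyword lists are assumptions for the sake of the example, and a stored "query" here is just a set of keywords that matches when any keyword appears in the text.

```python
import re


class KeywordPercolator:
    """Store keyword queries up front, then match incoming documents against them."""

    def __init__(self):
        self._queries = {}

    def register(self, query_id, keywords):
        """Register a stored query: a set of single-word keywords."""
        self._queries[query_id] = {k.lower() for k in keywords}

    def percolate(self, text):
        """Return the ids of all stored queries that match the document text."""
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        return sorted(qid for qid, kws in self._queries.items() if kws & tokens)


percolator = KeywordPercolator()
percolator.register("finance", ["stock", "bursa", "ipo"])
percolator.register("sports", ["football", "league"])

segments = percolator.percolate("Bursa Malaysia opens higher as IPO demand grows")
print(segments)
```

Each new article is passed through `percolate` once, and the returned ids act as its segments.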
  29. Recommendation
      • Natural approach
          • More-like-this or fuzzy-like-this on title, content
          • Not accurate: bag-of-words approach
          • Tricky to determine the threshold. A "good value" varies across document types and domains
          • Slow. The more terms allowed in the queries, the slower it is. If cut off based on max terms, accuracy drops
      • Proposed approaches
          • Utilize NLP results (modelling step):
              • Category: recommend articles from the same categories
              • Key phrases: match and rank documents w.r.t. target documents by key phrases
              • Named entities: model with a parent/child relationship
          • Combine with the function score feature to rescore results
              • Example: applying a Gauss decay function to favor more recent results
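
The Gauss decay mentioned in the last bullet can be written out directly. The sketch below follows the formula given in the Elasticsearch function score documentation, where the score falls to `decay` at a distance of `scale` from the origin. The parameter values, the sample hits and the `rescore` helper are assumptions for illustration, and multiplying relevance by the decay is only one possible way of combining the two.

```python
import math


def gauss_decay(age_days, scale_days, decay=0.5, offset_days=0.0):
    """Gaussian decay: 1.0 at the origin, `decay` at a distance of `scale_days`."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must be between 0 and 1 (exclusive)")
    distance = max(0.0, abs(age_days) - offset_days)
    sigma_sq = -(scale_days ** 2) / (2.0 * math.log(decay))
    return math.exp(-(distance ** 2) / (2.0 * sigma_sq))


def rescore(hits, scale_days=30.0):
    """Multiply each hit's relevance by its recency decay and sort descending.

    hits: iterable of (doc_id, relevance, age_days) tuples (hypothetical shape).
    """
    scored = [(doc_id, relevance * gauss_decay(age, scale_days))
              for doc_id, relevance, age in hits]
    return sorted(scored, key=lambda h: h[1], reverse=True)


hits = [("old-article", 2.0, 90.0), ("new-article", 1.5, 1.0)]
ranked = rescore(hits)
print(ranked)
```

With a 30-day scale, the older article loses its lead despite the higher raw relevance, which is the "favor more recent results" effect the slide describes.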
  30. Recommendation
      • Sophisticated scoring and ranking can be done outside of Elasticsearch
      • Still, can tap on Elasticsearch for its faceting and filtering capability
  31. Deduplication
      • Natural approach
          • Term matching on URL, title
              • Fails if these are slightly different (very common!)
          • More-like-this or fuzzy-like-this on content, with a high matching threshold: e.g. 70%, 80%
              • Not accurate: bag-of-words approach
              • Tricky to determine the threshold. A "good value" varies across document types and domains
              • Slow. The more terms allowed in the queries, the slower it is. If cut off based on max terms, accuracy drops
      • Proposed approach
          • Semantic hashing: minhash, simhash
              • For a document, compute a hash value
              • Convert the hash value to binary string form
              • Robust and efficient, can cater to near-duplicates
          • Implement Hamming distance search using Elasticsearch fuzzy_like_this
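
A minimal simhash sketch in plain Python shows what "compute a hash value, convert to binary string, compare by Hamming distance" amounts to. This is a generic textbook-style simhash, not necessarily the variant used in the talk: the tokenization, the term-frequency weighting and the choice of `hashlib.blake2b` as the per-token hash are all assumptions. The two bit strings at the end are the 64-bit example from the next Deduplication slide.

```python
import hashlib
import re
from collections import Counter

BITS = 64


def simhash(text):
    """64-bit simhash: each token votes +weight / -weight on every bit position."""
    weights = Counter(re.findall(r"\w+", text.lower()))
    totals = [0] * BITS
    for token, weight in weights.items():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=BITS // 8).digest()
        token_hash = int.from_bytes(digest, "big")
        for i in range(BITS):
            totals[i] += weight if (token_hash >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def to_bits(fingerprint):
    """Render a fingerprint as a zero-padded 64-character binary string."""
    return format(fingerprint, "0{}b".format(BITS))


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# The 64-bit example from the slides, written in the same two chunks.
original = int("1000010001000111101001011011110010111101000011100" "101101001011101", 2)
modified = int("1010010001000111101011011011110010111101000011100" "101101000011101", 2)
slide_distance = hamming(original, modified)

doc = "Bursa Malaysia opens higher as IPO demand grows"
print(to_bits(simhash(doc)), slide_distance)
```

Two documents would be treated as near-duplicates when the distance between their fingerprints is at or below a small threshold, which has to be tuned on real data.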
  32. Deduplication
      • Do not index duplicates at all, or
      • Collapse similar items in search results and display only the one with the highest score
          • Assign the same id to articles that are duplicates (call it groupid)
          • Use the Elasticsearch Top Hits aggregation to collapse results by groupid
      • Hamming distance example
          • 64-bit hash: 1000010001000111101001011011110010111101000011100101101001011101
          • Modified version: 1010010001000111101011011011110010111101000011100101101000011101
          • Hamming distance: 3
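
The collapse step can be pictured as a group-by that keeps the best hit per groupid. The function below is a client-side stand-in in plain Python, not the Elasticsearch aggregation itself: the hit dictionaries, their field names other than `groupid`, and `collapse_by_group` are assumptions for illustration.

```python
def collapse_by_group(hits):
    """Keep only the highest-scoring hit per groupid, ordered by score descending."""
    best = {}
    for hit in hits:
        current = best.get(hit["groupid"])
        if current is None or hit["score"] > current["score"]:
            best[hit["groupid"]] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)


hits = [
    {"id": "a1", "groupid": "g1", "score": 3.2},
    {"id": "a2", "groupid": "g1", "score": 4.1},  # duplicate of a1, better score
    {"id": "b1", "groupid": "g2", "score": 3.9},
]

collapsed = collapse_by_group(hits)
print([h["id"] for h in collapsed])
```

Doing the same inside Elasticsearch avoids shipping every duplicate to the client, which matters once groups get large.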
  33. Further reading
      • Dismax vs bool queries
      • Term vs text queries
      • Filter vs filtered
      • Facets (old) vs aggregations (facets reborn + statistics)
      • Geo
  34. Summary
      • ES is very flexible, with numerous features and knobs
      • Critical to understand basic analysis and the different types of queries
          • Indexing time and search time tradeoff
          • Precision and recall tradeoff
          • Complexity and memory estimation
      • Use NLP techniques as a modelling step to improve search quality
      • Pay great attention to data input and the data gathering step
  35. About Knorex
      • Founded in 2010 as a spin-off from the Data Mining Dept. of A*STAR, Singapore
      • Mission: enabling our customers to make smarter discovery and turn it into actionable insight
  36. https://www.knorex.com https://itviec.com/companies/knorex
  37. Thank you
