
You know, for search


Presentation I gave at the Elasticsearch meetup on July 12th, 2016.

Published in: Software


  1. You Know, for Search. Peter van der Weerd, De Bitmanager, 2016
  2. Who am I?
     • Peter van der Weerd
     • Search specialist
     • Self-employed (Bitmanager)
     • Enormous span of control
  3. Search
     • Common sense: easily solved
  4. Yeah, true…
     • Install ES
     • Fill it with some data
     • And o/: we can search
  5. But…
     • Are the users satisfied?
     • Many people struggle with sub-optimal search results.
  6. Search as a toolbox
     • It consists of 1 or more(!) tools to find what you need:
       o Search box
       o Faceting (intersecting)
       o Sorting
       o More like this
       o Not more like this (this is not what I mean)
       o Etc.
  7. Search at Booking
     • Destination based (city, region, airport, etc.)
     • Autocomplete: results in max 5 destinations, one query per keystroke
     • Disambiguation: show a partitioned result that enables people to choose a destination
  8. Autocomplete in action
  9. Disambiguation in action
  10. Scoring
  11. Scoring
     • Lucene scores are, in general, of the form tf * idf
     • tf = term frequency: the more often a term matches, the more important
     • idf = inverse document frequency: the more documents match the term, the less important
  12. Term frequency
     • Used to give more importance to terms that occur relatively often.
     • Scoring examples for 'house' (ordered from highest to lowest score):
       o House
       o The house
       o The little house on the prairie
       o The little house on the prairie blah blah blah
  13. Inverse document frequency
     • Prefers less frequent tokens.
     • Useless on single-token queries: it is only used to score multiple tokens relative to each other.
     • Examples (ordered from highest to lowest score): house, little, on, the
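The tf and idf ideas on the three slides above can be sketched in a few lines of Python. This is a simplified illustration loosely modeled on Lucene's classic similarity (square-root tf, logarithmic idf, a length norm); the exact formulas differ between Lucene versions, and the function names and the toy corpus are assumptions made for this example only.

```python
import math


def tf(freq: int) -> float:
    """Term frequency part: grows with the number of occurrences, but slowly."""
    return math.sqrt(freq)


def idf(doc_freq: int, num_docs: int) -> float:
    """Inverse document frequency: the more documents contain the term, the lower."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_tokens: int) -> float:
    """Shorter fields score higher, so the match is 'relatively' more frequent."""
    return 1.0 / math.sqrt(num_tokens)


def score(term: str, doc_tokens: list[str], corpus: list[list[str]]) -> float:
    freq = doc_tokens.count(term)
    if freq == 0:
        return 0.0
    doc_freq = sum(1 for doc in corpus if term in doc)
    return tf(freq) * idf(doc_freq, len(corpus)) * length_norm(len(doc_tokens))


# Toy corpus (hypothetical), mirroring the titles on the term frequency slide.
corpus = [
    ["house"],
    ["the", "house"],
    ["the", "little", "house", "on", "the", "prairie"],
]
house_scores = [score("house", doc, corpus) for doc in corpus]
```

In this sketch the shortest title gets the highest score for 'house', and a term that occurs in fewer documents gets a higher idf than one that occurs in all of them.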
  14. Drawback of idf
     • Other example (ordered from highest to lowest score): Pekela, Haarlem, Amsterdam, Paris
     • Booking switched off idf, but could have used df instead…
  15. When does idf work?
     • Idf typically works for large, text-like queries.
     • The documents *must* be evenly distributed over shards (or use dfs_query_then_fetch)
  16. Is tf * idf enough?
     • Well, no…
     • What to deliver on a query for 'Paris'?
       o The city (ehm, there are several cities named Paris)
       o Airports?
       o Hotels? Which one? There are 1000s of them.
     • Even worse: what to deliver for the query 'p' or 'pa'?
  17. Record boost
     • Based on:
       o Popularity
       o From where booked
       o Language: same (doc language == site language), local translations, English, mismatch
  18. + or x?
     • Boosts are implemented by adding
     • Intuitive justification:
       o Language could be seen as yet another (implicit!) search term
       o Same for popularity: people are typically not searching for unpopular things
     • Example (from an English site): amsterdam -> amsterdam english popular
  19. But wait…
     • How big should the record boost be? 0..1? 100?
     • The Lucene score might vary heavily, sometimes by more than 10x
     • So let's take 10 as the max record boost
     • But now the record boost might outweigh smaller scores
     • Argggggg….
  20. Score ranges
     • Difficult to tinker with:
       o For instance, use a stemmed token with boost 0.5: house^1.0 vs houses^0.5
       o What if the Lucene score is more than 2 times higher than that of the stem itself?
     • We are doing entity search vs. text search
  21. Different scorers
     • Querying for 'house':

       Title                             Score: default   Score: BM25   Score: custom
       House                             1.22             0.77          1.20
       The house                         0.76             0.61          1.10
       The little house on the prairie   0.46             0.39          1.05
  22. Normalizing scores
     • Goal: each term is scored around 1.0
       o Base score 1.0
       o Tf is normalized between 0 .. 0.2 and added to the base score
       o Idf is normalized between 0 .. 0.2 and added to the base score
     • Giving a score varying between 1 and 1.4 per term (sometimes we don't use idf)
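The normalization on the slide above could look roughly like the following Python sketch. The slide only states the ranges (base 1.0, plus 0 .. 0.2 for tf and 0 .. 0.2 for idf); how the raw tf and idf values are squeezed into 0 .. 1 beforehand is not specified, so the `tf_ratio` and `idf_ratio` inputs and the `clamp01` helper are assumptions.

```python
def clamp01(x: float) -> float:
    """Clamp a value into the range 0 .. 1."""
    return max(0.0, min(1.0, x))


def normalized_term_score(tf_ratio: float, idf_ratio: float, use_idf: bool = True) -> float:
    """Score one matched term around 1.0.

    tf_ratio and idf_ratio are assumed to be pre-scaled to 0 .. 1
    (hypothetical inputs; the slides do not say how they are derived).
    """
    score = 1.0                             # base score
    score += 0.2 * clamp01(tf_ratio)        # tf contributes 0 .. 0.2
    if use_idf:
        score += 0.2 * clamp01(idf_ratio)   # idf contributes 0 .. 0.2
    return score
```

Because every term lands between 1.0 and 1.4, a fixed boost such as 0.5 has a predictable effect, which is the point of the normalization.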
  23. Language boosting
     • Same language or English: +0.7
     • Local language: +0.3 (Roma vs Rome in an English site)
     • Mismatched language: -0.3
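A minimal Python sketch of the language boost, using the three values from the slide and adding them to a term score as described earlier in the deck. The language codes, the `classify_language` helper, and the notion of a per-destination "local language" parameter are assumptions for illustration.

```python
# Boost values taken from the slide; the category names are made up here.
LANGUAGE_BOOST = {"same": 0.7, "english": 0.7, "local": 0.3, "mismatch": -0.3}


def classify_language(doc_lang: str, site_lang: str, local_lang: str) -> str:
    """Decide which boost category a document name falls into."""
    if doc_lang == site_lang:
        return "same"
    if doc_lang == "en":
        return "english"
    if doc_lang == local_lang:
        return "local"
    return "mismatch"


def boosted_score(term_score: float, doc_lang: str, site_lang: str, local_lang: str) -> float:
    """Boosts are added to the term score, not multiplied."""
    return term_score + LANGUAGE_BOOST[classify_language(doc_lang, site_lang, local_lang)]


# 'Rome' (English) vs 'Roma' (Italian, the local language) on an English site.
rome = boosted_score(1.2, doc_lang="en", site_lang="en", local_lang="it")
roma = boosted_score(1.2, doc_lang="it", site_lang="en", local_lang="it")
other = boosted_score(1.2, doc_lang="de", site_lang="en", local_lang="it")
```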
  24. About N-grams
     • For auto-complete: left-edge N-grams
     • Rome: rome, rom, ro, r
  25. About N-grams
     • When a user types 'ro' (results ordered from highest to lowest score): Rome, Ródos, Rotterdam, etc.
     • Score depends on percentage of match (or Levenshtein distance)
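The left-edge N-gram idea and the "percentage of match" score can be sketched in Python as follows. This only illustrates the principle, not how an Elasticsearch analyzer is configured; the accent folding, the function names, and the tie-breaking by name are assumptions.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase and strip accents, so that 'Ródos' matches the prefix 'ro'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def edge_ngrams(token: str, min_len: int = 1) -> list[str]:
    """All left-edge N-grams of a token, shortest first."""
    return [token[:n] for n in range(min_len, len(token) + 1)]


def match_fraction(prefix: str, name: str) -> float:
    """Fraction of the name that is covered by the typed prefix (0.0 if no match)."""
    folded = fold(name)
    if fold(prefix) not in edge_ngrams(folded):
        return 0.0
    return len(prefix) / len(folded)


def autocomplete(prefix: str, names: list[str], limit: int = 5) -> list[str]:
    """Return at most `limit` names, best match fraction first."""
    scored = [(match_fraction(prefix, name), name) for name in names]
    hits = [(s, name) for s, name in scored if s > 0.0]
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in hits[:limit]]
```

With the names from the slide, typing 'ro' ranks Rome (2 of 4 characters) above Ródos (2 of 5) and Rotterdam (2 of 9).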
  26. Original approach
     • Multiple fields (name, city, region, etc.)
     • Combining them by a weighted dismax query
  27. Dismax query
     • More subtle way of combining scores.
     • Score = max + (sum - max) * tieBreaker
       In words: the max plus a percentage of the others
     • Edge cases:
       o Tiebreaker=0: score is the max score
       o Tiebreaker=1: score is the sum of all the individual scores (same behavior as boolean or)
  28. Dismax example
     • Q = the house
       Suppose S[the] = 0.8, S[house] = 1.2
     • Scores for different tiebreakers:
       o Bool score (tiebreaker=1): 2.0
       o Max score (tiebreaker=0): 1.2
       o Score with tiebreaker=0.1: 1.28
         This makes documents containing 'the house' a little bit more important than 'house' only.
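The dismax formula and the worked example above can be checked with a few lines of Python. The function name `dismax` is chosen for this sketch; the formula and the numbers are the ones from the slides.

```python
def dismax(scores: list[float], tie_breaker: float) -> float:
    """Score = max + (sum - max) * tieBreaker."""
    if not scores:
        return 0.0
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker


# S[the] = 0.8, S[house] = 1.2, as on the example slide.
term_scores = [0.8, 1.2]
as_max = dismax(term_scores, tie_breaker=0.0)    # the max score only
as_bool = dismax(term_scores, tie_breaker=1.0)   # same as a boolean or (the sum)
blended = dismax(term_scores, tie_breaker=0.1)   # max plus 10% of the rest
```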
  29. Difficulties
     • Lack of context
     • Hard to create a reliable scoring model
  30. Different approach
     • Canonical name: Hotel V Frederiksplein, Amsterdam, Noord-Holland, Netherlands
     • Self name (indexed): Hotel V Frederiksplein
     • Rest (indexed): Amsterdam, Noord-Holland, Netherlands
  31. Weighting fields
     • All fields are equal but some fields are more equal than others…
       o Self name is most important
       o Other names (like the city where a hotel resides) are less important
     • Dismax over self name and other
  32. Payload
     • Small piece of information that is added to every occurrence
     • Basically a byte[]
  33. Nowadays: payloads
     • We need more information per occurrence of a token:
       o Length of the original token
       o Self-name or other location info
       o Type of the name (hotel, city, landmark, etc.)
     • All the above info is encoded in a 32-bit integer, and indexed as a payload
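To make the payload idea concrete, here is a Python sketch that packs the three pieces of information into one 32-bit integer and serializes it to 4 bytes. The slides do not give the actual bit layout, so the layout below (8 bits for the token length, 1 bit for the self-name flag, 4 bits for the name type) and the type codes are purely hypothetical.

```python
import struct

# Hypothetical type codes; the slides only name hotel, city and landmark as examples.
NAME_TYPES = {"hotel": 0, "city": 1, "landmark": 2}


def encode_payload(token_length: int, is_self_name: bool, name_type: str) -> bytes:
    """Pack the per-occurrence info into a big-endian 32-bit integer (4 bytes)."""
    if not 0 <= token_length <= 0xFF:
        raise ValueError("token_length must fit in 8 bits")
    value = token_length                   # bits 0-7: length of the original token
    value |= int(is_self_name) << 8        # bit 8: self name (1) or other location info (0)
    value |= NAME_TYPES[name_type] << 9    # bits 9-12: type of the name
    return struct.pack(">I", value)


def decode_payload(payload: bytes) -> tuple[int, bool, str]:
    """Inverse of encode_payload, as a scorer would need at query time."""
    (value,) = struct.unpack(">I", payload)
    types_by_code = {code: name for name, code in NAME_TYPES.items()}
    return value & 0xFF, bool((value >> 8) & 1), types_by_code[(value >> 9) & 0xF]
```

A scorer that reads such a payload has the original token length and the field role available for every matched occurrence, without querying separate index fields.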
  34. Dismax vs payload
     • With field info in the payload we can simulate dismax behavior
     • We query only 1 index field (instead of 5)
     • Context: easier to do advanced scoring, since all info is in 1 scorer.
     • Payloads *are* possible in Elasticsearch, but more difficult to use
  35. Search
     • Difficult
     • Sensitive equilibrium
     • Impossible to serve them all
  36. Suits
  37. Suits
     • Reasons for people to wear a suit might include:
       o Hiding the fact that you cannot trust them
       o Hiding their incompetence
       o Etc.
  38. Combining fields
     • To prevent double counting, a dismax is advised.
     • The fact that a term occurs in both the title and the abstract doesn't make it roughly twice as important. But it does make it somewhat more important.
  39. Combining fields
     • Intuitive reaction: query terms in each other's neighborhood are more important…
     • Example: search for a book: chamber secrets rowling
     • Expected top result: Harry Potter and the Chamber of Secrets / J.K. Rowling
  40. Combining fields
       "_score": 2.0767038,
       "author": "De Bitmanager",
       "title": "Excerpt book",
       "abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"

       "_score": 1.2030121,
       "author": "J.K. Rowling",
       "title": "Harry Potter and the Chamber of Secrets",
       "abstract": "Fresh torments and horrors arise, including an outrageously stuck-up new professor, Gilderoy Lockhart, and a spirit named Moaning Myrtle who haunts the girls' bathroom."
     • More important if in the same field?
  41. Combining fields
     • But: we get an excerpt book that contains the requested terms (all terms were present in the abstract field)
     • Phrases behave even worse
  42. Combining fields
     • Suppose:
       o we have 2 fields: F1 and F2
       o 2 query terms: qt1 and qt2
     • Now we have choices how to combine…
  43. Combining fields
     • (F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
       This will prefer records where both terms are found in the same field
     • (F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
       This behaves more like there were no fields
  44. Combining fields: (F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
       "_score": 2.0767038,
       "author": "De Bitmanager",
       "title": "Excerpt book",
       "abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"

       "_score": 1.2030121,
       "author": "J.K. Rowling",
       "title": "Harry Potter and the Chamber of Secrets",
       "abstract": "Fresh torments and horrors arise, including an outrageously stuck-up new professor, Gilderoy Lockhart, and a spirit named Moaning Myrtle who haunts the girls' bathroom."
  45. Combining fields: (F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
       "_score": 2.1447253,
       "author": "J.K. Rowling",
       "title": "Harry Potter and the Chamber of Secrets",
       "abstract": "Fresh torments and horrors arise, including an outrageously stuck-up new professor, Gilderoy Lockhart, and a spirit named Moaning Myrtle who haunts the girls' bathroom."

       "_score": 2.0767038,
       "author": "De Bitmanager",
       "title": "Excerpt book",
       "abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
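The two ways of combining fields and terms can be compared with a small Python sketch. The per-field term scores below are invented numbers, not Elasticsearch output; they are chosen only to show how the ranking can flip between the two strategies. "|" is modeled as a plain sum (boolean or), and `dismax` uses the formula from the dismax slide.

```python
def dismax(scores: list[float], tie_breaker: float) -> float:
    best = max(scores, default=0.0)
    return best + (sum(scores) - best) * tie_breaker


# A document is modeled as: field name -> {term -> score of that term in that field}.
Doc = dict[str, dict[str, float]]


def field_centric(doc: Doc, terms: list[str], tie_breaker: float) -> float:
    """(F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2): sum per field, dismax over fields."""
    per_field = [sum(scores.get(t, 0.0) for t in terms) for scores in doc.values()]
    return dismax(per_field, tie_breaker)


def term_centric(doc: Doc, terms: list[str], tie_breaker: float) -> float:
    """(F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2): dismax per term, sum over terms."""
    return sum(
        dismax([scores.get(t, 0.0) for scores in doc.values()], tie_breaker)
        for t in terms
    )


terms = ["chamber", "secrets", "rowling"]
# Hypothetical scores: the excerpt book has all terms in its abstract only.
excerpt = {"author": {}, "title": {}, "abstract": {"chamber": 0.7, "secrets": 0.7, "rowling": 0.7}}
# The real book has the terms spread over title and author.
book = {"author": {"rowling": 1.0}, "title": {"chamber": 0.9, "secrets": 0.9}, "abstract": {}}
```

With these numbers the field-centric combination ranks the excerpt book first, while the term-centric combination ranks the real book first, which matches the direction of the change shown on the two result slides.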
  46. Combining fields
     • Of course: way more possibilities.
       o See the multi-match query for examples
       o Most but not all possibilities can be done by hand (blending)
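As a rough illustration of the multi-match query mentioned above, the following Python sketch builds two Elasticsearch request bodies for the book example: `best_fields` (field-centric, a dismax over the fields) and `cross_fields` (term-centric, treating the fields more like one big field). The field names come from the example documents; the helper name and the tie breaker value are assumptions, and the bodies are only constructed here, not sent to a cluster.

```python
import json


def multi_match_body(text: str, fields: list[str], match_type: str, tie_breaker: float = 0.1) -> dict:
    """Build a search request body with a multi_match query."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": fields,
                "type": match_type,
                "tie_breaker": tie_breaker,
            }
        }
    }


fields = ["title", "author", "abstract"]
field_centric_body = multi_match_body("chamber secrets rowling", fields, "best_fields")
term_centric_body = multi_match_body("chamber secrets rowling", fields, "cross_fields")

# The JSON that would be posted to the _search endpoint of an index.
print(json.dumps(term_centric_body, indent=2))
```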
  47. Combining fields
     • Different strategy:
       o Combine all fields as if they were one field
       o Do some re-scoring afterwards
     • Example:
       o Search 'rowling' anywhere, score 1
       o Search 'potter' anywhere, score 1
       o Combine with additional queries to do a finishing touch
  48. Explain
     • Always use explain (in debug mode)
     • Did I already tell you to always use explain?
     • Create a new application by first making explain part of your infrastructure
     • At least expose the scores in debug mode.
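One way to make explain part of the infrastructure is sketched below in Python: a debug switch adds `"explain": true` to the search body, and a small helper flattens the `_explanation` tree that Elasticsearch then returns per hit (nodes with `value`, `description` and `details`). The helper names are made up, and the sample explanation is hand-written for this example, not real cluster output.

```python
def with_explain(body: dict, debug: bool) -> dict:
    """Return a copy of the search body that asks for score explanations in debug mode."""
    result = dict(body)
    if debug:
        result["explain"] = True
    return result


def format_explanation(node: dict, depth: int = 0) -> list[str]:
    """Flatten an explanation tree into indented 'score  description' lines."""
    lines = [f"{'  ' * depth}{node['value']:.4f}  {node['description']}"]
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines


# Hand-written sample in the shape of an _explanation object (not real output).
sample = {
    "value": 1.28,
    "description": "max plus 0.1 times others of:",
    "details": [
        {"value": 1.2, "description": "weight(title:house)", "details": []},
        {"value": 0.8, "description": "weight(title:the)", "details": []},
    ],
}
debug_body = with_explain({"query": {"match": {"title": "the house"}}}, debug=True)
report = format_explanation(sample)
```

Logging or displaying `report` next to each hit in debug mode is usually enough to see which clause dominates a score.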
  49. Suits: beware the logic rules…
     • Cannot be reversed:
     • The fact that I am not wearing a suit does not imply that:
       o I am trustworthy
       o I am competent
  50. You Know, for Bits…
     • Peter @ bitmanager.nl
