
Enterprise search results predictability analysis


An analysis of an enterprise search tool with out-of-the-box settings. Documents are placed into a repository with specific keywords. Are we able to predict the search result order? Does an 'exact match' rank higher than a plural? Does a keyword that appears earlier in a document carry more weight than the same keyword appearing later in a document with the same number of words? How do caps, plurals, and special characters affect your search? All of this and more is analyzed and presented in this PowerPoint, modified for public review.


  1. Trying to predict search results using an enterprise search tool with out-of-the-box settings
  2. Understand enterprise search capabilities at a granular level, from a user’s perspective. Address general queries, including:
     - Does the position of a word in a document weigh more heavily than the number of times the word appears?
     - Does singular vs. plural affect results or rank?
     - Does lemmatization work? Will a search on “good” pick up documents with the term “better” or “best”?
  3. Findings:
     - No obvious logic determines how documents are ranked
       - Relevance order changes when filters are applied
       - Adding further filters, or changing the order in which filters are applied, does not change relevance
       - Relevance order changes as additional documents are added
     - Document name and SharePoint fields are weighted more heavily than terms in the document
     - The number of word hits ranks higher than the share of the document's words that are hits
       - A document with “dog” 20 times in an 80-word document (25%) is more relevant than a document with “dog” 15 times in a 15-word document (100%)
     - Case sensitivity / Casedex is not a factor in finding documents, but could be a factor in relevance
     - Natural wildcarding is not present
       - “Dog” does not pick up “dogma”, “cat” does not pick up “catholic”
     - No synonym matching / no thesaurus
       - “Canine” and “K9” did not bring up documents with the word “dog”
     - Lemmatization is present but erratic
       - “Good”, “better” and “best” all hit on the document with “better” in the text. All missed the document with “good” in the text, even though other document results highlighted the terms “good”, “goods”, “better” and “best”.
     - Search within search does not exist
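The count-versus-density finding on slide 3 can be illustrated with a short sketch. The two scoring functions below are hypothetical and are not the search tool's actual algorithm, which the analysis could not determine; they only show that ranking by raw hit count reproduces the observed order, while ranking by density would reverse it.

```python
# Two hypothetical scoring rules; neither is the search tool's real algorithm.
docs = {
    "80 words, 'dog' x20": {"hits": 20, "words": 80},
    "15 words, 'dog' x15": {"hits": 15, "words": 15},
}


def by_hit_count(doc):
    """Score a document by the raw number of keyword hits."""
    return doc["hits"]


def by_density(doc):
    """Score a document by hits as a fraction of its total words."""
    return doc["hits"] / doc["words"]


def rank(documents, score):
    """Return document names ordered from most to least relevant."""
    return sorted(documents, key=lambda name: score(documents[name]), reverse=True)


count_order = rank(docs, by_hit_count)
density_order = rank(docs, by_density)
print("by hit count:", count_order)
print("by density:  ", density_order)
```

Only the hit-count ordering puts the 80-word document first, which matches what the slide reports.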
  4. Findings, continued:
     - Basic wildcard searching worked appropriately (asterisk [*] at end of word)
     - Advanced wildcard did not work
     - When searching on singular, plurals were not always brought back (‘dog’ brought back some docs with ‘dogs’ but missed others with ‘dogs’)
     - When searching on singular, sometimes plurals had higher relevancy
     - Searching on plural did not bring back singular (‘dogs’ never brought back docs with ‘dog’)
     - Misspelled words in documents were not picked up
     - Identical documents in different formats came back in the order: .doc, .docx, .pdf, .xls, .xlsx
  5. Enterprise Search Engine

     | Feature | Result |
     |---|---|
     | Search within search | No |
     | Lemmatization | dog does not pick up synonyms, better picks up good and best, warm does not pick up hot, USB does not pick up Universal Serial Bus, car does not pick up automobile |
     | Stemming | Searches singular and plural, but not lemma, i.e. dog picks up dog and dogs, but not dogma |
     | Case sensitivity / Casedex | Does not impact results |
     | SharePoint fields | Weighs SharePoint Title, Description, then Comments heavily |
     | Multi-word search | Defaults to “and” |
     | Boolean | Operators must be capitalized; includes different forms (singular vs. plural) |
     | OR | dog OR cat and cat OR dog come back in a different order |
     | NEAR | Yes; includes singular and plural forms |
     | WITHIN | No |
     | AND | dog AND cat and cat AND dog return the same documents in different relevancy |
     | NOT | Yes; the excluded word cannot be in text, title, doc name, description, comments, etc. |
     | Wildcard matching | Yes |
     | Advanced wildcard | No |
     | Anti-phrasing | Searches your request but asks “did you mean” |
     | Typo / Levenshtein | No |
     | Accent normalization | Searching with any variation of a word picks up all variations of the word |
     | Periods within abbreviations | usb pulls up usb and u.s.b., but u.s.b. does not pull up usb |
     | Abbreviations | No |
     | Dates / entity normalization | See appendix |
     | Punctuation variations | See appendix |
     | Soundex / sound-alike | No |
     | Thesaurus / synonym matching | No |
     | Duplicates | Modifies search result to show 1 document (def.docx) and show there is another document with the same information and allows you … |
  6. Enterprise Search Engine: weighting (heaviest to lightest)
     1. Document Name
     2. SharePoint Title
     3. SharePoint Description
     4. SharePoint Comments vs. number of word hits in a document
     5. Number of hits on a word*

     \* Create date, modified date, upload date, and crawl order not fully factored. Ranking starts to become harder to predict.
  7. Rank of each document for the search “dog”

     | Document details | Original search: dog | After applying filter | After adding documents |
     |---|---|---|---|
     | 11 words, 9th word “dog” | 5 | 8 | 11 (7) |
     | 11 words, 9th word “dogs” | 8 | 6 | 12 (8) |
     | 10 words, “Cat”, doc name dog.docx | 1 | 1 | 1 (1) |
     | 10 words, all “dog” | 3 | 3 | 6 (3) |
     | 10 words, “Cat”, 4th word replaced w/ “Dog” | 7 | 5 | 10 (6) |
     | 10 words, “Cat”, 10th word replaced w/ “Dog” | 6 | 7 | 9 (5) |
     | 1 word “farm”, SP Description = “dog” | 2 | 2 | 2 (2) |
     | 10 words, “Catholic”, 4th word replaced w/ “Dogma” | X | X | X |
     | 2 words, SP Comments contains 5 words, all “dog” | 4 | 4 | 7 (4) |
     | *These documents were not added until the following day:* | | | |
     | 15 words, all “dog” | | | 4 |
     | 20 phrases, “I am a dog” (80 words, 20 words “dog”) | | | 3 |
     | 12 words, all “dog” | | | 5 |
     | 12 words, all “Dog” | | | 8 |
     | 12 words, all “dogs” | | | X |
     | 12 words, all “Dogs” | | | X |

     The number signifies the rank. For 11 (7), the 11 is the overall rank and the 7 is the rank amongst the original 9 docs.
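The “11 (7)” notation on slide 7 pairs an overall rank with a rank among the original documents only. As a minimal sketch, the parenthesized value can be recomputed from the overall ranks by discarding the later-added documents and renumbering. The dictionary below copies the overall ranks from the slide; the helper name `rank_among` is an illustrative assumption.

```python
# Overall rank after the new documents were added, for the original documents
# that were returned. The "Dogma" document was never returned (X), so it is omitted.
overall_rank = {
    '11 words, 9th word "dog"': 11,
    '11 words, 9th word "dogs"': 12,
    '10 words, "Cat", doc name dog.docx': 1,
    '10 words, all "dog"': 6,
    '10 words, "Cat", 4th word replaced w/ "Dog"': 10,
    '10 words, "Cat", 10th word replaced w/ "Dog"': 9,
    '1 word "farm", SP Description = "dog"': 2,
    '2 words, SP Comments contains 5 words, all "dog"': 7,
}


def rank_among(ranks):
    """Renumber documents 1..n in order of their overall rank."""
    ordered = sorted(ranks, key=ranks.get)
    return {name: position for position, name in enumerate(ordered, start=1)}


relative_rank = rank_among(overall_rank)
for name, overall in overall_rank.items():
    print(f"{overall} ({relative_rank[name]})  {name}")
```

The recomputed values agree with the parenthesized ranks shown on the slide.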
  8. Lemmatization

     | Search term: | Good | Better | Best |
     |---|---|---|---|
     | Document containing “Good” | x | | x |
     | Document containing “Better” | x | | |
     | Document containing “Best” | x | | |
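  9. Irregular plurals

     | Document contains… | goose | geese | Goose | Geese | gooses | geeses |
     |---|---|---|---|---|---|---|
     | Goose | 2 | X | 2 | X | X | X |
     | Geese | 1 | 1 | 1 | 1 | X | X |

     Column headings are the search terms.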
  10. Misspellings

      | Document contains… | chicken | chickin | chikcen |
      |---|---|---|---|
      | chikcen | X | X | 1 |
      | chickin | X | 1 | X |

      Column headings are the search terms.
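Slide 10 shows that a misspelled term only matched a document containing the identical misspelling, which fits the “Typo / Levenshtein: No” entry on slide 5. For reference, here is a minimal sketch of what Levenshtein-based matching would involve. The `fuzzy_match` helper and its `max_distance` threshold of 2 are illustrative assumptions, not features of the tested tool.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or keep on a match)
                )
            )
        previous = current
    return previous[-1]


def fuzzy_match(query: str, term: str, max_distance: int = 2) -> bool:
    """Hypothetical typo-tolerant match: accept terms within max_distance edits."""
    return levenshtein(query.lower(), term.lower()) <= max_distance


for term in ("chickin", "chikcen"):
    print(term, levenshtein("chicken", term), fuzzy_match("chicken", term))
```

With a tolerance of two edits, a search on “chicken” would accept both misspelled documents from the slide, where the tested tool returned neither.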
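  11. Date formats

      | Doc contains: | 6/21/14 | 6/21/2014 | June 21st, 2014 | 21 June 2014 | June 21, 2014 | 06/21/14 | 6-21-14 | 06-21-2014 | 6-21-2014 | 21-Jun-14 | 21-Jun-2014 |
      |---|---|---|---|---|---|---|---|---|---|---|---|
      | June 21, 2014 | X | X | X | 3 | 3 | X | X | X | X | X | X |
      | 6/21/14 | 1 | X | X | X | X | X | 1 | X | X | X | X |
      | 06-21-14 | X | X | X | X | X | X | X | 1 | X | X | X |
      | June 21st, 2014 | X | X | 1 | 1 | 1 | X | X | X | X | X | X |
      | 21-Jun-14 | X | X | X | 2 | 2 | X | X | X | X | X | X |

      Column headings are the search terms.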
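The scattered hits on slide 11 suggest the tool does not normalize dates to a single canonical value. As a minimal sketch of what such normalization could look like, the function below parses each of the searched formats into one `date` object. The format list and the `normalize_date` name are assumptions for illustration, month names are assumed to be English, and slashed or dashed dates are assumed to be month-first as on the slide.

```python
import re
from datetime import date, datetime

# Candidate formats, tried in order. Two-digit-year patterns fail cleanly on
# four-digit years (and vice versa), so the order does not cause misparses here.
FORMATS = [
    "%m/%d/%y", "%m/%d/%Y",
    "%m-%d-%y", "%m-%d-%Y",
    "%d-%b-%y", "%d-%b-%Y",
    "%B %d, %Y", "%d %B %Y",
]


def normalize_date(text: str) -> date:
    """Parse a date written in any of the known formats into a datetime.date."""
    # Drop ordinal suffixes such as the "st" in "21st" before parsing.
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text.strip())
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


VARIANTS = [
    "6/21/14", "6/21/2014", "June 21st, 2014", "21 June 2014", "June 21, 2014",
    "06/21/14", "6-21-14", "06-21-2014", "6-21-2014", "21-Jun-14", "21-Jun-2014",
]
normalized = {variant: normalize_date(variant) for variant in VARIANTS}
for variant, value in normalized.items():
    print(f"{variant!r:20} -> {value.isoformat()}")
```

If both the indexed text and the query were normalized this way, every search variant in the table would map to the same value and could match every document variant.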
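  12. Special characters

      | Doc contains: | PS3 | PS/3 | PS 3 | PS-3 |
      |---|---|---|---|---|
      | PS/3 | X | 3 | 1 | 3 |
      | PS-3 | X | 2 | 4 | 2 |
      | PS3 | 1 | X | X | X |
      | PS 3 | X | 4 | 2 | 4 |
      | Ps-3 | X | 1 | 3 | 1 |

      Special characters, continued

      | Doc contains: | AB12345 | AB 12345 | AB.12345 | AB-12345 |
      |---|---|---|---|---|
      | AB-12345 | X | 2 | 2 | 2 |
      | AB 12345 | X | 1 | 1 | 1 |

      Column headings are the search terms.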
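On slide 12, the separated forms (“PS/3”, “PS 3”, “PS-3”) find one another, but the unseparated “PS3” only finds itself. One way to make all variants equivalent, sketched below as an assumption rather than as anything the tested tool does, is to strip every non-alphanumeric character and lowercase the result before indexing and querying. The `normalize_token` name is hypothetical.

```python
import re


def normalize_token(text: str) -> str:
    """Lowercase the text and drop every character that is not a letter or digit."""
    return re.sub(r"[^0-9a-z]+", "", text.lower())


part_variants = ["PS3", "PS/3", "PS 3", "PS-3", "Ps-3"]
code_variants = ["AB12345", "AB 12345", "AB.12345", "AB-12345"]

for variant in part_variants + code_variants:
    print(f"{variant!r:12} -> {normalize_token(variant)!r}")
```

The trade-off is that this also merges strings a user may want kept apart, so it suits identifiers such as part numbers better than general text.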
