The document discusses an analysis of an enterprise search tool's default settings for ranking and matching documents. Some key findings include:
- Document name and metadata fields weigh more heavily than term frequency in determining relevance.
- Singular and plural terms are matched inconsistently, and lemmatization is erratic. Synonyms and abbreviations are not matched.
- Basic wildcard searches work but advanced wildcards do not. Case sensitivity does not impact results.
- Identical documents in different formats are returned in a consistent order (.doc, .docx, .pdf, .xls, .xlsx). Date formats and punctuation variants match only in certain combinations.
Enterprise search results predictability analysis
1. Trying to predict search results using an enterprise search tool with out-of-the-box settings
2. Understand enterprise search capabilities at a granular level
From a user's perspective
Address general queries including…
Does the position of a word in a document weigh more heavily than the number of times the word appears?
Does singular vs. plural affect results / rank?
Does lemmatization work? Will a search on “good” pick up documents with the term “better” or “best”?
3. No obvious logic determining how documents are ranked
Relevance order changes when filters are applied
Adding further filters, or changing the order in which filters are applied, does not change relevancy
Relevance order changes as additional documents are added
Document name and SharePoint fields are weighted more heavily than terms in the document
The number of word hits counts for more than the proportion of words that are hits
A document with “dog” 20 times in an 80-word document (25%) is more relevant than a document with “dog” 15 times in a 15-word document (100%)
Case sensitivity / Casedex is not a factor in finding documents, but could be a factor in relevance
Natural wildcarding is not present
“Dog” does not pick up “dogma”; “cat” does not pick up “catholic”
No synonym matching / no thesaurus
“Canine” and “K9” did not bring up documents with the word “dog”
Lemmatization is present but erratic
Searches on “good”, “better” and “best” each hit the document with “better” in the text. All missed the document with “good” in the text, even though the results for other documents highlighted the terms “good”, “goods”, “better” and “best”.
Search within search does not exist
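The hit-count finding above can be illustrated with a minimal Python sketch. It compares two hypothetical scoring functions, raw hit count and hit density, on the two “dog” test documents. The function names are illustrative assumptions and say nothing about how the engine actually scores documents.

```python
def hit_count(words, term):
    """Number of times `term` appears in the document (case-insensitive)."""
    return sum(1 for w in words if w.lower() == term.lower())


def hit_density(words, term):
    """Fraction of the document's words that are `term`."""
    return hit_count(words, term) / len(words) if words else 0.0


# The two test documents from the analysis (contents rebuilt for illustration).
docs = {
    "80_words_20_hits": ["I", "am", "a", "dog"] * 20,  # 25% density
    "15_words_15_hits": ["dog"] * 15,                  # 100% density
}

# Hypothetical ranking by raw hit count: the 80-word document comes first,
# which is the order the analysis observed.
by_count = sorted(docs, key=lambda name: hit_count(docs[name], "dog"), reverse=True)

# Hypothetical ranking by density: the 15-word document would come first,
# which is NOT what the analysis observed.
by_density = sorted(docs, key=lambda name: hit_density(docs[name], "dog"), reverse=True)

print(by_count)
print(by_density)
```

Only the count-based ordering agrees with the observed result, which suggests (but does not prove) that the engine does not normalize by document length.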
4. Basic wildcard searching worked appropriately (asterisk [*] at end of word)
Advanced wildcard did not work
When searching on the singular, plurals were not always brought back (‘dog’ brought back some docs with ‘dogs’ but missed others with ‘dogs’)
When searching on the singular, plurals sometimes had higher relevancy
Searching on the plural did not bring back the singular (‘dogs’ never brought back docs with ‘dog’)
Misspelled words in documents were not picked up
Identical documents in different formats came back in the order: .doc, .docx, .pdf, .xls, .xlsx
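The basic (trailing-asterisk) wildcard behaviour described above can be sketched as simple prefix matching. This is a minimal illustration under the assumption that matching is case-insensitive and word-based; `wildcard_match` is a hypothetical helper, not part of the search tool, and it deliberately leaves out the inconsistent singular/plural matching.

```python
def wildcard_match(query, word):
    """Return True if `word` matches `query`.

    A trailing asterisk turns the query into a prefix match ("dog*" matches
    "dogma"). Without the asterisk only an exact, case-insensitive match
    counts, mirroring the finding that natural wildcarding is not present.
    """
    q, w = query.lower(), word.lower()
    if q.endswith("*"):
        return w.startswith(q[:-1])
    return q == w


words = ["dog", "Dog", "dogs", "dogma"]
print([w for w in words if wildcard_match("dog", w)])
print([w for w in words if wildcard_match("dog*", w)])
```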
5. Enterprise Search Engine
Search within search: No
Lemmatization: dog does not pick up synonyms; better picks up good and best; warm does not pick up hot; USB does not pick up Universal Serial Bus; car does not pick up automobile
Stemming: Searches singular and plural, but not lemma, i.e. dog picks up dog and dogs, but not dogma
Case sensitivity / Casedex: Does not impact results
SharePoint fields: Weighs SP Title, Description, then Comments heavily
Multi-word search: Defaults to "and"
Boolean: Must be capitalized… includes different cases (singular vs plural)
OR: dog OR cat, cat OR dog come back in a different order
NEAR: Yes… includes singular and plural case
WITHIN: No
AND: dog AND cat, cat AND dog return the same documents in different relevancy order
NOT: Yes... excluded word cannot be in text, title, doc name, description, comments, etc.
Wildcard matching: Yes
Advanced wildcard: No
Anti-phrasing: Searches your request but asks, "did you mean"
Typo / Levenshtein: No
Accent normalization: Search with all variations of words picks up all variations of words
Periods within abbreviations: usb pulls up usb and u.s.b., but u.s.b. does not pull up usb
Abbreviations: No
Dates / Entity normalization: See appendix
Punctuation variations: See appendix
Soundex / sound alike: No
Thesaurus / synonym matching: No
Duplicates: Modifies search result to show 1 document (def.docx) and show there is another document with the same information and allows you
7. Enterprise Search Engine
Weighting (heaviest to lightest)
Document Name
SharePoint Title
SharePoint Description
SharePoint Comments vs. number of word hits in a document
Number of hits on a word
* Create date, modified date, upload date, crawl order not fully factored
Starts to become harder to predict ranking
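One way to picture this weighting is as a tuple sort key that checks the heavier fields first. The sketch below is a hypothetical model, not the engine's algorithm: the field names, the choice to pool comment hits with body hits, and the placeholder body text are all assumptions. It happens to reproduce the top four ranks of the original “dog” search in the document-details test set, and it does not attempt to explain the order of the lower-ranked documents.

```python
def hits(text, term):
    """Case-insensitive count of whole-word occurrences of `term`."""
    return sum(1 for w in text.lower().split() if w == term)


def sort_key(doc, term):
    """Hypothetical key: heavier fields are compared before lighter ones."""
    return (
        term in doc["name"].lower(),                         # document name
        hits(doc["title"], term) > 0,                        # SharePoint Title
        hits(doc["description"], term) > 0,                  # SharePoint Description
        hits(doc["comments"], term) + hits(doc["body"], term),  # pooled hit count
    )


def make_doc(name, body, title="", description="", comments=""):
    return {"name": name, "title": title, "description": description,
            "comments": comments, "body": body}


# Four documents modelled on the test set; file names other than dog.docx
# and the two-word body of the last document are placeholders.
docs = [
    make_doc("all_hits.docx", "dog " * 10),
    make_doc("comments.docx", "lorem ipsum", comments="dog " * 5),
    make_doc("farm.docx", "farm", description="dog"),
    make_doc("dog.docx", "cat " * 10),
]

ranked = [d["name"] for d in sorted(docs, key=lambda d: sort_key(d, "dog"), reverse=True)]
print(ranked)
```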
8. Document Details
Rank of each document at each stage (X = document not returned)
Document details | Original search: dog | After applying filter | After adding documents
11 words, 9th word “dog” | 5 | 8 | 11 (7)
11 words, 9th word “dogs” | 8 | 6 | 12 (8)
10 words, “Cat”, doc name dog.docx | 1 | 1 | 1 (1)
10 words, all “dog” | 3 | 3 | 6 (3)
10 words, Cat, 4th replaced w/ “Dog” | 7 | 5 | 10 (6)
10 words, Cat, 10th word replaced w/ “Dog” | 6 | 7 | 9 (5)
1 word “farm”, SP Description = “dog” | 2 | 2 | 2 (2)
10 words “Catholic”, 4th replaced w/ “Dogma” | X | X | X
2 words, SP Comments contains 5 words, all “dog” | 4 | 4 | 7 (4)
The following documents were not added until the following day, so they have a rank only after adding documents:
15 words, all “dog” | 4
20 phrases, “I am a dog” (80 words, 20 words “dog”) | 3
12 words, all “dog” | 5
12 words, all “Dog” | 8
12 words, all “dogs” | X
12 words, all “Dogs” | X
The number signifies the rank. For 11 (7), the 11 is the overall rank and the 7 is the rank amongst the original 9 docs.
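The bracketed figures can be derived from the overall ranks. The short sketch below recomputes the rank amongst the original documents from the overall ranks reported after the extra documents were added; the dictionary keys are shorthand labels introduced here for illustration.

```python
# Overall rank after adding documents, for the original documents that
# were returned (the "Dogma" document was never returned, so it is omitted).
overall_rank = {
    "9th word dog": 11,
    "9th word dogs": 12,
    "doc name dog.docx": 1,
    "10 words all dog": 6,
    "4th word Dog": 10,
    "10th word Dog": 9,
    "SP Description dog": 2,
    "SP Comments dog": 7,
}


def rank_within(ranks):
    """Re-rank a subset of documents 1..n, preserving their relative order."""
    ordered = sorted(ranks, key=ranks.get)
    return {name: position for position, name in enumerate(ordered, start=1)}


relative = rank_within(overall_rank)
for name, rank in overall_rank.items():
    print(f"{name}: {rank} ({relative[name]})")
```

The gaps in the overall sequence (ranks 3, 4, 5 and 8) are the positions taken by the documents added the following day.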
9. Search Term: Good Better Best
Document containing “Good” x x
Document containing “Better” x
Document containing “Best” x
10. Search Term: goose, geese, Goose, Geese, gooses, geeses
Document contains… | goose | geese | Goose | Geese | gooses | geeses
Goose | 2 | X | 2 | X | X | X
Geese | 1 | 1 | 1 | 1 | X | X
11. Search Term: chicken, chickin, chikcen
Document contains… | chicken | chickin | chikcen
chikcen | X | X | 1
chickin | X | 1 | X
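The results above show that only exact spellings matched, which is consistent with the engine having no typo (Levenshtein) tolerance. For reference, this is a minimal sketch of the standard Levenshtein edit distance, showing how close the misspellings are to the intended word. It illustrates the technique the engine does not apply and is not part of the tool.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b), # substitute (or keep)
            ))
        previous = current
    return previous[-1]


for misspelling in ("chickin", "chikcen"):
    print(misspelling, levenshtein("chicken", misspelling))
```

A typo-tolerant engine would typically accept matches within a small distance threshold. Here "chickin" is one substitution away from "chicken", and "chikcen" is two edits away because plain Levenshtein counts a transposition as two operations.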
12. Date Formats
Doc contains: | 6/21/14 | 6/21/2014 | June 21st, 2014 | 21 June 2014 | June 21, 2014 | 06/21/14 | 6-21-14 | 06-21-2014 | 6-21-2014 | 21-Jun-14 | 21-Jun-2014
June 21, 2014 | X | X | X | 3 | 3 | X | X | X | X | X | X
6/21/14 | 1 | X | X | X | X | X | 1 | X | X | X | X
06-21-14 | X | X | X | X | X | X | X | 1 | X | X | X
June 21st, 2014 | X | X | 1 | 1 | 1 | X | X | X | X | X | X
21-Jun-14 | X | X | X | 2 | 2 | X | X | X | X | X | X
(Column headings are the search terms.)
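Most format pairs above did not match, which suggests the engine does not reduce dates to a single canonical value. As a point of comparison, here is a minimal sketch of what full date normalization could look like, using Python's standard `datetime.strptime`. The format list and the `normalize_date` helper are assumptions made for illustration, and month-name parsing assumes an English locale.

```python
import re
from datetime import date, datetime

# Candidate formats covering the variants tested above (an assumed list).
FORMATS = [
    "%m/%d/%y", "%m/%d/%Y",   # 6/21/14, 06/21/14, 6/21/2014
    "%m-%d-%y", "%m-%d-%Y",   # 6-21-14, 06-21-14, 06-21-2014, 6-21-2014
    "%B %d, %Y", "%d %B %Y",  # June 21, 2014 / 21 June 2014
    "%d-%b-%y", "%d-%b-%Y",   # 21-Jun-14 / 21-Jun-2014
]


def normalize_date(text):
    """Return a `date` for any recognised format, or None if nothing fits."""
    # Drop ordinal suffixes such as the "st" in "21st".
    cleaned = re.sub(r"(\d)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None


variants = [
    "6/21/14", "6/21/2014", "June 21st, 2014", "21 June 2014",
    "June 21, 2014", "06/21/14", "6-21-14", "06-21-2014",
    "6-21-2014", "21-Jun-14", "21-Jun-2014", "06-21-14",
]
print({normalize_date(v) for v in variants})
```

With normalization of this kind, every variant would resolve to the same date and any search format would find any document format, unlike the partial matching recorded in the table.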
13. Special Characters
Doc contains: | PS3 | PS/3 | PS 3 | PS-3
PS/3 | X | 3 | 1 | 3
PS-3 | X | 2 | 4 | 2
PS3 | 1 | X | X | X
PS 3 | X | 4 | 2 | 4
Ps-3 | X | 1 | 3 | 1
Special Characters cont.
Doc contains: | AB12345 | AB 12345 | AB.12345 | AB-12345
AB-12345 | X | 2 | 2 | 2
AB 12345 | X | 1 | 1 | 1
(Column headings are the search terms.)
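The match / no-match pattern in these results is consistent with a tokenizer that lowercases text and treats any punctuation or space as a word break. The sketch below is one such hypothetical tokenizer; it is an assumption that fits the observed hits and misses, not a description of the engine's real tokenizer, and it makes no attempt to reproduce the rank order.

```python
import re


def tokens(text):
    """Lowercase the text and split on any run of non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def matches(query, doc_text):
    """True when every query token appears among the document's tokens."""
    doc_tokens = tokens(doc_text)
    return all(t in doc_tokens for t in tokens(query))


docs = ["PS/3", "PS-3", "PS3", "PS 3", "Ps-3"]
for query in ("PS3", "PS/3", "PS 3", "PS-3"):
    print(query, [d for d in docs if matches(query, d)])
```

Under this model "PS3" is a single token and so matches only the document containing "PS3", while "PS/3", "PS 3" and "PS-3" all reduce to the tokens "ps" and "3" and therefore find each other.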