1. The Fundamentals of Enterprise Search
KMWorld 2009. Avi Rappoport, Search Tools Consulting, www.searchtools.com, consult9@searchtools.com. www.searchtools.com/slides/kmw09/fundamentals-of-search.html

2. What’s In This Workshop
Overview of enterprise search, in context. Search engine processes: robot spiders and database access; indexing; security; query parsing, retrieval, and relevance ranking. Usable search interfaces. Maintenance and analytics. Methods for choosing a good search engine.
Fundamentals of Search Engines 2009 / © Avi Rappoport, www.searchtools.com

3. About SearchTools
Avi Rappoport is a librarian (MLIS from Berkeley): software developer and product manager, user interface designer, long-time search consultant, Editor & Publisher of www.searchtools.com. Search Tools Consulting: search needs analysis and recommendations, enterprise search evaluation, outsourced search administration.

4. Defining Enterprise Search
Large-scale web site search: corporate sites, institutional sites, online stores. Intranet search: crossing departmental lines, opening data silos. Extranets. Portal search.

5. Similarities to Webwide Search
Robot crawlers: HTML over HTTP. Scaling to millions of items: distributed processing. Full-text indexing of content. Simple query language. Relevance ranking of results: TF-IDF (term frequency : inverse document frequency). Familiar results list.

6. Differences from Web Search
Limited scope: a site, set of sites, extranet, or intranet. Few meaningful hyperlinks: PageRank and link analysis are less useful. Security and access control issues. Content in databases, CMSs, etc.
More control. Index update scheduling. Some content is very valuable, other content is not. No search spam.

7. Text Search vs. Database Search
Indexes multiple content sources: database fields, files, web pages, feeds... Simple search commands instead of SQL. Flexible indexing and retrieval. Relevance ranking (this is a major issue). Does not compete for database resources. Easy to scale separately from DBMS. New features: spellcheck, auto-complete, facets. Works in the real world, from eBay to Google.

8. Search and Information Architecture
Information Architecture: the art and science of organizing information for access and use. IA work enriches search: creates order and systems, provides standard vocabulary, removes ROT (redundant, obsolete, trivial). Search supplements IA: supports user vocabularies, changes dynamically with new content.

9. Search and Taxonomy
Taxonomy creates categories: labels and metadata, improves quality of search results, additional metadata extremely valuable. Search crosses categories: bypasses ambiguous topic labels, useful for novices, supports user vocabulary, dynamic updates for new topics.

10. Search & Knowledge Management
KM is: “The process through which organizations generate value from their intellectual and knowledge-based assets.” (CIO Magazine) Organizes information, processes and people. Offers collaboration and archiving tools. Attempts to regularize implicit knowledge. Search mostly matches words.

11.
Two Main Types of Search
Known-item search: short queries, “good-enough” answers. Exploratory search: research - finding unknowns (scientific, legal, medical, business, sales); conceptual overviews. Completeness - all possible relevant items: law enforcement, medicine.

12. Search as an Iceberg
All people see are the search box and results list. Invisible functionality: indexes, query processing, retrieval, relevance ranking. Search is a mystery. But it’s just software.

13. Elements of Search Engines
Automated tools to collect content. Specialized storage for quick retrieval. Query processing and expansion. Retrieval (matching query to index content). Relevance ranking. Search results interfaces. Analytics, metrics and maintenance.

14. Choosing Content To Index
Information sites: consider indexing every single page; use search indexing as a discovery mechanism. Online stores, catalogs: product information (cost, color, size, materials); other (return policies, CEO’s name, jobs listing). Intranets: intranet portal and core servers; may need archive servers and search. Multimedia (images, audio, video): metadata at least.

15. (Near) Real Time Indexing
Twitter has changed expectations, even in intranets. Index must support partial updates. Search engines finding limits at scale: distribute indexing and indexes. Trigger index updates (push vs. pull): continuous feed, send web service message, database trigger, update watched URLs with new links.

16.
Indexing and Security
Search can undermine “security by obscurity”: one link can expose a whole set of documents. Work with your security team: list areas which contain sensitive content, define words which trigger further analysis, create a process for removing sensitive data. Indexing encrypted content: search engine uses SSL client for indexing, encrypt search results before returning, physical security on search servers.

17. Search and Access Control
Authentication and authorization in indexing: “Basic authentication” (user name and password), NT Security integration, ACLs and single sign-on. Conform to security rules during indexing: keep access control info as part of document store. Showing results - who can see what? Access to search engine itself; collection-level access control (locked results as teaser for subscription); hit-level access control (check before displaying results).

18. Indexing: Sources of Content
Web sites. Intranets. Extranets. Blogs. Wikis. Mailing list archives & email public folders. File systems & shared servers: NFS, SMB, AFP, GFS, FTP, WebDAV. Content Management Systems. Databases. Legacy programs in silos.

19. Indexing: Robot Spiders
Start with base URL for all hosts. For each page, repeat: read text into internal format; save document in cache; save words into index; extract all links and check the rules; if they are new URLs, add them to the list.

20. Robot Indexing Spider

21. Common Problems With Robots
Pages that are not linked from anywhere. Spider disallowed by robots.txt or robots meta. URLs with ?
and & (all robots should handle these now). JavaScript, forms, and interactive dynamic links: some robots can handle some of these. Session IDs that change. Duplicate detection: multiple views of the same data (Lotus, wikis), symbolic links & bad redirects, multiple copies of files or directories.

22. Indexing: Other Data Sources
RSS feeds: nice clean text. File servers: SMB, file:/// etc. Content / Document Management Systems. Email archives. Databases via ODBC, JDBC, Oracle API: full-text content; metadata (library catalog records, yellow pages). External sources using APIs (application programming interfaces): news feeds (Reuters, AP), Twitter.

23. Indexing: Text Files
Plain text is easy. RTF export format: text easy to find. HTML semi-structured text: content is between tags and in attributes; text generated by JavaScript is hard to extract; bad HTML, especially missing closing tags. XML files (structured): many tags are document-level; content is between tags and in attributes; complex tag hierarchy; TEI (Text Encoding Initiative) & Semantic Web; XQuery and XPath tools.

24. Indexing: Binary File Formats
PDF: scanned files may not have any text; bad PDF generators break words at columns; “shadow” text effect duplicates letters. SWF and Flash: API may not load dynamic text. Office documents: word processing files (may have hidden text from revisions), spreadsheets (hard to know what to grab), presentations. Note: new docx, xlsx, pptx are really XML file sets. CAD and project files. Metadata (properties, Adobe XMP).

25.
Indexing: Tokenizing
Lowercase all characters (aka ‘folding’). Tokenizing makes words searchable: break on punctuation and spaces; recognize special words (C++ @ [TS]); typography issues (the ligature ﬆ is really “st”); HTML escaped text (m&ouml;chten = möchten). Special cases for structured strings: numbers, prices, dates. N-grams - an alternate approach: break into short text patterns; takes a lot of index space.

26. Indexing: Character Set Issues
World has many charsets (aka scripts, alphabets): English has a simple alphabet (26 letters, 10 digits); other Roman languages are extended (ç, î, ß); non-Roman one byte (Cyrillic, Arabic, Hebrew); Asian two bytes (Chinese, Japanese, Korean). Identifying character sets: Unicode characters; older usage, language “code pages”; HTTP header or <META http-equiv>; statistical detection techniques.

27. Indexing: Language Issues
Text search works across languages: simple pattern-matching, query to index. Language-specific indexing improves search: tokenizing using appropriate rules; compound nouns (kindergarten); language rules for stemming (singular version of thés is thé). Language detection: trusted tags; bilingual dictionaries; statistical matches, n-grams. Documents may have mixed languages…

28. Indexing: Multimedia
Images, photos, drawings, sound, scores, video. External metadata: file name; link text, surrounding words. Internal metadata: ID3 tags for music, EXIF and other digital photo information, subtitles (sometimes). Content: OCR to extract graphic text and closed captions; audio speech-to-text conversion, still buggy. Use human judgment, not just automated systems.

34. Each token has ID of source

35.
Richer Index Structures
Store word position (for phrase matching). Enclosing tag or field: document metadata, database field names, image (which attribute), named anchor text, text markup tags (TEI, Semantic Web). Extracted entities: personal names, companies, geo locations, dates. Anchor text from incoming links: can be very descriptive; add to index as if part of the target document.

36. Example Inverted Index Structure
For each word: document ID, position, tag name. For each document ID: title, URL, description.

37. Indexing: Stopwords
Stopwords - very common terms: linguistic (a an the as he she it you new); ubiquitous (names, copyright, click here). Consequences of excluding stopwords: reduces the size of index files; improves recall, finds more matching documents; fails some queries (As You Like It, IT copyright policy); problems matching phrases (“New York University”). Solutions vary: index everything, pay the price in index size; CommonGrams (n-grams of frequent phrases).

42. External search finds over 3,000 pages on site with phrase

43. Indexing: Stemming
Singular query should find plural words & vice versa: shoe <=> shoes, cans <=> can, geese <=> goose. Statistical and probabilistic truncation rules. Linguistic rules: lemmatization - stemming based on part of speech. Stemming before indexing: improve recall (find all forms of a word); reduce index size. Consequences of extreme stemming: short query problems; search for Ran shouldn’t match Run, Lola, Run. Other options: index everything (makes indexes larger and queries slower); new idea, CommonGrams (n-grams of frequent phrases).

44.
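The inverted index structure on slide 36 (word, then document ID, then position) and the phrase-matching use of word positions on slide 35 can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine stores its index: the tokenizer is a deliberately simple lowercase-and-split rule, and the names tokenize, build_index, and phrase_search are hypothetical. No stopwords are removed, so a phrase such as “New York University” still matches.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and break on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to {document ID: [positions of the word in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)


def phrase_search(index, phrase):
    """Return IDs of documents that contain the query words as a consecutive phrase."""
    terms = tokenize(phrase)
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    # Only documents containing every term can contain the phrase.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    hits = []
    for doc_id in sorted(candidates):
        # The phrase matches if term i appears exactly i positions after the first term.
        for start in postings[0][doc_id]:
            if all(start + i in postings[i][doc_id] for i in range(1, len(terms))):
                hits.append(doc_id)
                break
    return hits


docs = {
    1: "Welcome to New York University",
    2: "A new library opened in York",
    3: "York has a new university",
}
index = build_index(docs)
print(phrase_search(index, "New York University"))
```

Storing positions makes the index larger, which is the same trade-off the stopword slide describes: more index space in exchange for queries that would otherwise fail.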
Indexing: Document Store
Minimum: ID (key for inverted index); unique location (URL / file path / record ID). Richer document store: implicit metadata (filename, size, location); explicit metadata (title, date, keywords, author; taxonomy labels, classification, user tagging; language, character set); access control settings; full text of the document (for snippets and caching).

45. Indexing: Dealing with Duplicates
Detecting duplicate documents: exact match is fairly easy (checksums); document similarity check is harder but worth it. Choosing the primary copy: most recent (if reliable); rules based on path or metadata; new web search “canonical” tag. What to do with duplicates: remove from the index (saves space); hide in results unless requested (that’s the Google way).

46. Indexing: Document Dates
HTTP servers lie about dates: frequent wrong settings (1969, 2040); dynamic pages send the current timestamp. File systems lie about dates. Applications lie about dates. Indexers do the best they can: metadata (date tag, property, tag DC.date); extract from page content; checksum to see if file has changed since last index; consider external metadata repository.

47. Search Process Flow

48. Where the Queries Come From
User-entered text in search fields. Search navigation: moving around in results list. Previous searches: may just be repeated clicks on URL; Save Search feature; simplistic alerts. Facet click to add a metadata filter: may re-issue search with additional terms; may be navigational, no text query. Scripts or automated queries: dynamic links (find all pictures by this artist); geographic information systems.

49.
Query Processing Steps
Try to recognize the character set and language. Tokenize the text by language rules: break at spaces and punctuation; same algorithm as index tokenizer. Check for operators: Internet query operators (+ - "quotes"); Boolean operators (AND OR NOT & | !); others (NEAR, parentheses). Check for field names, zones, other filters: for example, title:lunch location=94703. Handle the rare natural language question.

50. Query Expansion
Stemming: dependent on index stemming choices; good to find singular/plural forms. Word similarity searching - increases recall: fuzzy matching; phonetic, soundex, sound-alike; may overwhelm exact matches. Synonym expansion, should be site-specific: bus => coach, ATM => Air Tasking Message.

51. Search: Retrieval, Recall & Precision
Retrieval: finding the documents matching a particular query. Recall: finding every relevant document. Precision: finding only relevant documents. Balance more recall vs. better precision: use search logs and user studies to guide choices. Use precision as part of relevance ranking: top results should be more exact matches.

52. One-Word Text Retrieval
Fast binary search in inverted index: check index updates on disk or in memory; if there are distributed indexes, merge results. Store the related document information in a list: document ID, term frequency in document, term positions in the document. Note: the document list is not yet sorted. Frequent searches may be cached: “short head” vs. “long tail”.

53.
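The recall and precision definitions on slide 51 can be made concrete with a small calculation. The sketch below assumes you already have a hand-judged set of relevant documents for a query, for example from a user study; the function name precision_recall and the sample document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = share of retrieved documents that are relevant
    recall    = share of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# The engine returned four documents; a reviewer judged three documents relevant.
precision, recall = precision_recall(
    retrieved=["doc1", "doc2", "doc3", "doc4"],
    relevant=["doc2", "doc4", "doc5"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Query expansion such as stemming or synonyms tends to raise recall and lower precision, which is why the slide suggests using logs and user studies to choose the balance.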
Multi-Word Text Retrieval
Relationship between words defines results. Boolean AND, + operator, find all default: only documents which contain all terms. Boolean OR operator, find any default: all documents with any term. Boolean NOT, - operator: all documents with the first term but not the next term. Phrase operators, quotes: only documents with the words as a phrase. Also check for zones or field filters. Parentheses: use for order of processing. Merge resulting lists.

54. Relevance Ranking Algorithms
Relevance: the likelihood that an item will fill an information need. Based on documents in retrieval list. Most common algorithm, TF-IDF (term frequency : inverse document frequency): how often is the query word in the document? How often is the word in the index? Other relevance algorithms: vectors and document-query similarity; linguistic analysis and Natural Language Processing; statistical and Bayesian analysis.

55. Relevance Heuristics
Phrase matches for multiple query terms: logs show most multi-word searches are phrases. Query terms found in special sections: title, metadata, top of document. All terms matched in document: even when not relevant, it’s transparent; old systems gave excess weight to single rare terms.

56. More on Relevance
Relevance is task-specific: results can never please all of the people; more like berry-picking than like hunting. Link analysis (PageRank) not very useful: intranet and site links tend to be navigational. Situation-specific adjustments: some areas more likely to be valuable; current content; local content.

57.
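The two TF-IDF questions on slide 54 (how often is the query word in the document, and how often is it in the index) can be turned into a score roughly as follows. This is only one of many TF-IDF variants, chosen here for brevity: term frequency is normalized by document length, and the inverse document frequency is log(N / df) + 1 so that a word found in every document still counts a little. The function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and break on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(docs, query):
    """Return (document ID, score) pairs for matching documents, best first."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    doc_count = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term_freq[term]:
                tf = term_freq[term] / len(tokens)
                idf = math.log(doc_count / doc_freq[term]) + 1.0
                score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "a": "search engine search index",
    "b": "database engine",
    "c": "cooking recipes",
}
for doc_id, score in rank(docs, "engine search"):
    print(doc_id, round(score, 3))
```

Document "a" ranks above "b" because it matches both query words and the rarer word "search" carries more weight. The heuristics on the following slides (phrase matches, title matches, all terms matched) would be layered on top of a base score like this one.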
Federated Search and Relevance
Send query to multiple search engines: may require special syntax; response time often a factor. Receive results in relevance order for each. Display results, two options: separate sections for each search engine; merged single relevance rank list (works if all search indexes are similar, problems where the sources are very different).

58. Retrieval: Access Control
Limit access to search itself: user enters password or other credentials; search only accepts queries when authenticated. Collection-level access control: query filter only retrieves items from allowed groups. Hit-level access control: real-time check for user access on documents; start with most relevant documents; repeat until there are ten (may be slow); display top results, include estimate of how many more; show helpful message if user can’t see any.

59. Search User Experience
Limit user interface complexity. Show the scope of the information covered. Expose query expansion and contraction. Use familiar UI elements. User experience goes beyond interface: index coverage, query syntax, retrieval quality and speed, relevance ranking (first ten are vital).

60. Search Forms Interface
Balance simplicity with functionality. Put a search field in the navigation bar: location should be consistent; longer is better (short fields lead to short queries). Simple Search forms, limit options: zone or section, dates.

61. Search Field Auto-Complete
Dropdown menu of matching words. Base on search logs. Smallish list, 7-10. Most popular. Simple sort: alphabetic; price or size; complete range (preferably lowest to highest).

62.
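The auto-complete advice on slide 61 (base suggestions on search logs, keep the list to 7-10 entries, show the most popular first) can be sketched as below. This is an assumption-laden illustration: it treats the log as a plain list of query strings, matches on prefix only, and breaks popularity ties alphabetically. The name build_suggester and the sample queries are invented for the example.

```python
from collections import Counter


def build_suggester(logged_queries):
    """Count normalized queries from the search log and return a suggest() function."""
    counts = Counter(q.strip().lower() for q in logged_queries if q.strip())

    def suggest(prefix, limit=10):
        """Return up to `limit` logged queries starting with `prefix`, most popular first."""
        prefix = prefix.strip().lower()
        if not prefix:
            return []
        matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
        # Most frequent first; alphabetical order breaks ties.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [q for q, _ in matches[:limit]]

    return suggest


search_log = [
    "vacation policy", "vacation policy", "Vacation Policy",
    "vacation request", "vacation request",
    "vacation calendar", "vpn setup",
]
suggest = build_suggester(search_log)
print(suggest("vac"))
```

A linear scan over the logged queries is fine for a small site; a larger log would more likely use a sorted list or a trie, which is a design choice outside what the slide covers.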
Other Search Interfaces
Heavily researched. Natural language: must keep typing; defining a question is quite hard. Interactive search: guided interviews, but users want immediate results. Avatars do not improve interaction.

63. Simple vs. Advanced Search UI
Most searches are simple: short (one to three words); fewer than 10% use any operators at all (maybe 1%). Even experts prefer simple search: will use advanced tools if simple doesn’t work. Default to simple search, link to advanced search: those are your power users (librarians, techies); expose all possible options; don’t spend huge resources on advanced UI. Exploratory search is different.

64. Advanced Search Fits Sometimes
eBay: high motivation, complex search requirements, frequent use. UX testing still required.

65. Search Results: Page Elements
Site context: general page layout, navigation links; colors and design elements. Results header: a search field, with the current search terms; retrieval information - how many hits. Results list in relevance order: each result item with at least a linked title. Facets: dynamic links for filtering results. Results footer.

68. Search Results: Not-So-Good Example
Site page has navigation, colors: search results should too.

69. Search Results: Visualization
Fascinating to look at, great demos: star charts, topographical displays, interactive fly-throughs, hyperbolic trees. Require significant resources to run. Good for exploratory & comprehensive research: finding unexpected synergies. Simple search is much cheaper for casual users.

70.
Search Results: Header Elements
Search field, with the current query: users often edit to be more or less restrictive. Number of results found. A few search options: Match Any Word / All Words / Exact Phrase; filter by date option (if trustworthy); search zones. Results navigation. Best Bets. Spelling suggestions.

71. Search Results: Hits and Pages
Show number of items matched: be accurate; do not give estimates for small numbers (Google and SharePoint are bad this way). Pagination - results list navigation: helps user calibrate content; important for exploratory search; follow web search conventions, for example < previous 1 2 3 4 ... 26 next >; be accurate.

74. Best Bets Example
Best Bets are very clear. Would not come first in normal search results.

75. Search Results: List Sorting
List of links to items matching the query. Sorted by matching terms: impossible to be relevant to every query; variety of sources when possible; transparency (why these items in this order). Other sort orders - make very visible: by author’s last name, by date, by price.

76. Search Results: Not Enough Variety

78. Degree icon should be on the left side

79.
Result Items: Elements
Information foraging: show hints about items. Title of document, or name of product. Location (URL, file path, database ID): may need to rewrite to user-accessible URLs; hide location if it’s not meaningful. Distinguishing data: metadata (picture, product code, author name). Show match terms in context (snippets): text before and after query term matches; highlight the matches.

80. Results Items: Not Enough Content

81. Results Items: Too Much Content

82. Results Items: Just Right

83. Results Items: Additional Data
Date (if reliable). Size and file type: avoid surprising launches of Acrobat or other app. Metadata: author, department, brand, product... Access status: password required? Topics and subject headings: taxonomy categories; keywords and concept tags; user tags, folksonomy.

84. Results Items: Rich Items Example

85. Results: Dynamic Clustering
Uses search results text to infer topics: groups by similarity in titles and results text. Particularly good for portals and intranets: unstructured, uncontrolled text. Dynamic, no preprocessing needed. Can supplement categorization and taxonomies.

87. Commerce and Catalog Results
Picture or graphic if possible. Important attributes: price, color, size, compatibility, availability. “Buy” button: simplify process, save time.

88. Online Store Results Example

89.
Multimedia Search
Image, audio, and video files. Audio and visual similarity search still theory. Show context in results: match terms from transcript or OCR; text around image; thumbnails or keyframes.

91. Results: Faceted Metadata
Better than forms for structured text data. Exposes attributes as part of search results. Leverages metadata: topic names, taxonomy; mundane stuff (color, date, size, author...). Choices specifically relating to search results: dynamically generated from metadata; preview numbers offer users confidence in clicking. Supported by extensive usability testing. Used on a majority of large e-commerce sites.

92. Why Faceted Search is Better Than Forms

95. No Matches Queries: Causes
Misspellings and typing errors. Scope problem: nothing for that topic. Vocabulary differences: users may be less precise, or use competitor’s terms; marketers may dominate content. Restrictive search settings: default may only match exact phrase or all words; access control may disallow user. Software/hardware/network failures.

96. No Matches Queries: Responses
Track queries with no matches in logs. Use sessions, surveys & testing to find user intent. Design the no-matches page carefully: explain what is and isn’t on the site; provide useful navigation links. Add search engine help: synonyms, Best Bets, spelling. Add terms to text. Add content, topic pages.

97. No Matches Queries: Spelling Issues
Detect and address common problems: spelling errors, typos, queries without spaces between words. Use site-specific dictionary: easy to build from search index; never suggests any words not on the site. Users are familiar with “Did you mean...?”
98. Good Example of No-Matches Page

100. Search Engine Maintenance
Index maintenance: obsolete content removal; check for new content; track technical problems (bad links, servers down). Search quality: re-run test suite; compare with original results; add new test queries. Track user feedback, surveys. Use metrics and log analysis to catch trends.

101. Metrics for Search Engines
Server uptime. Errors: how often and how serious. Index: size on disk and in memory; number of entries; number and type of indexing errors. Search traffic: queries per minute (60 qpm is common); average clicks on results items per query; average next-page views per query; number and percent of no-match queries.

102. Search Log Analysis
Most frequent query terms: short head (a few very popular terms); long tail of unique queries; lots of junk (URLs, spam, gibberish). Frequent query terms not matched - fix somehow. More esoteric analysis - need a lot of data: frequent query terms with low click-through; frequent query terms with high “next page” clicks. Raw logs: import into database for ad-hoc reports; session analysis can be enlightening.

103. Choosing a Search Engine
Find specific information needs. Analyze content: sources and formats; rough number of pages / records / items. Define platform, API, language requirements. Buy (or use open source), don’t build: user surveys show problems with home-grown. Choose & compare likely candidates: gathering, indexing, retrieval, relevance features; scaling; administration tools; continuing development, support, user groups; price.

105.
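Two of the numbers called for on the metrics and log analysis slides, the most frequent queries and the number and percent of no-match queries, are easy to pull from a log once each entry carries the query text and the hit count. The sketch below assumes exactly that (query, hit_count) shape, which is a simplification; real search logs vary by engine and usually need parsing first. The function name summarize_log and the sample entries are hypothetical.

```python
from collections import Counter


def summarize_log(entries, top_n=5):
    """Summarize (query, hit_count) log entries.

    Returns total queries, the most frequent queries, the most frequent
    queries that matched nothing, and the percentage of no-match queries.
    """
    queries = Counter()
    no_match = Counter()
    total = 0
    for query, hit_count in entries:
        normalized = query.strip().lower()
        if not normalized:
            continue  # skip empty submissions
        total += 1
        queries[normalized] += 1
        if hit_count == 0:
            no_match[normalized] += 1
    no_match_total = sum(no_match.values())
    return {
        "total_queries": total,
        "top_queries": queries.most_common(top_n),
        "top_no_match": no_match.most_common(top_n),
        "no_match_percent": 100.0 * no_match_total / total if total else 0.0,
    }


entries = [
    ("vacation", 12), ("Vacation", 12), ("vacation ", 12),
    ("benifits", 0), ("benifits", 0),
    ("vpn", 3),
]
report = summarize_log(entries)
print(report["top_queries"][0], report["top_no_match"], round(report["no_match_percent"], 1))
```

Here the misspelled "benifits" surfaces as the top no-match query, the kind of entry the slides suggest fixing with spelling suggestions, synonyms, or Best Bets.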
Content Inventory
Work with Information Architects: use existing taxonomies and catalogs. Learn what you have: simple static HTML pages; other formats such as PDF and Office documents (which version); CMS, document management, publishing systems; databases and legacy systems; multimedia audio and video files. Identify more and less valuable data: some content should be in archives.

106. Search Engine Deployment Types
Software: controlled by local IT; flexible installation; open-source - several high quality packages. Search appliances: server hardware/software combinations; require very little technical attention; check development and backup server pricing. Remote search services (SaaS): index using robot spiders or remote access; query goes to service, results go back to user; low network, hosting, IT load.

107. Scaling Search to Millions & Billions
What are the largest installations for each? Talk to them before committing. Cache frequent queries. Add query servers, automated load balancing. Indexing at scale: indexing on dedicated servers; deal with new calls for near-real-time indexing; distribute multiple clones of indexes; segment indexes, parallel lookups, merge results.

108. Testing Search Indexing
Choose 3-4 good candidates. Index as much content as possible: watch the robot, track errors; try to index tricky data sources; compare coverage among them. Test index scaling: make a really big index based on expected use; speed of add / update / delete; responsiveness during big update.

109.
Evaluating Search Results
Create a query test suite: use existing search logs if possible; short, long, unusual, common (check cache); simple and complex queries; spelling, typing and vocabulary errors; many matches, few matches, no matches. Perform searches against the test engines: save results pages as HTML for later checking. Analyze differences among them: retrieval (and indexing), what’s found? Relevance, are the top results good ones?

110. Search: Not a Black Box
Simple search solves many enterprise problems: dynamic access to local content; familiar interface, expectations; user vocabulary. Understand the real information needs: index the right stuff; work with content providers and IAs; link to specialty research engines. Learn from users over time, make it better.

Editor’s notes
http://www.slideshare.net/bdelacretaz/beyond-fulltext-searches-with-lucene-and-solr
Great book: “Search User Interfaces” by Marti Hearst
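For the “analyze differences among them” step of the evaluation slide, one simple starting point is to compare the top results two candidate engines return for the same test query. The sketch below is an assumption about how such a comparison might be done, not a method from the workshop: it takes two ranked lists of URLs (however you extracted them from the saved results pages), reports which items are shared and which are unique, and computes a Jaccard overlap. The name top_k_overlap and the sample URLs are invented.

```python
def top_k_overlap(results_a, results_b, k=10):
    """Compare the top-k results of two engines for the same query."""
    top_a, top_b = results_a[:k], results_b[:k]
    shared = set(top_a) & set(top_b)
    union = set(top_a) | set(top_b)
    return {
        "shared": [url for url in top_a if url in shared],
        "only_a": [url for url in top_a if url not in shared],
        "only_b": [url for url in top_b if url not in shared],
        # Jaccard overlap: 1.0 means identical sets, 0.0 means nothing in common.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


engine_a = ["/hr/vacation", "/hr/benefits", "/it/vpn"]
engine_b = ["/hr/benefits", "/hr/vacation", "/news/2009"]
comparison = top_k_overlap(engine_a, engine_b)
print(comparison)
```

A low overlap does not say which engine is better. It only points to the queries where a person needs to look at the results and judge whether the top items are good ones.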