Fundamentals Of Search

The Fundamentals of Enterprise Search KMWorld 2009 Avi Rappoport, Search Tools Consulting www.searchtools.com consult9@searchtools.com www.searchtools.com/slides/kmw09/fundamentals-of-search.html

What’s In This Workshop Overview of enterprise search, in context Search engine processes Robot spiders, database access Indexing Security Query parsing, retrieval, and relevance ranking Usable search interfaces. Maintenance and Analytics Methods for choosing a good search engine Fundamentals of Search Engines 2009 / © Avi Rappoport, www.searchtools.com

About SearchTools Avi Rappoport is a librarian (MLIS from Berkeley) Software developer and product manager User interface designer Long-time search consultant Editor & Publisher, www.searchtools.com Search Tools Consulting Search needs analysis and recommendations Enterprise search evaluation Outsourced search administration

Defining Enterprise Search Large scale web site search Corporate sites Institutional sites Online stores Intranet search Crossing departmental lines Opening data silos Extranets Portal Search

Similarities to Webwide Search Robot crawlers HTML over HTTP Scaling to millions of items Distributed processing Full-text indexing of content Simple query language Relevance ranking of results TF-IDF (term frequency : inverse document frequency) Familiar results list

Differences from Web Search Limited scope A site, set of sites, extranet, or intranet Few meaningful hyperlinks PageRank and link analysis are less useful Security and access control issues Content in databases, CMSs, etc. More control Index update scheduling Some content is very valuable, other content is not No search spam

Text Search vs. Database Search Indexes multiple content sources Database fields, files, web pages, feeds... Simple search commands instead of SQL Flexible indexing and retrieval Relevance ranking (this is a major issue) Does not compete for database resources Easy to scale separately from DBMS New features: spellcheck, auto complete, facets Works in the real world, from eBay to Google

Search and Information Architecture Information Architecture The art and science of organizing information for access and use. IA work enriches search Creates order and systems Provides standard vocabulary Removes ROT (redundant, obsolete, trivial) Search supplements IA Supports user vocabularies Changes dynamically with new content

Search and Taxonomy Taxonomy creates categories Labels and metadata Improves quality of search results Additional metadata extremely valuable Search crosses categories Bypasses ambiguous topic labels Useful for novices Supports user vocabulary Dynamic updates for new topics

Search & Knowledge Management KM is: “The process through which organizations generate value from their intellectual and knowledge-based assets.” (CIO Magazine) Organizes information, processes and people Offers collaboration and archiving tools Attempts to regularize implicit knowledge Search mostly matches words

Two Main Types of Search Known-item search Short queries “Good-enough” answers Exploratory search Research - finding unknowns Scientific, legal, medical, business, sales Conceptual overviews Completeness - all possible relevant items Law enforcement Medicine

Search as an Iceberg All people see are the search box and results list Invisible functionality Indexes Query processing Retrieval Relevance ranking Search is a mystery But it’s just software

Elements of Search Engines Automated tools to collect content Specialized storage for quick retrieval Query processing and expansion Retrieval (matching query to index content) Relevance ranking Search results interfaces Analytics, metrics and maintenance

Choosing Content To Index Information sites Consider indexing every single page Use search indexing as a discovery mechanism Online stores, catalogs Product information: cost, color, size, materials Other: return policies, CEO’s name, jobs listing Intranets Intranet portal and core servers May need archive servers and search Multimedia: images, audio, video Metadata at least

(Near) Real Time Indexing Twitter has changed expectations Even in intranets Index must support partial updates Search engines finding limits at scale Distribute indexing and indexes Trigger index updates (push vs. pull) Continuous feed Send web service message Database trigger Update watched URLs with new links

Indexing and Security Search can undermine “security by obscurity” One link can expose a whole set of documents Work with your security team List areas which contain sensitive content Define words which trigger further analysis Create a process for removing sensitive data Indexing encrypted content Search engine uses SSL client for indexing Encrypt search results before returning Physical security on search servers

Search and Access Control Authentication and authorization in indexing “Basic authentication” - user name and password NT Security integration ACLs and single sign-on Conform to security rules during indexing Keep access control info as part of document store Showing results - who can see what? Access to search engine itself Collection-level access control Locked results as teaser for subscription Hit-level access control Check before displaying results

Indexing: Sources of Content Web sites Intranets Extranets Blogs Wikis Mailing list archives & email public folders File systems & shared servers NFS, SMB, AFP, GFS, ftp, WebDAV Content Management Systems Databases Legacy programs in silos

Indexing: Robot Spiders Start with base URL for all hosts For each page, repeat Read text into internal format Save document in cache Save words into index Extract all links and check the rules If they are new URLs, add them to the list
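
The spider loop above can be sketched in a few lines. This is a minimal illustration, not any particular product's crawler: the SITE dictionary stands in for real HTTP fetching, and the allowed() rule and all URLs are made-up assumptions.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page text, outgoing links).
# A real spider would fetch pages over HTTP and parse the HTML instead.
SITE = {
    "http://example.com/": ("Home page", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("Page A", ["http://example.com/b", "http://other.org/x"]),
    "http://example.com/b": ("Page B", ["http://example.com/", "http://example.com/private/c"]),
    "http://example.com/private/c": ("Secret", []),
}

def allowed(url):
    """Assumed crawl rules: stay on one host and skip the /private/ area."""
    return url.startswith("http://example.com/") and "/private/" not in url

def crawl(base_url, fetch=SITE.get):
    queue = deque([base_url])   # URLs still to visit
    seen = {base_url}           # URLs already queued, to avoid loops
    cache = {}                  # url -> document text
    index = {}                  # word -> set of urls containing it
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:        # missing page: nothing to index
            continue
        text, links = page
        cache[url] = text                           # save document in cache
        for word in text.lower().split():           # save words into index
            index.setdefault(word, set()).add(url)
        for link in links:                          # extract links, check rules
            if allowed(link) and link not in seen:  # only new URLs are added
                seen.add(link)
                queue.append(link)
    return cache, index

cache, index = crawl("http://example.com/")
print(sorted(cache))
```

The seen set is what keeps the spider from revisiting pages that link back to each other.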

Robot Indexing Spider

Common Problems With Robots Pages that are not linked from anywhere Spider disallowed by robots.txt or robots meta URLs with ? and & (all robots should handle these now) JavaScript, forms, and interactive dynamic links Some robots can handle some of these Session IDs that change Duplicate detection Multiple views of the same data (Lotus, wikis) Symbolic links & bad redirects Multiple copies of files or directories

Indexing: Other Data Sources RSS feeds: nice clean text File servers: SMB, file:/// etc. Content / Document Management Systems Email archives Databases via ODBC, JDBC, Oracle API Full-text content Metadata: library catalog records, yellow pages External sources using APIs (application programming interfaces) News feeds (Reuters, AP) Twitter

Indexing: Text Files Plain text is easy RTF export format text easy to find HTML semi-structured text Content is between tags and in attributes Generated by JavaScript - hard to extract Bad HTML, especially missing closing tags XML files (structured) Many tags are document-level Content is between tags and in attributes Complex tag hierarchy TEI (Text Encoding Initiative) & Semantic Web XQuery and XPath tools

Indexing: Binary File Formats PDF Scanned, may not have any text Bad PDF generators break words at columns “Shadow” text effect duplicates letters SWF and Flash: API may not load dynamic text Office documents Word processing files (may have hidden text from revisions) Spreadsheets (hard to know what to grab) Presentations Note: new docx, xlsx, pptx are really XML file sets CAD and project files Metadata (properties, Adobe XMP)

Indexing: Tokenizing Lowercase all characters (aka ‘folding’) Tokenizing makes words searchable Break on punctuation and spaces Recognize special words: C++ @ [TS] Typography issues: ﬆ is really “st” HTML escaped text: m&ouml;chten = möchten Special cases for structured strings Numbers, Prices, Dates N-grams - an alternate approach Break into short text patterns Takes a lot of index space
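
As a rough sketch of these tokenizing steps, the function below unescapes HTML entities, normalizes ligatures, lowercases, and breaks on spaces and punctuation while protecting a few special words. The SPECIAL set and the punctuation list are illustrative assumptions; a production tokenizer would need language-specific rules.

```python
import html
import re
import unicodedata

# Hypothetical list of terms that must survive punctuation stripping.
SPECIAL = {"c++", "c#"}

def tokenize(text):
    text = html.unescape(text)                  # m&ouml;chten -> möchten
    text = unicodedata.normalize("NFKC", text)  # ligature U+FB06 -> "st"
    text = text.lower()                         # case folding
    tokens = []
    for chunk in text.split():                  # break on whitespace
        stripped = chunk.strip(".,;:!?\"'()")   # drop surrounding punctuation
        if stripped in SPECIAL:
            tokens.append(stripped)             # keep special words whole
        else:
            tokens.extend(re.findall(r"\w+", stripped))  # break on punctuation
    return tokens

def char_ngrams(token, n=3):
    """Alternate approach: overlapping character n-grams of one token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]

print(tokenize("Wir m&ouml;chten C++, fast!"))
print(char_ngrams("search"))
```

The same tokenizer should be applied to queries, otherwise query terms will not line up with the index terms. The n-gram output also shows why that approach costs index space: one six-letter word becomes four entries.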

Indexing: Character Set Issues World has many charsets (aka scripts, alphabets) English has a simple alphabet: 26 letters, 10 numbers Other Roman languages: extended (ç, î, ß) Non-Roman one byte: Cyrillic, Arabic, Hebrew Asian two bytes: Chinese, Japanese, Korean Identifying character sets Unicode characters Older usage: language “code pages” HTTP header or <META http-equiv> Statistical detection techniques

Indexing: Language Issues Text search works across languages Simple pattern-matching, query to index Language-specific indexing improves search Tokenizing using appropriate rules Compound nouns (kindergarten) Language rules for stemming Singular version of thés is thé Language detection Trusted tags Bilingual dictionaries Statistical matches, n-grams Documents may have mixed languages…

Indexing: Multimedia Images, photos, drawings, sound, scores, video External metadata File name Link text, surrounding words Internal metadata ID3 tags for music EXIF and other digital photo information Subtitles (sometimes) Content OCR to extract graphic text and closed captions Audio: Speech-to-text conversion, still buggy Use human judgment not just automated systems

Inverted Index Diagram Lots of IR research shows this Tokens not in paragraph order, thus, inverted Each token has ID of source

Richer Index Structures Store word position (for phrase matching) Enclosing tag or field Document metadata Database field names Image (which attribute) Named anchor text Text markup tags (TEI, Semantic Web) Extracted entities Personal names, companies, geo locations, dates Anchor text from incoming links Can be very descriptive Add to index as if part of the target document

Example Inverted Index Structure For each word Document ID Position Tag name For each document ID Title URL Description
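
A small sketch of the structure on this slide, assuming plain Python dictionaries: one table maps each word to (document ID, position, tag name) entries, and a second maps each document ID to its title, URL, and description. The sample documents and field names are invented for illustration.

```python
from collections import defaultdict

# Made-up sample documents for illustration only.
docs = {
    1: {"title": "Search basics", "url": "/basics", "body": "search engines index words"},
    2: {"title": "Index tuning", "url": "/tuning", "body": "tuning the index improves search"},
}

postings = defaultdict(list)   # word -> [(doc_id, position, tag), ...]
doc_store = {}                 # doc_id -> title, url, description

for doc_id, doc in docs.items():
    doc_store[doc_id] = {
        "title": doc["title"],
        "url": doc["url"],
        "description": doc["body"][:40],   # assumed: description = start of body
    }
    for tag in ("title", "body"):
        # Position is the word offset inside the enclosing tag.
        for position, word in enumerate(doc[tag].lower().split()):
            postings[word].append((doc_id, position, tag))

print(postings["index"])
# [(1, 2, 'body'), (2, 0, 'title'), (2, 2, 'body')]
```

Keeping the tag name with each posting is what later lets the ranking step weight a title match above a body match.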

Indexing: Stopwords Stopwords - very common terms Linguistic (a an the as he she it you new) Ubiquitous (names, copyright, click here) Consequences of excluding stopwords: Reduces the size of index files Improves recall, finds more matching documents Fails some queries As You Like It, IT copyright policy Problems matching phrases: “New York University” Solutions vary: Index everything, pay the price in index size CommonGrams: n-grams of frequent phrases

Stopwords Problems: Example Searching wordpress.com for whatever will be External search finds over 3,000 pages on site with phrase

Indexing: Stemming Singular query should find plural words & vice versa Shoe <=> shoes, cans <=> can, geese <=> goose Statistical and probabilistic truncation rules Linguistic rules Lemmatization - stemming based on part of speech Stemming before indexing Improve recall: find all forms of a word Reduce index size Consequences of extreme stemming Short query problems Search for Ran shouldn’t match Run, Lola, Run Other options Index everything (makes indexes larger and queries slower) New idea: CommonGrams (n-grams of frequent phrases)

Indexing: Document Store Minimum ID (key for inverted index) Unique location (URL / file path / record ID) Richer document store Implicit metadata: filename, size, location Explicit metadata Title, date, keywords, author Taxonomy labels, classification, user tagging Language, character set Access control settings Full text of the document For snippets and caching

Indexing: Dealing with Duplicates Detecting duplicate documents Exact match is fairly easy: checksums Document similarity check: harder but worth it Choosing the primary copy Most recent (if reliable) Rules based on path or metadata New web search “canonical” tag What to do with duplicates Remove from the index: saves space Hide in results unless requested That’s the Google way
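
The two detection approaches can be contrasted in a short sketch. Exact duplicates are caught with a checksum; near duplicates are estimated here with Jaccard similarity over word shingles, which is one common technique and not necessarily what any given engine uses. The shingle size of three is an arbitrary assumption.

```python
import hashlib

def checksum(text):
    """Exact-match fingerprint: identical text gives an identical digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text, k=3):
    """Set of overlapping k-word sequences from the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def similarity(a, b):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

original = "the quick brown fox jumps over the lazy dog"
edited = "the quick brown fox jumps over the lazy cat"

print(checksum(original) == checksum(edited))   # False: checksums miss near copies
print(similarity(original, edited))             # 0.75
```

A threshold on the similarity score (chosen by testing on real content) would decide when two documents count as copies of each other.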

Indexing: Document Dates HTTP servers lie about dates Frequent wrong settings: 1969, 2040 Dynamic pages send the current timestamp File systems lie about dates Applications lie about dates Indexers do the best they can Metadata (date tag, property, tag DC.date) Extract from page content Checksum to see if file has changed since last index Consider external metadata repository

Search Process Flow

Where the Queries Come From User-entered text in search fields Search navigation: moving around in results list Previous searches May just be repeated clicks on URL Save Search feature Simplistic alerts Facet click to add a metadata filter May re-issue search with additional terms May be navigational, no text query Scripts or automated queries Dynamic links (find all pictures by this artist) Geographic information systems

Query Processing Steps Try to recognize the character set and language Tokenize the text by language rules Break at spaces and punctuation Same algorithm as index tokenizer Check for operators Internet Query Operators: + - "quotes" Boolean Operators: AND OR NOT & | ! Others: NEAR, (parentheses) Check for field names, zones, other filters Example: title:lunch location=94703 Handle the rare natural language question
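
To make the operator and field checks concrete, here is a minimal parser sketch. It handles only the + and - prefixes, quoted phrases, and field:value filters; Boolean keywords, NEAR, parentheses, and the field=value form are left out. The regular expression and the output layout are assumptions made for this example.

```python
import re

# Optional sign, optional "field:" prefix, then a quoted phrase or a bare word.
TOKEN = re.compile(r'([+-]?)(?:(\w+):)?(?:"([^"]+)"|(\S+))')

def parse_query(query):
    parsed = {"required": [], "excluded": [], "terms": [], "filters": {}}
    for sign, field, phrase, word in TOKEN.findall(query):
        term = (phrase or word).lower()      # fold case like the index tokenizer
        if field:
            parsed["filters"][field.lower()] = term   # e.g. title:lunch
        elif sign == "+":
            parsed["required"].append(term)  # +term: must be present
        elif sign == "-":
            parsed["excluded"].append(term)  # -term: must be absent
        else:
            parsed["terms"].append(term)     # plain word or quoted phrase
    return parsed

print(parse_query('title:lunch +menu -dinner "New York" cafe'))
```

Quoted phrases stay together as one term so that the retrieval step can check word positions.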

Query Expansion Stemming Dependent on index stemming choices Good to find singular/plural forms Word similarity searching - increases recall Fuzzy matching Phonetic, soundex, sound-alike May overwhelm exact matches Synonym expansion, should be site-specific bus => coach, ATM => Air Tasking Message

Search: Retrieval, Recall & Precision Retrieval Finding the documents matching a particular query Recall Finding every relevant document Precision Finding only relevant documents Balance more recall vs. better precision Use search logs and user studies to guide choices Use precision as part of relevance ranking Top results should be more exact matches
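
Recall and precision are simple ratios once a set of relevant documents has been judged for a query. The sketch below uses the standard definitions; the document IDs are invented.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents returned, two of them relevant, eight relevant overall.
print(precision_recall([1, 2, 3, 4], [2, 4, 5, 6, 7, 8, 9, 10]))  # (0.5, 0.25)
```

Returning more documents tends to raise recall and lower precision, which is the balance the slide refers to.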

One-Word Text Retrieval Fast binary search in inverted index Check index updates on disk or in memory If there are distributed indexes, merge results Store the related document information in a list Document ID Term frequency in document Term positions in the document Note: The document list is not yet sorted Frequent searches may be cached “Short head” vs. “long tail”

Multi-Word Text Retrieval Relationship between words defines results Boolean AND, + operator, find all default Only documents which contain all terms Boolean OR operator, find any default All documents with any term Boolean NOT, - operator All documents with the first term but not next term Phrase operators, quotes Only documents with the words as a phrase Also check for zones or field filters Parentheses: use for order of processing Merge resulting lists
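
These operators map naturally onto set operations over posting lists. The sketch below assumes a tiny positional index held in memory (word -> document ID -> word positions); the contents are invented, and real engines merge sorted lists on disk instead.

```python
# Hypothetical positional index: word -> {doc_id: [positions]}.
positional_index = {
    "new":  {1: [0], 2: [3]},
    "york": {1: [1], 2: [0], 3: [2]},
    "city": {3: [0]},
}

def docs_with(word):
    return set(positional_index.get(word, {}))

def match_all(words):
    """Boolean AND: documents containing every term."""
    sets = [docs_with(w) for w in words]
    return set.intersection(*sets) if sets else set()

def match_any(words):
    """Boolean OR: documents containing any term."""
    return set().union(*(docs_with(w) for w in words))

def match_not(first, second):
    """Boolean NOT: documents with the first term but not the second."""
    return docs_with(first) - docs_with(second)

def match_phrase(words):
    """Phrase: all terms present at consecutive positions."""
    results = set()
    for doc_id in match_all(words):
        starts = positional_index[words[0]][doc_id]
        if any(all(start + i in positional_index[w][doc_id]
                   for i, w in enumerate(words))
               for start in starts):
            results.add(doc_id)
    return results

print(match_all(["new", "york"]))     # {1, 2}
print(match_phrase(["new", "york"]))  # {1}
```

Document 2 contains both words but not next to each other, so it satisfies AND and fails the phrase test.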

Relevance Ranking Algorithms Relevance The likelihood that an item will fill an information need Based on documents in retrieval list Most common algorithm: TF:IDF (Term frequency : inverse document frequency) How often is the query word in the document? How often is the word in the index? Other relevance algorithms Vectors and document-query similarity Linguistic analysis and Natural Language Processing Statistical and Bayesian analysis
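
As an illustration of the TF-IDF idea, the sketch below scores documents with one common textbook variant: term frequency normalized by document length, multiplied by log(N / document frequency). Real engines use many different weightings, so treat the exact formula and the sample documents as assumptions.

```python
import math
from collections import Counter

# Invented sample collection.
docs = {
    "d1": "search engine index search",
    "d2": "database index",
    "d3": "search results page",
}

def tf_idf_scores(query, docs):
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = Counter()
    for term in query.lower().split():
        # Document frequency: how many documents contain the term at all.
        df = sum(1 for words in tokenized.values() if term in words)
        if df == 0:
            continue                      # term not in the index
        idf = math.log(n_docs / df)       # rare terms get a higher weight
        for doc_id, words in tokenized.items():
            if not words:
                continue
            tf = words.count(term) / len(words)   # frequency within the document
            if tf:
                scores[doc_id] += tf * idf
    return scores.most_common()           # highest score first

for doc_id, score in tf_idf_scores("search engine", docs):
    print(doc_id, round(score, 3))
```

Here d1 ranks first because it repeats "search" and is the only document containing the rarer word "engine"; d2 matches neither term and is not scored.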

Relevance Heuristics Phrase matches for multiple query terms Logs show most multi-word searches are phrases Query terms found in special sections Title Metadata Top of document All terms matched in document Even when not relevant, it’s transparent Old systems gave excess weight to single rare terms

More on Relevance Relevance is task-specific Results can never please all of the people More like berry-picking than like hunting Link analysis (PageRank) not very useful Intranet and site links tend to be navigational Situation-specific adjustments Some areas more likely to be valuable Current content Local content

Federated Search and Relevance Send query to multiple search engines May require special syntax Response time often a factor Receive results in relevance order for each Display results, two options Separate sections for each search engine Merged single relevance rank list Works if all search indexes are similar Problems where the sources are very different

Retrieval: Access Control Limit access to search itself User enters password or other credentials Search only accepts queries when authenticated Collection-level access control Query filter only retrieves items from allowed groups Hit-level access control Real-time check for user access on documents Start with most relevant documents Repeat until there are ten (may be slow) Display top results, include estimate of how many more Show helpful message if user can’t see any
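
The hit-level check can be sketched as a loop over the ranked list that stops once a page of permitted results is filled. The can_read callback stands in for whatever real-time permission check the security system provides; its name and signature are assumptions for this example.

```python
def visible_results(ranked_doc_ids, user, can_read, page_size=10):
    """Walk the ranked list, keeping only documents the user may see.

    Returns the visible page plus the number of lower-ranked documents
    that were never checked, which can feed an "about N more" estimate.
    """
    visible = []
    checked = 0
    for doc_id in ranked_doc_ids:           # start with most relevant
        checked += 1
        if can_read(user, doc_id):          # real-time access check (may be slow)
            visible.append(doc_id)
            if len(visible) == page_size:   # stop once the page is full
                break
    unchecked = len(ranked_doc_ids) - checked
    return visible, unchecked

# Toy permission rule: this user may read only even-numbered documents.
page, unchecked = visible_results(list(range(1, 31)), "pat",
                                  lambda user, doc_id: doc_id % 2 == 0)
print(page, unchecked)
```

In this toy run twenty documents are checked to fill ten slots, which shows why the slide warns that hit-level checking may be slow. An empty page is the cue for the helpful message.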

Search User Experience Limit user interface complexity Show the scope of the information covered Expose query expansion and contraction Use familiar UI elements User experience goes beyond interface Index coverage Query syntax Retrieval quality and speed Relevance ranking (first ten are vital)

Search Forms Interface Balance simplicity with functionality Put a search field in the navigation bar Location should be consistent Longer is better: short fields lead to short queries Simple Search forms: limit options Zone or section Dates

Search Field Auto-Complete Dropdown menu of matching words Base on search logs Smallish list, 7-10 Most popular Simple sort Alphabetic Price or size Complete range (preferably lowest to highest)
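
A log-based suggester along these lines can be sketched with a frequency count and a prefix filter. The sample log, the limit of seven, and the tie-breaking rule (alphabetical) are assumptions for illustration; large logs would call for a trie or a sorted prefix index.

```python
from collections import Counter

def build_suggester(logged_queries, limit=7):
    # Count normalized queries from the search log, ignoring empty ones.
    counts = Counter(q.strip().lower() for q in logged_queries if q.strip())

    def suggest(prefix):
        prefix = prefix.strip().lower()
        if not prefix:
            return []
        matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
        # Most popular first; alphabetical order breaks ties.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [q for q, _ in matches[:limit]]

    return suggest

suggest = build_suggester(
    ["printer", "printer drivers", "printer", "print queue", "pricing", "Printer"]
)
print(suggest("prin"))  # ['printer', 'print queue', 'printer drivers']
```

Because suggestions come from past queries, every suggestion is something users have actually searched for on this site.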

Other Search Interfaces Heavily researched Natural language Must keep typing Defining a question is quite hard Interactive search Guided interviews But users want immediate results Avatars do not improve interaction

Simple vs. Advanced Search UI Most searches are simple Short: one to three words Fewer than 10% use any operators at all (maybe 1%) Even experts prefer simple search Will use advanced tools if simple doesn’t work Default to simple search, link to advanced search Those are your power users: librarians, techies Expose all possible options Don’t spend huge resources on advanced UI Exploratory search is different

Advanced Search Fits Sometimes eBay High motivation Complex search requirements Frequent use UX testing still required

Search Results: Page Elements Site context General page layout, navigation links Colors and design elements Results header A search field, with the current search terms Retrieval information - how many hits Results list in relevance order Each result item with at least a linked title Facets: dynamic links for filtering results Results footer

Search Results: Good Example Full but readable content blocks Site look-and-feel Navigation Familiar search results elements

Search Results: Not-So-Good Example Site page has navigation, colors: search results should too

Search Results: Visualization Fascinating to look at, great demos Star charts Topographical displays Interactive fly-throughs Hyperbolic trees Require significant resources to run Good for exploratory & comprehensive research Finding unexpected synergies Simple search is much cheaper for casual users

Search Results: Header Elements Search field, with the current query Users often edit to be more or less restrictive Number of results found A few search options Match Any Word / All Words / Exact Phrase Filter by date option (if trustworthy) Search zones Results navigation Best Bets Spelling suggestions

Search Results: Hits and Pages Show number of items matched Be accurate Do not give estimates for small numbers (Google and SharePoint are bad this way) Pagination - results list navigation Helps user calibrate content Important for exploratory search Follow web search conventions, example < previous 1 2 3 4 ... 26 next > Be accurate

Search Results: “Best Bets” aka Search Suggestions, QuickLinks, KeyMatch, Recommendations Special-case links for problem queries Internal topic landing pages External sites when appropriate New and better query to search Discover problems from users, log analysis “Short head” - few very popular query terms Allocate resources to keep them current

Best Bets Example Best Bets are very clear Would not come first in normal search results

Search Results: List Sorting List of links to items matching the query Sorted by matching terms Impossible to be relevant to every query Variety of sources when possible Transparency: why these items in this order Other sort orders - make very visible By author’s last name By date By price

Search Results: Weird Sort Sorted by: “Degrees away” Labels too subtle

Result Items: Elements Information foraging: show hints about items Title of document, or name of product Location: URL, file path, database ID May need to rewrite to user-accessible URLs Hide location if it’s not meaningful Distinguishing data Metadata: picture, product code, author name Show match terms in context (snippets) Text before and after query term matches Highlight the matches
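
Showing match terms in context can be sketched as: find the first query term in the stored text, cut a window around it, and mark the matches. The window width and the ** markers are arbitrary assumptions; a web interface would emit HTML highlighting instead.

```python
import re

def snippet(text, query_terms, width=30):
    """Return text around the first match, with matches wrapped in ** **."""
    if not query_terms:
        return text[:2 * width]
    pattern = re.compile("|".join(re.escape(t) for t in query_terms), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return text[:2 * width]          # no match: fall back to the opening text
    start = max(match.start() - width, 0)
    end = min(match.end() + width, len(text))
    excerpt = text[start:end]            # text before and after the match
    highlighted = pattern.sub(lambda m: "**" + m.group(0) + "**", excerpt)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + highlighted + suffix

page_text = "Robots follow links and save words into the index for later retrieval."
print(snippet(page_text, ["index"], width=10))
```

This is why the document store keeps the full text: snippets are built from the stored copy at query time.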

Results Items: Additional Data Date (if reliable) Size and File type Avoid surprising launches of Acrobat or other app. Metadata Author, department, brand, product... Access status: password required? Topics and subject headings Taxonomy categories Keywords and concept tags User tags, folksonomy

Results: Dynamic Clustering Uses search results text to infer topics Groups by similarity in titles and results text Particularly good for portals and intranets Unstructured, uncontrolled text Dynamic, no preprocessing needed Can supplement categorization and taxonomies

Commerce and Catalog Results Picture or graphic if possible Important attributes Price Color Size Compatibility Availability “Buy” button Simplify process, save time

Multimedia Search Image, audio, and video files Audio and visual similarity search still theory Show context in results Match terms from transcript or OCR Text around image Thumbnails or keyframes

Results: Faceted Metadata Better than forms for structured text data Exposes attributes as part of search results Leverages metadata Topic names, taxonomy Mundane stuff: color, date, size, author... Choices specifically relating to search results Dynamically generates from metadata Preview numbers offer users confidence in clicking Supported by extensive usability testing Used on a majority of large e-commerce sites
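
Facet counts are generated from the metadata of the current result set, which is what produces the preview numbers. The sketch below assumes results are dictionaries with simple metadata fields; the products and field names are invented.

```python
from collections import Counter

# Invented result set for a query such as "shirt hat".
results = [
    {"title": "Red shirt", "color": "red", "size": "M"},
    {"title": "Red hat", "color": "red", "size": "L"},
    {"title": "Blue shirt", "color": "blue", "size": "M"},
]

def facet_counts(results, fields):
    """For each facet field, count how many results carry each value."""
    return {
        field: Counter(r[field] for r in results if field in r)
        for field in fields
    }

def apply_facet(results, field, value):
    """Clicking a facet link filters the results on that metadata value."""
    return [r for r in results if r.get(field) == value]

print(facet_counts(results, ["color", "size"]))
print([r["title"] for r in apply_facet(results, "color", "red")])
```

Because the counts come from the matching documents only, a facet value never leads to an empty result list.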

No Matches Queries: Causes Misspellings and typing errors Scope problem: nothing for that topic Vocabulary differences Users may be less precise, or use competitor’s terms Marketers may dominate content Restrictive search settings Default may only match exact phrase or all words Access control may disallow user Software/hardware/network failures

No Matches Queries: Responses Track queries with no matches in logs Use sessions, surveys & testing to find user intent Design the no-matches page carefully Explain what is and isn’t on the site Provide useful navigation links Add search engine help Synonyms Best Bets Spelling Add terms to text Add content, topic pages

No Matches Queries: Spelling Issues Detect and address common problems Spelling errors Typos Queries without spaces between words Use site-specific dictionary Easy to build from search index Never suggests any words not on the site Users are familiar with “Did you mean...?”
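
A site-specific spelling suggester can be approximated with the index vocabulary and a string-similarity match. This sketch uses difflib.get_close_matches from the Python standard library; the vocabulary list and the 0.75 cutoff are assumptions, and it does not handle queries with missing spaces.

```python
import difflib

def did_you_mean(query, index_vocabulary, cutoff=0.75):
    """Suggest a corrected query using only words found in the index.

    Returns None when every word is already known or nothing is close.
    """
    suggestions = []
    changed = False
    for word in query.lower().split():
        if word in index_vocabulary:
            suggestions.append(word)          # already a known site term
            continue
        close = difflib.get_close_matches(word, index_vocabulary, n=1, cutoff=cutoff)
        if close:
            suggestions.append(close[0])      # closest indexed word
            changed = True
        else:
            suggestions.append(word)          # leave unknown words alone
    return " ".join(suggestions) if changed else None

# Hypothetical vocabulary extracted from the search index.
vocabulary = ["index", "search", "spider", "taxonomy"]
print(did_you_mean("serach taxonmy", vocabulary))  # search taxonomy
```

Since candidates come only from the index, a suggested query is built from words that exist somewhere on the site.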

Empty Searches Users click or press “enter” in the search box Should not find all items in the index Do nothing Go to a simple search page Show an error dialog
