1. Describes how a basic search engine works. How a Search Engine Works Reehaz Soobhany (0920302) Strategic e-Marketing University of Mauritius 2010
2. Search Engines Introduction Nearly everyone who uses the internet today uses a search engine. There are several types of search engines: Crawler Based (Google, Yahoo), Human Directories (Open Directory, Yahoo! Directory), Hybrid, and Meta Search Engine (Ask.com).
3. Crawler Based Search Engine Core Operations: Web Crawling (aka the spider): follows every link in a page recursively and downloads each page. Indexing: creates the inverted file. Searching: searches through the inverted (indexed) file according to the user's query.
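The crawling step can be sketched as follows. This is a minimal illustration, not how any particular search engine does it: the `crawl` function, the `fetch` callable and the `FAKE_WEB` dictionary are hypothetical names, and an in-memory dictionary stands in for real HTTP requests. A queue with a visited check replaces literal recursion so that pages linking back to each other do not cause an endless loop.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Follow links breadth-first from start_url; return {url: html}.

    `fetch` is a hypothetical callable returning a page's HTML, or None
    if the page cannot be downloaded.
    """
    downloaded = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in downloaded:
            continue  # already visited; avoids looping on cyclic links
        html = fetch(url)
        if html is None:
            continue
        downloaded[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        queue.extend(parser.links)
    return downloaded


# Toy in-memory "web" standing in for real HTTP requests (assumption).
FAKE_WEB = {
    "a.html": '<a href="b.html">B</a> <a href="c.html">C</a>',
    "b.html": '<a href="a.html">back to A</a>',
    "c.html": "no links here",
}

pages = crawl("a.html", FAKE_WEB.get)
print(sorted(pages))
```

A real spider would also resolve relative URLs, respect robots.txt and limit its request rate; those concerns are left out here.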
4. Indexing Steps: normalize documents; delete stop words; stem words; create index entries; calculate weights; update the inverted file.
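As a rough sketch of the last steps, an inverted file can be modelled as a mapping from each term to the documents that contain it. The `add_document` helper and the document ids below are hypothetical, and the terms are assumed to have already been normalized, stop-word filtered and stemmed.

```python
from collections import defaultdict


def add_document(inverted_file, doc_id, terms):
    """Add one document's terms to the inverted file, counting occurrences."""
    for term in terms:
        inverted_file[term][doc_id] = inverted_file[term].get(doc_id, 0) + 1


# term -> {doc_id: number of occurrences in that document}
inverted_file = defaultdict(dict)

# Terms assumed to be already normalized, stop-word filtered and stemmed.
add_document(inverted_file, "doc1", ["head", "level", "one"])
add_document(inverted_file, "doc2", ["level", "two", "level"])

print(dict(inverted_file))
```

Looking up a term then gives every document containing it directly, without scanning the documents themselves.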
5. Document Normalization Original: <H1> This is a Heading Level One </H1> After case folding: <h1> this is a heading level one </h1> After extracting the core document text from the file: this is a heading level one
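The two normalization steps on this slide could be sketched like this, using Python's standard `html.parser` module. The `normalize` function and `TextExtractor` class are illustrative names, not part of any search engine's API.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def normalize(html):
    """Case-fold the document, then keep only its text content."""
    parser = TextExtractor()
    parser.feed(html.lower())
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace left behind by removed markup.
    return " ".join(" ".join(parser.chunks).split())


print(normalize("<H1> This is a Heading Level One </H1>"))
# prints: this is a heading level one
```

This sketch keeps all text, including the contents of script or style elements, which a real indexer would probably want to drop.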
6. Delete Stop Words Stop words are words which have little value in finding a relevant document. Examples of stop words are: a, are, is, when, how… Deleting them helps save resources and avoids creating indexes that are too big and irrelevant. Result: heading level one
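Stop-word deletion amounts to filtering against a fixed list, as in the sketch below. The list holds the slide's examples plus "this", which is added here as an assumption so that the result matches the slide; real stop-word lists are much longer.

```python
# Slide examples plus "this" (an assumption, added to match the slide's result).
STOP_WORDS = {"a", "are", "is", "when", "how", "this"}


def remove_stop_words(text):
    """Split text on whitespace and drop every stop word."""
    return [word for word in text.split() if word not in STOP_WORDS]


print(remove_stop_words("this is a heading level one"))
# prints: ['heading', 'level', 'one']
```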
7. Word Stemming & Index Entries Word stemming removes the suffixes from words. It adds efficiency to the index file and also matches the meaning rather than the exact word. Inflectional suffixes: -s, -es, -ed. Derivational suffixes: -ing, -able, -aciousness, -ability. Result: head level one
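A deliberately naive suffix stripper illustrates the idea using only the suffixes listed on this slide. The `stem` function and the minimum stem length are assumptions made for this sketch; production stemmers such as the Porter algorithm apply many more rules and conditions.

```python
# Suffixes from the slide, longest first so "-es" is tried before "-s".
SUFFIXES = sorted(
    ["s", "es", "ed", "ing", "able", "aciousness", "ability"],
    key=len,
    reverse=True,
)
MIN_STEM_LENGTH = 3  # assumption: avoids reducing short words to nothing


def stem(word):
    """Strip the first (longest) matching suffix, if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


print([stem(w) for w in ["heading", "level", "one"]])
# prints: ['head', 'level', 'one']
```

Because "searched", "searching" and "searches" would all reduce to "search", a single index entry can serve every variant.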
8. Calculate Weights Usually a secret algorithm of the search engine. Some typical schemes used: placement in a document (a word in a heading level 1 will have a greater weight than one in a heading level 2 or in normal text); the number of other documents which refer to this document; whether the document is authoritative writing.
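Since the actual algorithms are secret, the following is only a toy sketch of the first two schemes. Every number in it is invented for illustration: the placement weights, the 10% boost per referring document and the `term_weights` function are all assumptions.

```python
from collections import Counter

# Hypothetical placement weights: real engines keep theirs secret.
PLACEMENT_WEIGHT = {"h1": 3.0, "h2": 2.0, "body": 1.0}


def term_weights(sections, inbound_links=0):
    """Score each term by where it appears, boosted by inbound links.

    `sections` is a list of (placement, terms) pairs.
    """
    weights = Counter()
    for placement, terms in sections:
        for term in terms:
            weights[term] += PLACEMENT_WEIGHT[placement]
    # Assumed boost: +10% per document that refers to this one.
    boost = 1.0 + 0.1 * inbound_links
    return {term: weight * boost for term, weight in weights.items()}


doc = [("h1", ["search", "engine"]), ("body", ["engine", "index"])]
print(term_weights(doc, inbound_links=2))
```

In this example "engine" scores highest because it appears both in the heading and in the body text, while "index" appears only in the body.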
10. Query Processor When the user types a query in the search engine, the search engine recognises the terms and operators. It runs the query against the inverted file. It ranks the results, again with the secret algorithm of the search engine, using the weights on each word. It returns the results to the user. Voilà
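The query steps can be sketched against a small hand-built inverted file. The query syntax (terms combined with AND unless the word OR appears), the scoring rule (sum of stored weights) and the names `search` and `INVERTED_FILE` are simplifying assumptions; query terms are also assumed to be already normalized and stemmed.

```python
def search(query, inverted_file):
    """Return doc ids matching the query, best-scoring first.

    Terms are combined with AND unless the query contains the
    operator OR (a simplified, assumed query syntax).
    """
    words = query.lower().split()
    use_or = "or" in words
    terms = [w for w in words if w not in ("and", "or")]
    if not terms:
        return []

    postings = [inverted_file.get(term, {}) for term in terms]
    doc_sets = [set(p) for p in postings]
    matches = set.union(*doc_sets) if use_or else set.intersection(*doc_sets)

    # Score = sum of the stored weights of the query terms in the document.
    scores = {doc: sum(p.get(doc, 0.0) for p in postings) for doc in matches}
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# term -> {doc_id: weight}; the weights are made-up illustration values.
INVERTED_FILE = {
    "head": {"doc1": 3.0, "doc2": 1.0},
    "level": {"doc1": 1.0, "doc2": 2.0, "doc3": 1.0},
}

print(search("head level", INVERTED_FILE))     # prints: ['doc1', 'doc2']
print(search("head OR level", INVERTED_FILE))  # prints: ['doc1', 'doc2', 'doc3']
```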