SlideShare ist ein Scribd-Unternehmen logo
1 von 34
Downloaden Sie, um offline zu lesen
Mag. iur. Dr. techn. Michael Sonntag




                                 Search engines

                Methods, advertisements, website integration



                                                          Institute for Information Processing and
                                                                 Microprocessor Technology (FIM)
                                                         Johannes Kepler University Linz, Austria
      E-Mail: sonntag@fim.uni-linz.ac.at
      http://www.fim.uni-linz.ac.at/staff/sonntag.htm

© Michael Sonntag 2004
?                      ?
                                                       ?
                             Questions?
                  ?
                             Please ask immediately!

                         ?
                                                   ?
© Michael Sonntag 2004
Introduction

         How search engines work
                  Spiders/Bots
                  Indexing
                  Ranking
         Search engine spamming
                  What to avoid to receive no penalties...
         Search engines for own websites
                  Frontpage
                  Apache Lucene
                  Commercial software




Michael Sonntag                                                                  3
                                                                Search engines
Different quot;Search enginesquot;

         Crawlers: Automatically indexing the web
                  Visits all reachable pages in the Internet and indexes them
         Directories: Humans look for interesting pages
                  Manual classification: Hierarchy of topics needed
                  High quality: Everything is manually verified
                   » This takes care of a general view only (=page is on the topic it
                     states it is about)
                   » Whether the content is legal, correct, useful, etc. is NOT verified!
                  Slow: Lots of human resources required
                   » Cannot keep up with the growth of the Internet!
                  Expensive: Because of manual visits and revisits
                  Very important for special areas!
                  Now almost no importance for general use
         Mixed versions
Michael Sonntag                                                                             4
                                                                           Search engines
Spiders

         Actually creating the indices of crawler-type engines
                  Requires starting points: Entered by owners of webpages
                  Visits the webpage and indexes it, extracts all links, adds the
                  new links to the list of pages to visit
                   » Exponential growth; massively parallel; extreme internet
                     connection required; with special care distribution possible
                   » This might not find all links, e.g. links constructed by JavaScript
                     are usually not found (those mentioned in JavaScript are!)
                  Regularly revisits all pages for changes
                   » Strategies for timeframe exist
                   » Employ hashmarks/date of last change to avoid reindexing
                  Pages created through forms will not be visited!
                   » Spiders can only manage to quot;readquot; ordinary pages
                   » Filling in form is impossible (what to fill in where?)
                  Frames and image maps can also cause problems engines
Michael Sonntag                                                                            5
                                                              Search
quot;robots.txtquot;

         Allows administrators to forbid indexing/crawling of pages
         This is a single file for the whole server
                  Must be in the top-level directory!
                   » Exact name: http://<domain name>/robots.txt
                  Alternative: Specify in Meta-Tags of a page
         Robots.txt Format:
                  quot;User-agent: quot; Name of robot to restrict; use quot;*quot; for all
                  quot;Disallow: quot; Partial URL which is forbidden to visit
                   » Any URL starting with exactly this string will be omitted
                      – quot;Disallow: /helpquot; forbids quot;/help.htmquot; and quot;/help/index.htmquot;
                  quot;Allow: quot; Partual URL which may be visited
                   » Not in original standard!
                  Visit-time, Request-rate are other new directives
         Most robots actually follow this standard and respect it!
Michael Sonntag                                                                                 6
                                                                               Search engines
No-robots Meta-Tags

         Can be added into HTML pages as Meta-Tags:
                  <META NAME=quot;ROBOTSquot; CONTENT=quot;INDEX,FOLLOWquot;>
                      – Alternative: CONTENT=quot;ALLquot;
                   » Index page, also handle all linked pages
                  <META NAME=quot;ROBOTSquot; CONTENT=quot;NOINDEX,FOLLOWquot;>
                   » Do not index this page, but handle all linked pages
                  <META NAME=quot;ROBOTSquot; CONTENT=quot;INDEX,NOFOLLOWquot;>
                   » Index this page, but do not follow any links
                  <META NAME=quot;ROBOTSquot; CONTENT=quot;NOINDEX,NOFOLLOWquot;>
                      – Alternative: CONTENT=quot;NONEquot;
                   » Do not index this page and do not follow any links
                  Follow: Follow the links in the page. This is not affected by
                  the hierarchy (e.g. pages on level deper on the server)!
                  Non-HTML pages: Must use robots.txt
                   » No quot;externalquot; metadata defined!
Michael Sonntag                                                                            7
                                                                          Search engines
Indexing

         Indexing is extracting the content and storing it
                  Assigning the word to the page under which it will be found
                  later on when users are searching
         Uses similar techniques as handling actual queries
                  Stopword lists: What words do not contribute to the meaning
                   » Examples: a, an, in, the, we, you, do, and, ...
                  Word stemming: Creating a canonical form
                   » E.g. quot;wordsquot;    quot;wordquot;, quot;swimmingquot;      quot;swimquot;, ...
                  Thesaurus: Words with identical/similar meaning; synonyms
                   » Used probably only for queries!
                  Capitalization: Mostly ignored (content important, not writing)
         Some search engines also index different file types
                  E.g. Google also indexes PDF files
                  Multimedia content very rarely indexed (e.g. videos???)
Michael Sonntag                                                                             8
                                                                           Search engines
From text to index

          HTML
                             Field                                    Indexing
                                                   Metadata
                                                                                      Index
                           extraction
            PDF
                                 Content


                       Text            Tokenizer            Word             Word
                    extraction                            filtering        stemming


         Text extraction: Retrieving the plain content text
         Tokenizer: Splitting up in individual words
         Word filtering: Stop words, lowercase
         Word stemming: Removing suffixes, different forms, etc.
         Field extraction: Identifying separate parts
                  E.g. text vs. metadata
Michael Sonntag                                                                                   9
                                                                                 Search engines
Page factors

         Word frequency: A word is the more important, the more
         often it occurs on a page
                  Also scans for ALT tags of images and words in the URL
                  Modified according to the location: title, headlines, text,...
                   » Higher on the page = better
                  Clustering: How many quot;nearbyquot; pages contain the same word
                   » quot;Website themesquot;: Related webpages should be linked
                  Meta-Tags: Might be used as quot;importantquot;, just text or ignored
                  Distance between words: When searching for several words
                   » quot;gadget creatorquot; will match better than quot;creator for gadgetsquot;
         In-Link frequency: How many pages link to this page
                  Mostly those from different domain names used only!
                  Might also depend on keywords on that pages
                  The most important figure currently ( Google!)
Michael Sonntag                                                                           10
                                                                         Search engines
Page factors

         Page design: Load time, frames, HTML conformity, ...
                  Some elements cannot be handled (well), e.g. Frames
                  Size of the page (=loadtime) also has influence
                  HTML conformity is not used directly, but if parsing is not
                  possible or produces problems, the page might be ignored
         Visit frequency: If possible to determine (rare)
                  How often is the site visited through the SE?
                  How long till the user clicks on the next search result?
         Payment: Search engines also sell placement
                  Nowadays only possible with explicit marking (as paid-for)
         Update frequency: Regular updates/changes = quot;livequot; site
               Differs much between various search engines!
         Avoid spamming, this reduces the page value enormously!
Michael Sonntag                                                                       11
                                                                     Search engines
Searching

           Form            Query                  Result list
                                                                   Response
           data           analyzer                generation




           Word            Word       Searching     Sorting        Caching
         filtering       stemming

         Query analyzer: Breaking down into individual clauses
                  Clause: Terms connected by AND, OR, NEAR, ...
         Word filtering: Stop words, lowercase
         Word stemming: Removing suffixes, different forms, etc.
         Caching: For next page or refined searches

Michael Sonntag                                                                    12
                                                                  Search engines
Search engine spamming
                                                                    (1)

         Artificially trying to improve the position on the result page
                  Important: Through unfair practices!
                   » =deceiving the relevancy algorithm
         Pages decided to use spamming are heavily penalized or
         excluded completely (there is no quot;appealquot; procedure!)
         Test: Would the technique be used even if there were no
         search engine around at all?
         Examples for spamming:
                  Repetition of keywords: quot;spam, spam, spam, spamquot;
                   » Both after each other or just excessively
                  Separate pages for spiders (e.g. by user agent field)
                   » They might try retrieving the page in several ways
                  Invisible text: white (or light gray) on white
                   » Through font color, CSS, invisible layers, ...
Michael Sonntag                                                                            13
                                                                          Search engines
Search engine spamming
                                                                   (2)

         More spamming examples:
                  Misusing tags: Difficulty: What is spam and what is not?
                   » noframes, noscript, longdesc,... tags for spam content
                   » DC metadata the same
                  Very small and very long text: quot;Nearlyquot; invisible!
                  Identical pages linked to each other or mirror sites
                   » One page accessible through several URLs
                   » To create themes or as link frams (see below)
                  Excessive submissions (submission of URLs to crawl)
                   » Be careful with submission programs!
                  Meta refresh tags/300 error codes/JavaScript
                   » E.g. <body onMouseOver=quot;eval(unescape('.....'))quot;>
                   » Used to present something other to the spider (initial page) than
                     to the user (page redirected to);if required server side redirects
                  Code swapping: One page for index, later change content
Michael Sonntag                                                                           14
                                                                         Search engines
Search engine spamming
                                                                      (3)

         More spamming examples:
                  Cloaking: Returning different pages according to domain
                  name and/or IP of the requester
                   » IP adresses/names of search engine spiders are known
                  Link farms: Network of pages under different domain names
                   » Sole purpose: Creating external links through heavy cross links
                      – Graph theory used to determine them (closed group of heavily
                        interconnected sites with almost no external links)
                  Irrelevant keywords: No connection to text (e.g. quot;sexquot;)
                   » E.g. in meta tag but not in text; just to attract traffic
                  Meta refresh tags: Automatically moving to another page
                   » Used to present something other to the spider (initial page) than
                     to the user (page redirected to)
                   » If required use server side redirects
                  Doorway pages, machine generated page loops, WIKI links,...
Michael Sonntag                                                                                   15
                                                                                 Search engines
Semantic Web

         The idea is to improve the web through metadata
                  Describing the content in more detail, according to more
                  properties and relating them to each other
                  Machine understandable information on the content
                   » Danger of a new spam method!
         Allow searching not only for keywords, but also for media
         types, authors, related sites
                  Nevertheless, some parts are already possible through
                  quot;conventionalquot; search engines!
                   » The advantage would be in better certainty
                  The result would also be provably correct!
                   » But only as long as both rules and base data are correct!
         Might be useful, but is still not picking up
                  Requires site owners to add this metadata to their pages
Michael Sonntag                                                                          16
                                                                        Search engines
Commercial search engine services

         Pay for inclusion/submission: Pay to get listed
                  Available for both search engines and directories
                   » More important for directories, however!
                  Usually a flat fee, depending on the speed for inclusion
                  May depend on the content; may be recurring or once
                   » E.g. Yahoo: Ordinary site: US$ 299, quot;Adult contentquot;: US$ 600
                  Usually there is no content/ranking/review/... Guarantee
                   » Solely that it will be processed within a short time (5-7 days)!
         Pay for placement: Certain placement guaranteed
                  Now commonly paid per click: Each time a user clicks on link
                   » Previously (before dot-com crash): Pay-per-view
                  Separate from quot;ordinaryquot; links: Else legal problems possible
                  Sometimes rather rare (couldn't find ANY on Yahoo)
                   » Google: Rather common
Michael Sonntag                                                                             17
                                                                           Search engines
Pay per click (PPC)

         Advantages:
                  Low risk: Only real services (=visits) are paid for
                  Targeted visitors: Most campaigns match ads to search words
                  Measurable result: Usually tracking available
                   » To determine whether the visitor actually bought something
                  Total budget can be set
         Problems:
                  Too much/too low success: Prediction difficult
                  Requires exact knowledge of how much a visitor to the site is
                  worth to allow sensible bidding on terms
                  Click fraud: Automatic software or humans do nothing but
                  clicking on paid for links
                   » Affiliate programs making money through this, competitors to
                     exhaust your budget
Michael Sonntag                                                                         18
                                                                       Search engines
Google AdWords

         Paid placement; will show up separately on right hand side
         Cost per Click (CPC); daily upper limit can be set
                  CPC is variable within a user specified range
                   » If range to low it will not show up!
                   » Similar to bidding: The highes bidder will show up
                      – Low bidders will also show up, but only rarely: High CTR improves
         Ranking: Based on CPC and clickthrough rate (CTR)
                  Ads not clicked on will get lower!
         Online performance reports
         Targeting by language and country possible
                  Reduced quot;ad competitionquot; and enhances click rate
                  Negative keywords possible to avoid unwanted showings


Michael Sonntag                                                                              19
                                                                            Search engines
Overture SiteMatch / PrecisionMatch

         Overture: Powers Yahoo, MSN, Alltavista, AllTheWeb, ...
         SiteMatch: Paid inclusion (fast review and inclusion)
                  quot;Quality review processquot;: Probably by experts (good
                  assignment of keywords/categories) and favorably
                   » List of exclusion still applies (e.g. online gambling)
                  Pair per URL, i.e. homepage and subpages are separate
                  Pages are re-crawled every 48 hours
                  Positioning in result by relevance: No quot;moving upquot; or quot;topquot;!
                  Costs: Annual subscription (US$ 49/URL)
                   » Additional pay per click (US$ 0,15/0,3 / click)
         PrecisionMatch: Paid listing (sponsored results list)
                  Position determined by bidding
                  Pricing not available (demo: US$ 0,59 bidding value)
                   » US$ 20 minimum per month???
Michael Sonntag                                                                                20
                                                                              Search engines
Search engine integration

         Local search engine for a single site
                  Can again be of both kinds
                   » Search engine: Special software required
                      – Automatic update (re-crawling)
                      – Configurable: Visual appearence, options, methods, ...
                   » Directory: Manual creation; no special software needed (CMS)
                      – Regular manual updates required
                  Usually search engine is used
                   » Directory is the quot;normalquot; navigation structure
         Necessity for larger sites
                  Difficulty: Often special requirements needed
                   » Full-text search engine for documents
                   » Special search engine for product search
                   » Special result display for forums, blogs, ...
Michael Sonntag                                                                              21
                                                                            Search engines
Features for local search engines
                                                                 (1)

         Language suport: Word stemming, stop words, etc.
                  Also important for user interface (search results)
                  Stop words: Should be customizable
                  Spell checking: For mistyped words
         File types supported: PDF, Word, multimedia files, ....
         Configurable spider: Time of day, server load, etc.
                  Spidering through the web or on the file system level?
                  Can password-protected pages also be crawled?
                  Crawling of personalized pages?
         Search options: Boolean search, exact search, wildcards, ...
                  Quality of search: Difficult to assess, however!
                  Inclusion, exclusion, quot;nearquot; matches, phrase matching,
                  synonyms, acronyms, sound matching
Michael Sonntag                                                                         22
                                                                       Search engines
Features for local search engines
                                                                 (2)

         Admin configurability: Layout customization, user rights,
         definition of categories, file extensions to include,
         description of result items, ...
         User configurability: E.g. Results per page, history of last
         searches, descriptions shown, sub-searches, etc.
         Reports and statistics:
                  Top successful queries: What users are most interested in,
                  but cannot find easily
                  Top unsuccessful queries: What would also be of interest
                   » Or where the search engine failed
                  Referer: On which page they started to search for something
                  Top URLs: Which pages are visited most through searching
         Adheres to quot;robots.txtquot; specification?
Michael Sonntag                                                                    23
                                                                  Search engines
Features for local search engines
                                                                  (3)

         Indexing restrictions: Excluding parts form crawling/indexing
                  Internal/private pages!
         Relevancy configuration: Weight of individual elements
                  E.g. if everywhere good metadata is in, this can receive high
                  priority; title tag, links, usage statistics, custom priority, etc.
         Server based, appliance or local: Where is the engine?
         Additional features:
                  Automatic site map: Hierarchy/links from where to where
                  Automatic quot;What's newquot; list
                  Highlighting: Highlight search words in result list and/or
                  actual result pages
         quot;Add-onsquot;: Free offers usually contain advertisements

Michael Sonntag                                                                           24
                                                                         Search engines
Jakarta Lucene

         Free search engine; pure Java
                  Open source; freely available
         Features:
                  Incremental indexing
                   » Indexing only new or changed documents; can remove deleted
                     documents from index
                  Searching: Boolean and phrase queries, date-range
                   » Field searching (e.g. in quot;titlequot; or in quot;textquot;)
                   » Fuzzy queries: Small mistypings can be ignored
                  Universally usable: Searching for files in directories, web
                  page searches, offline documentation, etc.
                   » No webserver needed; also possible as stand-alone
                  Completely customizable

Michael Sonntag                                                                        25
                                                                      Search engines
Jakarta Lucene:
                                                           Missing features

          quot;Plug&Playquot;: Installing, configuring and working site search
                  Not available: Programm needed for indexing, field definition
          Complicated search options: sound (but see quot;Phonetixquot;),
          synonyms, acronyms
          Spell checking not available
          No spider component
                  Examples contain filesystem spider is basic form
                   » Problems with path differences (webserver ↔ index) possible
                  quot;robots.txtquot; not supported
          Reports/statistics must be manually programmed
          No file types supported: Example contains HTML
                  Word, PDF, etc. easily added, however!
         Not easily deployed, but good idea for special applications!
Michael Sonntag                                                                         26
                                                                       Search engines
Search Engine Optimization (SEO)

         Paid services to optimize a website for search engines
                  Usually also includes submission to many search engines
                   » How man search engines are really of any importance today?
                      – These are very few: They can be quot;fedquot; by hand also easily!
                   » Doesn't work to well for directories: Long time without payment;
                     important directories are small and specialized ones, which are
                     probably not covered
                  Often contains rank guarantees
                   » This is to be taken with very much caution: They really cannot
                     guarantee this, therefore illegal methods or spamming is used
                      – E.g. Link farms, to provide this rank once for a short time
                  First page/<50: Are users still convinced that everything
                  important is there?
                   » Many can do without top listing; especially very generic terms!

Michael Sonntag                                                                                 27
                                                                               Search engines
Promoting your website
                                                         My thoughts

         Avoid all pitfalls and search engine spamming
                  Use tools to verify suitability of webpages
                  Keywords, keyword density; HTML verification; style guides;...
         Focus on specific terms and phrases as keywords
                  Avoid quot;sexquot;, quot;carquot;, quot;computersquot;; use quot;used porsche carsquot;, ...
         Provide valuable information to visitors
                  FAQ, hints, comparisons, ...: This will get you external links
         Submit to the 5 top most crawlers and the 10 most important
         (for your business!) directories
         Wait long time to get listed before re-submitting
                  Currently some engines take up to two month
         Do not depend on visitors: Focus on customers!
                  E.g. avoid ad selling
                  Use other advertisement avenues too, if possible
Michael Sonntag                                                                        28
                                                                      Search engines
Legal aspects

         Search engine quot;optimizationquot; can lead to legal problems
         Examples are:
                  Very general terms, e.g. quot;divorceattorney.comquot;
                   » Channeling users away from competitors
                   » Decisions mixed: Sometimes allowed, sometimes not
                      – Depends also on content: Does it claim to be quot;the only onequot;?
                  Unrelated words in meta-tags
                   » E.g. using quot;legal studiesquot; for a website selling robes for judges
                      – Probably allowed as long as no channeling takes place
                  Trademarks: Using quot;foreignquot; trademarks in meta-tags
                  Liability for links to illegal content
                  Search engines and copyright

                   Discussion according to EU (and Austrian) law!
Michael Sonntag                                                                              29
                                                                            Search engines
Trademarks
         Using trademarks, service marks, common law marks, etc.
         of competitors on the own site (or in own meta-tags)
                   » This applies only to commercial websites
                  Depends on whether there is a legally acceptable reason for
                  inclusion on the webpages
                   » Example: Allowed product comparison, other suppliers, selling
                     those products, ...
                  In general this is very dangerous/forbidden
                   » Trademark law: Illegal quot;usequot; of a mark
                   » Competition law: quot;Freeridingquot; on others fame
                   » Competition law: Customer diversion
         Search engines: Results may contain the trademark
                  Can even contain links to infringing sites; see below
         Legal status not completely decided; invisibility when using
         in a meta-tag is the best argument to no liability
Michael Sonntag                                                                         30
                                                                       Search engines
Search engine liability

         Liability for the links itself depend on several elements
                  Directories provide pre-verification
                   » See link liability below!
                  Paid for links are selected; depending on selling method
                  knowledge may or may not exist by default
                   » One of the exceptions from the no-liability clause:
                      – quot;Selection or modification of contentquot;
                   » Hosting: Full liability as soon as knowledge of illegal content is
                     present; illegality must be clear even for laymen
                  Automatically gathered and presented links are privileged
                   » Only for foreign content (site search engines    full liability)
                   » This even applies if knowledge of illegal content is present!
                      – Exception: Knowlingy furthering the illegal activity
         No obligation to actively search for anything illegal!
Michael Sonntag                                                                                 31
                                                                               Search engines
Link liability

         General liability of links; also applies to directories
                  Any time a link is sonsciously created
         Liability by default if the content linked to is quot;integratedquot;
                  = made to own content (link just to avoid re-typing it)
                  Depends on the context of the link
         No liability for foreign information if
                  No knowledge of illegality of content
                   » If something illegal is in a directory, knowledge will exist through
                     pre-approval process!
                   » quot;Illegalityquot; must be obvious to laymen; suspicions insufficient
                  If knowledge attained, immediately takes step to remove link
                   » This will be quite fast as the Internet is a fast medium!
         No obligation to actively search for anything illegal!
Michael Sonntag                                                                             32
                                                                           Search engines
Search engines and copyright
                                            Image thumbnail archives

         Google also indexes images
                  This in itself is no problem
         For preview, a smaller version of the image is created and
         stored locally; this is shown alongside the link
                  Reducing an image's size is a modification of the original
                  picture and therefore requires the owners permission
                   » Unless the content is no longer discernible
                  Therefore this practice is forbidden!
         Similarity: Extracting the description of the webpage and
         showing it with the link
                  This is however a privileged use (quot;small citationquot;)
                   » Additionally this is the intended use for the meta tag!
         Currently ongoing litigation; final decision pending!
                  http://www.jurpc.de/rechtspr/20040146.pdf
Michael Sonntag                                                                            33
                                                                          Search engines
Literature

         Robot Exclusion Standard:
         http://www.conman.org/people/spc/robots2.html
         http://www.robotstxt.org/
         Jared M. Spool: Why On-Site Searching Stinks
         http://www.uie.com/articles/search_stinks/




Michael Sonntag                                                            34
                                                          Search engines

Weitere ähnliche Inhalte

Andere mochten auch

History of the internet and search engine marketing malaysia, search engine o...
History of the internet and search engine marketing malaysia, search engine o...History of the internet and search engine marketing malaysia, search engine o...
History of the internet and search engine marketing malaysia, search engine o...Ajitpal Singh
 
Introduction to Search Engines
Introduction to Search EnginesIntroduction to Search Engines
Introduction to Search EnginesNitin Pande
 
Search engines and its types
Search engines and its typesSearch engines and its types
Search engines and its typesNagarjuna Kalluru
 
Internet and search engine
Internet and search engineInternet and search engine
Internet and search engineDeepak John
 
Confronting the Future--Strategic Visions for the 21st Century Public Library
Confronting the Future--Strategic Visions for the 21st Century Public LibraryConfronting the Future--Strategic Visions for the 21st Century Public Library
Confronting the Future--Strategic Visions for the 21st Century Public LibraryMarc Gartler
 
Concurrent programming in iOS
Concurrent programming in iOSConcurrent programming in iOS
Concurrent programming in iOSDongxu Yao
 
Staff study talk/ on search engine & internet in 2008
Staff study talk/ on search engine & internet in 2008Staff study talk/ on search engine & internet in 2008
Staff study talk/ on search engine & internet in 2008Sujit Chandak
 
PThreads Vs Win32 Threads
PThreads  Vs  Win32 ThreadsPThreads  Vs  Win32 Threads
PThreads Vs Win32 ThreadsRobert Sayegh
 
Knowledge Management In The Health Care Industry Antenatal Care From The Pa...
Knowledge Management In The Health Care Industry   Antenatal Care From The Pa...Knowledge Management In The Health Care Industry   Antenatal Care From The Pa...
Knowledge Management In The Health Care Industry Antenatal Care From The Pa...jessica
 
Manuele Margni, CIRAIG - Behind the Water Footprint Stream: Metrics and Initi...
Manuele Margni, CIRAIG - Behind the Water Footprint Stream: Metrics and Initi...Manuele Margni, CIRAIG - Behind the Water Footprint Stream: Metrics and Initi...
Manuele Margni, CIRAIG - Behind the Water Footprint Stream: Metrics and Initi...CWS_2010
 
Debugging and Tuning Mobile Web Sites with Modern Web Browsers
Debugging and Tuning Mobile Web Sites with Modern Web BrowsersDebugging and Tuning Mobile Web Sites with Modern Web Browsers
Debugging and Tuning Mobile Web Sites with Modern Web BrowsersTroy Miles
 
Windows 8 and the Cloud
Windows 8 and the CloudWindows 8 and the Cloud
Windows 8 and the CloudDavid Pallmann
 
When worlds Collide: HTML5 Meets the Cloud
When worlds Collide: HTML5 Meets the CloudWhen worlds Collide: HTML5 Meets the Cloud
When worlds Collide: HTML5 Meets the CloudDavid Pallmann
 
Internet Search Tips (Google)
Internet Search Tips (Google)Internet Search Tips (Google)
Internet Search Tips (Google)Lisa Hartman
 
Using the internet for search
Using the internet for searchUsing the internet for search
Using the internet for searchDr-Heba Mustafa
 
Internet Search
Internet SearchInternet Search
Internet SearchD Houseman
 

Andere mochten auch (20)

History of the internet and search engine marketing malaysia, search engine o...
History of the internet and search engine marketing malaysia, search engine o...History of the internet and search engine marketing malaysia, search engine o...
History of the internet and search engine marketing malaysia, search engine o...
 
Introduction to Search Engines
Introduction to Search EnginesIntroduction to Search Engines
Introduction to Search Engines
 
Ppt on internet
Ppt on internetPpt on internet
Ppt on internet
 
Search engines and its types
Search engines and its typesSearch engines and its types
Search engines and its types
 
Internet and search engine
Internet and search engineInternet and search engine
Internet and search engine
 
Confronting the Future--Strategic Visions for the 21st Century Public Library
Confronting the Future--Strategic Visions for the 21st Century Public LibraryConfronting the Future--Strategic Visions for the 21st Century Public Library
Confronting the Future--Strategic Visions for the 21st Century Public Library
 
Phase1 review ws-intro-2_km&julian
Phase1 review ws-intro-2_km&julianPhase1 review ws-intro-2_km&julian
Phase1 review ws-intro-2_km&julian
 
Concurrent programming in iOS
Concurrent programming in iOSConcurrent programming in iOS
Concurrent programming in iOS
 
Staff study talk/ on search engine & internet in 2008
Staff study talk/ on search engine & internet in 2008Staff study talk/ on search engine & internet in 2008
Staff study talk/ on search engine & internet in 2008
 
Pthread
PthreadPthread
Pthread
 
PThreads Vs Win32 Threads
PThreads  Vs  Win32 ThreadsPThreads  Vs  Win32 Threads
PThreads Vs Win32 Threads
 
Knowledge Management In The Health Care Industry Antenatal Care From The Pa...
Knowledge Management In The Health Care Industry   Antenatal Care From The Pa...Knowledge Management In The Health Care Industry   Antenatal Care From The Pa...
Knowledge Management In The Health Care Industry Antenatal Care From The Pa...
 
Posix Threads
Posix ThreadsPosix Threads
Posix Threads
 
Manuele Margni, CIRAIG - Behind the Water Footprint Stream: Metrics and Initi...
Manuele Margni, CIRAIG - Behind the Water Footprint Stream: Metrics and Initi...Manuele Margni, CIRAIG - Behind the Water Footprint Stream: Metrics and Initi...
Manuele Margni, CIRAIG - Behind the Water Footprint Stream: Metrics and Initi...
 
Debugging and Tuning Mobile Web Sites with Modern Web Browsers
Debugging and Tuning Mobile Web Sites with Modern Web BrowsersDebugging and Tuning Mobile Web Sites with Modern Web Browsers
Debugging and Tuning Mobile Web Sites with Modern Web Browsers
 
Windows 8 and the Cloud
Windows 8 and the CloudWindows 8 and the Cloud
Windows 8 and the Cloud
 
When worlds Collide: HTML5 Meets the Cloud
When worlds Collide: HTML5 Meets the CloudWhen worlds Collide: HTML5 Meets the Cloud
When worlds Collide: HTML5 Meets the Cloud
 
Internet Search Tips (Google)
Internet Search Tips (Google)Internet Search Tips (Google)
Internet Search Tips (Google)
 
Using the internet for search
Using the internet for searchUsing the internet for search
Using the internet for search
 
Internet Search
Internet SearchInternet Search
Internet Search
 

Ähnlich wie Internet search engine

Site Architecture Best Practices for Search Findability - Adam Audette
Site Architecture Best Practices for Search Findability - Adam AudetteSite Architecture Best Practices for Search Findability - Adam Audette
Site Architecture Best Practices for Search Findability - Adam AudetteAdam Audette
 
Technical SEO: Crawl Space Management - SEOZone Istanbul 2014
Technical SEO: Crawl Space Management - SEOZone Istanbul 2014Technical SEO: Crawl Space Management - SEOZone Istanbul 2014
Technical SEO: Crawl Space Management - SEOZone Istanbul 2014Bastian Grimm
 
Search Engine Optimization (Seo) for Developers
Search Engine Optimization (Seo) for DevelopersSearch Engine Optimization (Seo) for Developers
Search Engine Optimization (Seo) for DevelopersMatthew Robinson
 
SEO & IA - A findability challenge
SEO & IA - A findability challengeSEO & IA - A findability challenge
SEO & IA - A findability challengeAlberto Mucignat
 
Advanced Seo Web Development Tech Ed 2008
Advanced Seo Web Development Tech Ed 2008Advanced Seo Web Development Tech Ed 2008
Advanced Seo Web Development Tech Ed 2008Nathan Buggia
 
Charting Searchland, ACM SIG Data Mining
Charting Searchland, ACM SIG Data MiningCharting Searchland, ACM SIG Data Mining
Charting Searchland, ACM SIG Data MiningValeria de Paiva
 
SEO for Ecommerce: A Comprehensive Guide
SEO for Ecommerce: A Comprehensive GuideSEO for Ecommerce: A Comprehensive Guide
SEO for Ecommerce: A Comprehensive GuideAdam Audette
 
SEO Training in Hyderabad | SEO Classes in Hyderbad | SEO Coaching in Hyde...
SEO Training in Hyderabad |  SEO  Classes in Hyderbad | SEO Coaching in  Hyde...SEO Training in Hyderabad |  SEO  Classes in Hyderbad | SEO Coaching in  Hyde...
SEO Training in Hyderabad | SEO Classes in Hyderbad | SEO Coaching in Hyde...Prasad Reddy
 
Inbound Marketing Tools - SearchFest
Inbound Marketing Tools - SearchFestInbound Marketing Tools - SearchFest
Inbound Marketing Tools - SearchFestJustin Briggs
 
Demand Quest SEO Training - Session 2
Demand Quest SEO Training - Session 2Demand Quest SEO Training - Session 2
Demand Quest SEO Training - Session 2Nate Plaunt
 
Seo Beginners Slide Show
Seo Beginners Slide ShowSeo Beginners Slide Show
Seo Beginners Slide ShowTin180 VietNam
 
Demand Quest SEO training session 2
Demand Quest SEO training session 2Demand Quest SEO training session 2
Demand Quest SEO training session 2Nate Plaunt
 
Chewy Trewella - Google Searchtips
Chewy Trewella - Google SearchtipsChewy Trewella - Google Searchtips
Chewy Trewella - Google Searchtipssounddelivery
 

Ähnlich wie Internet search engine (20)

Site Architecture Best Practices for Search Findability - Adam Audette
Site Architecture Best Practices for Search Findability - Adam AudetteSite Architecture Best Practices for Search Findability - Adam Audette
Site Architecture Best Practices for Search Findability - Adam Audette
 
Seo
SeoSeo
Seo
 
Technical SEO: Crawl Space Management - SEOZone Istanbul 2014
Technical SEO: Crawl Space Management - SEOZone Istanbul 2014Technical SEO: Crawl Space Management - SEOZone Istanbul 2014
Technical SEO: Crawl Space Management - SEOZone Istanbul 2014
 
Seo Manual
Seo ManualSeo Manual
Seo Manual
 
Search Engine Optimization (Seo) for Developers
Search Engine Optimization (Seo) for DevelopersSearch Engine Optimization (Seo) for Developers
Search Engine Optimization (Seo) for Developers
 
SEO & IA - A findability challenge
SEO & IA - A findability challengeSEO & IA - A findability challenge
SEO & IA - A findability challenge
 
Advanced Seo Web Development Tech Ed 2008
Advanced Seo Web Development Tech Ed 2008Advanced Seo Web Development Tech Ed 2008
Advanced Seo Web Development Tech Ed 2008
 
CAB 2.pptx
CAB 2.pptxCAB 2.pptx
CAB 2.pptx
 
Searchland2
Searchland2Searchland2
Searchland2
 
Charting Searchland, ACM SIG Data Mining
Charting Searchland, ACM SIG Data MiningCharting Searchland, ACM SIG Data Mining
Charting Searchland, ACM SIG Data Mining
 
SEO for Ecommerce: A Comprehensive Guide
SEO for Ecommerce: A Comprehensive GuideSEO for Ecommerce: A Comprehensive Guide
SEO for Ecommerce: A Comprehensive Guide
 
Lvr ppt
Lvr pptLvr ppt
Lvr ppt
 
SEO Training in Hyderabad | SEO Classes in Hyderbad | SEO Coaching in Hyde...
SEO Training in Hyderabad |  SEO  Classes in Hyderbad | SEO Coaching in  Hyde...SEO Training in Hyderabad |  SEO  Classes in Hyderbad | SEO Coaching in  Hyde...
SEO Training in Hyderabad | SEO Classes in Hyderbad | SEO Coaching in Hyde...
 
Seo report
Seo reportSeo report
Seo report
 
Inbound Marketing Tools - SearchFest
Inbound Marketing Tools - SearchFestInbound Marketing Tools - SearchFest
Inbound Marketing Tools - SearchFest
 
Demand Quest SEO Training - Session 2
Demand Quest SEO Training - Session 2Demand Quest SEO Training - Session 2
Demand Quest SEO Training - Session 2
 
Seo Beginners Slide Show
Seo Beginners Slide ShowSeo Beginners Slide Show
Seo Beginners Slide Show
 
Foxtail Website Audit
Foxtail Website AuditFoxtail Website Audit
Foxtail Website Audit
 
Demand Quest SEO training session 2
Demand Quest SEO training session 2Demand Quest SEO training session 2
Demand Quest SEO training session 2
 
Chewy Trewella - Google Searchtips
Chewy Trewella - Google SearchtipsChewy Trewella - Google Searchtips
Chewy Trewella - Google Searchtips
 

Kürzlich hochgeladen

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 

Kürzlich hochgeladen (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Internet search engine

  • 1. Mag. iur. Dr. techn. Michael Sonntag Search engines Methods, advertisements, website integration Institute for Information Processing and Microprocessor Technology (FIM) Johannes Kepler University Linz, Austria E-Mail: sonntag@fim.uni-linz.ac.at http://www.fim.uni-linz.ac.at/staff/sonntag.htm © Michael Sonntag 2004
  • 2. ? ? ? Questions? ? Please ask immediately! ? ? © Michael Sonntag 2004
  • 3. Introduction How search engines work Spiders/Bots Indexing Ranking Search engine spamming What to avoid to receive no penalties... Search engines for own websites Frontpage Apache Lucene Commercial software Michael Sonntag 3 Search engines
  • 4. Different quot;Search enginesquot; Crawlers: Automatically indexing the web Visits all reachable pages in the Internet and indexes them Directories: Humans look for interesting pages Manual classification: Hierarchy of topics needed High quality: Everything is manually verified » This takes care of a general view only (=page is on the topic it states it is about) » Whether the content is legal, correct, useful, etc. is NOT verified! Slow: Lots of human resources required » Cannot keep up with the growth of the Internet! Expensive: Because of manual visits and revisits Very important for special areas! Now almost no importance for general use Mixed versions Michael Sonntag 4 Search engines
  • 5. Spiders Actually creating the indices of crawler-type engines Requires starting points: Entered by owners of webpages Visits the webpage and indexes it, extracts all links, adds the new links to the list of pages to visit » Exponential growth; massively parallel; extreme internet connection required; with special care distribution possible » This might not find all links, e.g. links constructed by JavaScript are usually not found (those mentioned in JavaScript are!) Regularly revisits all pages for changes » Strategies for timeframe exist » Employ hashmarks/date of last change to avoid reindexing Pages created through forms will not be visited! » Spiders can only manage to quot;readquot; ordinary pages » Filling in form is impossible (what to fill in where?) Frames and image maps can also cause problems engines Michael Sonntag 5 Search
  • 6. quot;robots.txtquot; Allows administrators to forbid indexing/crawling of pages This is a single file for the whole server Must be in the top-level directory! » Exact name: http://<domain name>/robots.txt Alternative: Specify in Meta-Tags of a page Robots.txt Format: quot;User-agent: quot; Name of robot to restrict; use quot;*quot; for all quot;Disallow: quot; Partial URL which is forbidden to visit » Any URL starting with exactly this string will be omitted – quot;Disallow: /helpquot; forbids quot;/help.htmquot; and quot;/help/index.htmquot; quot;Allow: quot; Partual URL which may be visited » Not in original standard! Visit-time, Request-rate are other new directives Most robots actually follow this standard and respect it! Michael Sonntag 6 Search engines
  • 7. No-robots Meta-Tags Can be added into HTML pages as Meta-Tags: <META NAME=quot;ROBOTSquot; CONTENT=quot;INDEX,FOLLOWquot;> – Alternative: CONTENT=quot;ALLquot; » Index page, also handle all linked pages <META NAME=quot;ROBOTSquot; CONTENT=quot;NOINDEX,FOLLOWquot;> » Do not index this page, but handle all linked pages <META NAME=quot;ROBOTSquot; CONTENT=quot;INDEX,NOFOLLOWquot;> » Index this page, but do not follow any links <META NAME=quot;ROBOTSquot; CONTENT=quot;NOINDEX,NOFOLLOWquot;> – Alternative: CONTENT=quot;NONEquot; » Do not index this page and do not follow any links Follow: Follow the links in the page. This is not affected by the hierarchy (e.g. pages on level deper on the server)! Non-HTML pages: Must use robots.txt » No quot;externalquot; metadata defined! Michael Sonntag 7 Search engines
  • 8. Indexing Indexing is extracting the content and storing it Assigning the word to the page under which it will be found later on when users are searching Uses similar techniques as handling actual queries Stopword lists: What words do not contribute to the meaning » Examples: a, an, in, the, we, you, do, and, ... Word stemming: Creating a canonical form » E.g. quot;wordsquot; quot;wordquot;, quot;swimmingquot; quot;swimquot;, ... Thesaurus: Words with identical/similar meaning; synonyms » Used probably only for queries! Capitalization: Mostly ignored (content important, not writing) Some search engines also index different file types E.g. Google also indexes PDF files Multimedia content very rarely indexed (e.g. videos???) Michael Sonntag 8 Search engines
  • 9. From text to index HTML Field Indexing Metadata Index extraction PDF Content Text Tokenizer Word Word extraction filtering stemming Text extraction: Retrieving the plain content text Tokenizer: Splitting up in individual words Word filtering: Stop words, lowercase Word stemming: Removing suffixes, different forms, etc. Field extraction: Identifying separate parts E.g. text vs. metadata Michael Sonntag 9 Search engines
  • 10. Page factors Word frequency: A word is the more important, the more often it occurs on a page Also scans for ALT tags of images and words in the URL Modified according to the location: title, headlines, text,... » Higher on the page = better Clustering: How many quot;nearbyquot; pages contain the same word » quot;Website themesquot;: Related webpages should be linked Meta-Tags: Might be used as quot;importantquot;, just text or ignored Distance between words: When searching for several words » quot;gadget creatorquot; will match better than quot;creator for gadgetsquot; In-Link frequency: How many pages link to this page Mostly those from different domain names used only! Might also depend on keywords on that pages The most important figure currently ( Google!) Michael Sonntag 10 Search engines
  • 11. Page factors Page design: Load time, frames, HTML conformity, ... Some elements cannot be handled (well), e.g. Frames Size of the page (=loadtime) also has influence HTML conformity is not used directly, but if parsing is not possible or produces problems, the page might be ignored Visit frequency: If possible to determine (rare) How often is the site visited through the SE? How long till the user clicks on the next search result? Payment: Search engines also sell placement Nowadays only possible with explicit marking (as paid-for) Update frequency: Regular updates/changes = quot;livequot; site Differs much between various search engines! Avoid spamming, this reduces the page value enormously! Michael Sonntag 11 Search engines
  • 12. Searching Form Query Result list Response data analyzer generation Word Word Searching Sorting Caching filtering stemming Query analyzer: Breaking down into individual clauses Clause: Terms connected by AND, OR, NEAR, ... Word filtering: Stop words, lowercase Word stemming: Removing suffixes, different forms, etc. Caching: For next page or refined searches Michael Sonntag 12 Search engines
  • 13. Search engine spamming (1) Artificially trying to improve the position on the result page Important: Through unfair practices! » =deceiving the relevancy algorithm Pages decided to use spamming are heavily penalized or excluded completely (there is no quot;appealquot; procedure!) Test: Would the technique be used even if there were no search engine around at all? Examples for spamming: Repetition of keywords: quot;spam, spam, spam, spamquot; » Both after each other or just excessively Separate pages for spiders (e.g. by user agent field) » They might try retrieving the page in several ways Invisible text: white (or light gray) on white » Through font color, CSS, invisible layers, ... Michael Sonntag 13 Search engines
  • 14. Search engine spamming (2) More spamming examples: Misusing tags: Difficulty: What is spam and what is not? » noframes, noscript, longdesc,... tags for spam content » DC metadata the same Very small and very long text: quot;Nearlyquot; invisible! Identical pages linked to each other or mirror sites » One page accessible through several URLs » To create themes or as link frams (see below) Excessive submissions (submission of URLs to crawl) » Be careful with submission programs! Meta refresh tags/300 error codes/JavaScript » E.g. <body onMouseOver=quot;eval(unescape('.....'))quot;> » Used to present something other to the spider (initial page) than to the user (page redirected to);if required server side redirects Code swapping: One page for index, later change content Michael Sonntag 14 Search engines
  • 15. Search engine spamming (3) More spamming examples: Cloaking: Returning different pages according to domain name and/or IP of the requester » IP adresses/names of search engine spiders are known Link farms: Network of pages under different domain names » Sole purpose: Creating external links through heavy cross links – Graph theory used to determine them (closed group of heavily interconnected sites with almost no external links) Irrelevant keywords: No connection to text (e.g. quot;sexquot;) » E.g. in meta tag but not in text; just to attract traffic Meta refresh tags: Automatically moving to another page » Used to present something other to the spider (initial page) than to the user (page redirected to) » If required use server side redirects Doorway pages, machine generated page loops, WIKI links,... Michael Sonntag 15 Search engines
  • 16. Semantic Web The idea is to improve the web through metadata Describing the content in more detail, according to more properties and relating them to each other Machine understandable information on the content » Danger of a new spam method! Allow searching not only for keywords, but also for media types, authors, related sites Nevertheless, some parts are already possible through quot;conventionalquot; search engines! » The advantage would be in better certainty The result would also be provably correct! » But only as long as both rules and base data are correct! Might be useful, but is still not picking up Requires site owners to add this metadata to their pages Michael Sonntag 16 Search engines
  • 17. Commercial search engine services Pay for inclusion/submission: Pay to get listed Available for both search engines and directories » More important for directories, however! Usually a flat fee, depending on the speed for inclusion May depend on the content; may be recurring or once » E.g. Yahoo: Ordinary site: US$ 299, quot;Adult contentquot;: US$ 600 Usually there is no content/ranking/review/... Guarantee » Solely that it will be processed within a short time (5-7 days)! Pay for placement: Certain placement guaranteed Now commonly paid per click: Each time a user clicks on link » Previously (before dot-com crash): Pay-per-view Separate from quot;ordinaryquot; links: Else legal problems possible Sometimes rather rare (couldn't find ANY on Yahoo) » Google: Rather common Michael Sonntag 17 Search engines
  • 18. Pay per click (PPC) Advantages: Low risk: Only real services (=visits) are paid for Targeted visitors: Most campaigns match ads to search words Measurable result: Usually tracking available » To determine whether the visitor actually bought something Total budget can be set Problems: Too much/too low success: Prediction difficult Requires exact knowledge of how much a visitor to the site is worth to allow sensible bidding on terms Click fraud: Automatic software or humans do nothing but clicking on paid for links » Affiliate programs making money through this, competitors to exhaust your budget Michael Sonntag 18 Search engines
  • 19. Google AdWords Paid placement; will show up separately on right hand side Cost per Click (CPC); daily upper limit can be set CPC is variable within a user specified range » If range to low it will not show up! » Similar to bidding: The highes bidder will show up – Low bidders will also show up, but only rarely: High CTR improves Ranking: Based on CPC and clickthrough rate (CTR) Ads not clicked on will get lower! Online performance reports Targeting by language and country possible Reduced quot;ad competitionquot; and enhances click rate Negative keywords possible to avoid unwanted showings Michael Sonntag 19 Search engines
  • 20. Overture SiteMatch / PrecisionMatch Overture: Powers Yahoo, MSN, Alltavista, AllTheWeb, ... SiteMatch: Paid inclusion (fast review and inclusion) quot;Quality review processquot;: Probably by experts (good assignment of keywords/categories) and favorably » List of exclusion still applies (e.g. online gambling) Pair per URL, i.e. homepage and subpages are separate Pages are re-crawled every 48 hours Positioning in result by relevance: No quot;moving upquot; or quot;topquot;! Costs: Annual subscription (US$ 49/URL) » Additional pay per click (US$ 0,15/0,3 / click) PrecisionMatch: Paid listing (sponsored results list) Position determined by bidding Pricing not available (demo: US$ 0,59 bidding value) » US$ 20 minimum per month??? Michael Sonntag 20 Search engines
  • 21. Search engine integration Local search engine for a single site Can again be of both kinds » Search engine: Special software required – Automatic update (re-crawling) – Configurable: Visual appearence, options, methods, ... » Directory: Manual creation; no special software needed (CMS) – Regular manual updates required Usually search engine is used » Directory is the quot;normalquot; navigation structure Necessity for larger sites Difficulty: Often special requirements needed » Full-text search engine for documents » Special search engine for product search » Special result display for forums, blogs, ... Michael Sonntag 21 Search engines
  • 22. Features for local search engines (1) Language suport: Word stemming, stop words, etc. Also important for user interface (search results) Stop words: Should be customizable Spell checking: For mistyped words File types supported: PDF, Word, multimedia files, .... Configurable spider: Time of day, server load, etc. Spidering through the web or on the file system level? Can password-protected pages also be crawled? Crawling of personalized pages? Search options: Boolean search, exact search, wildcards, ... Quality of search: Difficult to assess, however! Inclusion, exclusion, quot;nearquot; matches, phrase matching, synonyms, acronyms, sound matching Michael Sonntag 22 Search engines
  • 23. Features for local search engines (2) Admin configurability: Layout customization, user rights, definition of categories, file extensions to include, description of result items, ... User configurability: E.g. Results per page, history of last searches, descriptions shown, sub-searches, etc. Reports and statistics: Top successful queries: What users are most interested in, but cannot find easily Top unsuccessful queries: What would also be of interest » Or where the search engine failed Referer: On which page they started to search for something Top URLs: Which pages are visited most through searching Adheres to quot;robots.txtquot; specification? Michael Sonntag 23 Search engines
  • 24. Features for local search engines (3) Indexing restrictions: Excluding parts form crawling/indexing Internal/private pages! Relevancy configuration: Weight of individual elements E.g. if everywhere good metadata is in, this can receive high priority; title tag, links, usage statistics, custom priority, etc. Server based, appliance or local: Where is the engine? Additional features: Automatic site map: Hierarchy/links from where to where Automatic quot;What's newquot; list Highlighting: Highlight search words in result list and/or actual result pages quot;Add-onsquot;: Free offers usually contain advertisements Michael Sonntag 24 Search engines
  • 25. Jakarta Lucene Free search engine; pure Java Open source; freely available Features: Incremental indexing » Indexing only new or changed documents; can remove deleted documents from index Searching: Boolean and phrase queries, date-range » Field searching (e.g. in quot;titlequot; or in quot;textquot;) » Fuzzy queries: Small mistypings can be ignored Universally usable: Searching for files in directories, web page searches, offline documentation, etc. » No webserver needed; also possible as stand-alone Completely customizable Michael Sonntag 25 Search engines
  • 26. Jakarta Lucene: Missing features quot;Plug&Playquot;: Installing, configuring and working site search Not available: Programm needed for indexing, field definition Complicated search options: sound (but see quot;Phonetixquot;), synonyms, acronyms Spell checking not available No spider component Examples contain filesystem spider is basic form » Problems with path differences (webserver ↔ index) possible quot;robots.txtquot; not supported Reports/statistics must be manually programmed No file types supported: Example contains HTML Word, PDF, etc. easily added, however! Not easily deployed, but good idea for special applications! Michael Sonntag 26 Search engines
  • 27. Search Engine Optimization (SEO) Paid services to optimize a website for search engines Usually also includes submission to many search engines » How man search engines are really of any importance today? – These are very few: They can be quot;fedquot; by hand also easily! » Doesn't work to well for directories: Long time without payment; important directories are small and specialized ones, which are probably not covered Often contains rank guarantees » This is to be taken with very much caution: They really cannot guarantee this, therefore illegal methods or spamming is used – E.g. Link farms, to provide this rank once for a short time First page/<50: Are users still convinced that everything important is there? » Many can do without top listing; especially very generic terms! Michael Sonntag 27 Search engines
  • 28. Promoting your website My thoughts Avoid all pitfalls and search engine spamming Use tools to verify suitability of webpages Keywords, keyword density; HTML verification; style guides;... Focus on specific terms and phrases as keywords Avoid quot;sexquot;, quot;carquot;, quot;computersquot;; use quot;used porsche carsquot;, ... Provide valuable information to visitors FAQ, hints, comparisons, ...: This will get you external links Submit to the 5 top most crawlers and the 10 most important (for your business!) directories Wait long time to get listed before re-submitting Currently some engines take up to two month Do not depend on visitors: Focus on customers! E.g. avoid ad selling Use other advertisement avenues too, if possible Michael Sonntag 28 Search engines
  • 29. Legal aspects Search engine quot;optimizationquot; can lead to legal problems Examples are: Very general terms, e.g. quot;divorceattorney.comquot; » Channeling users away from competitors » Decisions mixed: Sometimes allowed, sometimes not – Depends also on content: Does it claim to be quot;the only onequot;? Unrelated words in meta-tags » E.g. using quot;legal studiesquot; for a website selling robes for judges – Probably allowed as long as no channeling takes place Trademarks: Using quot;foreignquot; trademarks in meta-tags Liability for links to illegal content Search engines and copyright Discussion according to EU (and Austrian) law! Michael Sonntag 29 Search engines
  • 30. Trademarks Using trademarks, service marks, common law marks, etc. of competitors on the own site (or in own meta-tags) » This applies only to commercial websites Depends on whether there is a legally acceptable reason for inclusion on the webpages » Example: Allowed product comparison, other suppliers, selling those products, ... In general this is very dangerous/forbidden » Trademark law: Illegal quot;usequot; of a mark » Competition law: quot;Freeridingquot; on others fame » Competition law: Customer diversion Search engines: Results may contain the trademark Can even contain links to infringing sites; see below Legal status not completely decided; invisibility when using in a meta-tag is the best argument to no liability Michael Sonntag 30 Search engines
  • 31. Search engine liability Liability for the links itself depend on several elements Directories provide pre-verification » See link liability below! Paid for links are selected; depending on selling method knowledge may or may not exist by default » One of the exceptions from the no-liability clause: – quot;Selection or modification of contentquot; » Hosting: Full liability as soon as knowledge of illegal content is present; illegality must be clear even for laymen Automatically gathered and presented links are privileged » Only for foreign content (site search engines full liability) » This even applies if knowledge of illegal content is present! – Exception: Knowlingy furthering the illegal activity No obligation to actively search for anything illegal! Michael Sonntag 31 Search engines
  • 32. Link liability General liability of links; also applies to directories Any time a link is sonsciously created Liability by default if the content linked to is quot;integratedquot; = made to own content (link just to avoid re-typing it) Depends on the context of the link No liability for foreign information if No knowledge of illegality of content » If something illegal is in a directory, knowledge will exist through pre-approval process! » quot;Illegalityquot; must be obvious to laymen; suspicions insufficient If knowledge attained, immediately takes step to remove link » This will be quite fast as the Internet is a fast medium! No obligation to actively search for anything illegal! Michael Sonntag 32 Search engines
  • 33. Search engines and copyright Image thumbnail archives Google also indexes images This in itself is no problem For preview, a smaller version of the image is created and stored locally; this is shown alongside the link Reducing an image's size is a modification of the original picture and therefore requires the owners permission » Unless the content is no longer discernible Therefore this practice is forbidden! Similarity: Extracting the description of the webpage and showing it with the link This is however a privileged use (quot;small citationquot;) » Additionally this is the intended use for the meta tag! Currently ongoing litigation; final decision pending! http://www.jurpc.de/rechtspr/20040146.pdf Michael Sonntag 33 Search engines
  • 34. Literature Robot Exclusion Standard: http://www.conman.org/people/spc/robots2.html http://www.robotstxt.org/ Jared M. Spool: Why On-Site Searching Stinks http://www.uie.com/articles/search_stinks/ Michael Sonntag 34 Search engines