3. WHAT ARE WEB ROBOTS?
Web Robots (also known as Web Wanderers,
Crawlers, or Spiders) are programs that
traverse the Web automatically. Search engines
such as Google use them to index web
content, spammers use them to scan for email
addresses, and they have many other uses.
4. WHAT IS ROBOTS.TXT?
Robots.txt is a plain text file that you upload to
the root directory of your site. When the web
spiders (also called bots, crawlers, or indexers) that
index your pages visit your site, they fetch and
process that text file first. Put differently, robots.txt
tells the spider which pages it may and may not crawl.
5. THE SIMPLEST VERSION OF ROBOTS.TXT
User-agent: *
Disallow:
The first line, “User-agent: *”, indicates
that the following lines apply to all agents.
Leaving “Disallow:” empty means that nothing is
restricted. This robots.txt file therefore does
nothing: it allows all types of robots to see
everything on the site.
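The effect of this permissive file can be checked with Python's standard-library robots.txt parser. This is a small sketch: the rules are fed in directly via parse(), so no website is contacted, and the bot name and URL are made-up placeholders.

```python
# Check what an allow-all robots.txt permits, using only the standard library.
from urllib.robotparser import RobotFileParser

# The permissive robots.txt from the slide above: every agent, nothing disallowed.
rules = """
User-agent: *
Disallow:
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# With an empty Disallow, any agent may fetch any path.
print(parser.can_fetch("MyBot", "http://www.example.com/any/page.html"))  # True
```

In practice a crawler would call parser.set_url(".../robots.txt") and parser.read() to fetch the live file instead of parsing a string.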
6. SOME MORE EXAMPLES OF ROBOTS.TXT
To exclude all robots from the entire server
User-agent: *
Disallow: /
To allow all robots complete access
User-agent: *
Disallow:
(or just create an empty "/robots.txt" file, or don't use
one at all)
7. SOME MORE EXAMPLES OF ROBOTS.TXT
To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
To exclude a single robot
User-agent: BadBot
Disallow: /
8. SOME MORE EXAMPLES OF ROBOTS.TXT
To allow a single robot
User-agent: Googlebot
Disallow:
User-agent: *
Disallow: /
You can disallow single pages:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
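The per-agent rules above can be verified the same way with urllib.robotparser. A minimal sketch, again using inline rules and placeholder URLs, showing that Googlebot is allowed while every other agent falls through to the blocking default:

```python
# Check the "allow a single robot" rules: Googlebot may crawl,
# every other agent is blocked by the catch-all entry.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("Googlebot", "http://www.example.com/page.html"))  # True
print(parser.can_fetch("BadBot", "http://www.example.com/page.html"))     # False
```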
9. SOME MORE EXAMPLES OF ROBOTS.TXT
You can specify the Sitemap location in your
robots.txt file
User-agent: *
Disallow: /
Sitemap: http://www.example.com/sitemap.xml
10. ABOUT THE ROBOTS <META> TAG
You can use a special HTML <META> tag to tell
robots not to index the content of a page, and/or
not to follow its links.
<html>
<head>
<title>...</title>
<META NAME="ROBOTS"
CONTENT="NOINDEX, NOFOLLOW">
</head>
<body>...</body>
</html>
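A sketch of how a crawler might read this tag, using Python's standard html.parser. The class name and the sample page below are made up for illustration:

```python
# Extract the directives from a robots <META> tag.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the CONTENT values of any <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            # Directives are comma-separated, e.g. "NOINDEX, NOFOLLOW".
            content = attrs.get("content", "")
            self.directives += [d.strip().upper() for d in content.split(",")]

page = """<html><head><title>Example</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head><body></body></html>"""

parser = RobotsMetaParser()
parser.feed(page)
print(parser.directives)  # ['NOINDEX', 'NOFOLLOW']
```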
12. WHAT ARE SITEMAPS?
A Sitemap is an XML file that lists the URLs of a
site along with additional metadata about each
URL:
when it was last updated
how often it usually changes
how important it is, relative to other URLs in the site
It tells search engines which pages are available
for crawling.
13. SITEMAPS XML FORMAT
The Sitemap must:
Begin with an opening <urlset> tag and end with a
closing </urlset> tag.
Specify the namespace (protocol standard) within the
<urlset> tag.
Include a <url> entry for each URL, as a parent XML
tag.
Include a <loc> child entry for each <url> parent tag.
All URLs in a Sitemap must be from a single host, such
as www.example.com or store.example.com.
Sitemap file must be UTF-8 encoded
No more than 50,000 URLs
File must not be larger than 10MB
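Putting the rules above together, a minimal Sitemap looks like the sketch below. The URL and metadata values are placeholders; only <loc> is required inside each <url> entry.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2012-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```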
15. USING SITEMAP INDEX FILES (TO GROUP
MULTIPLE SITEMAP FILES)
The Sitemap index file must:
Begin with an opening <sitemapindex> tag and end with a
closing </sitemapindex> tag.
Include a <sitemap> entry for each Sitemap as a parent
XML tag.
Include a <loc> child entry for each <sitemap> parent tag.
The optional <lastmod> tag is also available for Sitemap
index files.
Note: A Sitemap index file can only specify Sitemaps
that are found on the same site as the Sitemap index
file. For example,
http://www.yoursite.com/sitemap_index.xml can include
Sitemaps on http://www.yoursite.com but not on
http://www.example.com or
http://yourhost.yoursite.com.
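Following the rules above, a Sitemap index file looks like this sketch (the file names are placeholders, and <lastmod> is optional):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.yoursite.com/sitemap1.xml</loc>
    <lastmod>2012-01-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.yoursite.com/sitemap2.xml</loc>
  </sitemap>
</sitemapindex>
```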
17. SITEMAP FILE LOCATION
The location of a Sitemap file determines the
set of URLs that can be included in that
Sitemap. A Sitemap file located at
http://example.com/catalog/sitemap.xml can
include any URLs starting with
http://example.com/catalog/ but cannot
include URLs starting with
http://example.com/images/.
18. THANK YOU
ADITYA TODAWAL
PROJECT COORDINATOR (SEO)
SEARCH RESULTS MEDIA – INTERNET MARKETING TORONTO