2. Introduction
• In the early days of the Internet, anonymous FTP sites rose in popularity; users downloaded the files they needed from them.
• The first search engine: ARCHIE
Created in 1990, it downloaded directory listings of all files on anonymous FTP sites and built a searchable database from them.
3. Google
• Became popular around 2001.
• Introduced the important concepts of “link popularity” and “PageRank”.
Yahoo!
• Prior to 2004, Yahoo! used Google to provide users with search results.
• Launched its own search engine in 2004.
• Used technologies from Inktomi and AltaVista, which Yahoo! acquired.
4. MSN Search
• The most recent of these search engines, owned by Microsoft.
• Increasing in popularity.
• Windows Live Search --- its new search platform.
5. Search Engine Defined
“A search engine is a software program that helps in locating information stored on a computer system, typically on the World Wide Web.”
They are of two types:
I. Crawler-based
II. Human-powered
6. Crawler-Based Search Engines
• Create their listings automatically, e.g. GOOGLE, YAHOO.
• Crawl (or “spider”) the web to create a directory of information.
• When changes are made to a page, such search engines find those changes automatically.
7. • Human-Powered Directories
Depend on humans for the creation of the directory.
• Hybrid Search Engines
Can accept both types of results:
– those based on web crawlers
– those based on human-powered listings
8. What is WebCrawler, basically?
A single piece of software with two different functions:
• Building indexes of web pages.
• Navigating the web automatically on demand.
9. KEY DESIGN GOALS
• Content-based indexing.
• Breadth-first search to create a broad index.
• Crawler behavior that includes as many web servers as possible.
10. Components in WebCrawler
retrieving documents from the web
under the control of search engine =>
front end for Crawler
Start with the known
set of documents
access contents using
different protocol
handling the query
processing service
document metadata
hyperlinks
11. Web viewed as a Graph
Web site
Main page
pointers
Sub pages
NODE
12. Algorithm
• Select a URL from the set of candidates.
• Download the associated web page.
• Extract the URLs contained therein.
• Add those URLs that have not been encountered before to the candidate set.
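The four steps above can be sketched as a simple loop. This is only an illustrative sketch: real downloading and link extraction are replaced here by an in-memory link graph with hypothetical URLs, not actual HTTP fetches.

```python
from collections import deque

def crawl(seed_urls, get_links):
    """Basic crawl loop: select a candidate URL, 'download' it,
    extract its links, and add unseen links to the candidate set."""
    candidates = deque(seed_urls)      # URLs waiting to be fetched
    seen = set(seed_urls)              # URLs already encountered
    visited_order = []
    while candidates:
        url = candidates.popleft()     # select a URL from the candidates
        visited_order.append(url)      # stand-in for downloading the page
        for link in get_links(url):    # extract the URLs contained therein
            if link not in seen:       # add only previously unseen URLs
                seen.add(link)
                candidates.append(link)
    return visited_order

# Toy link graph standing in for real pages (hypothetical URLs):
graph = {
    "a.com": ["a.com/x", "b.com"],
    "a.com/x": ["a.com"],
    "b.com": ["b.com/y"],
    "b.com/y": [],
}
order = crawl(["a.com"], lambda u: graph.get(u, []))
# Visits each page exactly once, in breadth-first order.
```

Using a queue for the candidate set gives the breadth-first traversal mentioned in the design goals; swapping in a different candidate-selection policy changes the crawl order without touching the rest of the loop.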
15. Performance and Reliability Considerations
• Need to fetch many pages at the same time
– to fully utilize the network bandwidth
• Highly concurrent and parallelized DNS lookups
• Use of asynchronous sockets
– Polling sockets to check for completion of network transfers
– Multi-processing or multi-threading
• Care in URL extraction
– Eliminating duplicates to reduce redundant fetches
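Two of these points, fetching many pages concurrently and eliminating duplicate URLs first, can be sketched with a thread pool. The `fetch` function here is a hypothetical stand-in: a real crawler would perform an actual HTTP download (e.g. via asynchronous sockets, as the slide notes).

```python
import concurrent.futures

def fetch(url):
    # Hypothetical stand-in for a real HTTP download; a production
    # crawler would issue a network request here.
    return f"<html>content of {url}</html>"

def fetch_all(urls, workers=8):
    """Fetch many pages at the same time using a thread pool,
    eliminating duplicate URLs first to avoid redundant fetches."""
    unique = list(dict.fromkeys(urls))   # de-duplicate, preserving order
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        pages = dict(zip(unique, pool.map(fetch, unique)))
    return pages

# The duplicate "a.com" entry is fetched only once:
pages = fetch_all(["a.com", "b.com", "a.com"])
```

Threads keep the sketch short; the same structure applies to multi-processing or an event loop polling asynchronous sockets.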
16. WebCrawler: Indexing Mode
• Tries to build an index of as much of the web as possible.
• Some heuristics are used:
– Which documents should be selected if the space for storing indices is limited (e.g. room to save only 100 pages)?
• A reasonable approach is to ensure that documents come from as many different servers as possible.
• WebCrawler uses a modified breadth-first search approach to ensure that every server has at least one document that has been indexed.
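One way to picture this server-diversity heuristic is round-robin selection by server: under a fixed page budget, every server contributes one document before any server contributes a second. This is a sketch of the idea, not WebCrawler's actual implementation; the URLs and budget are made up.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def select_for_index(urls, budget):
    """Pick up to `budget` URLs so that as many different servers as
    possible each contribute at least one indexed document."""
    by_server = defaultdict(list)
    for url in urls:
        by_server[urlsplit(url).netloc].append(url)
    picked = []
    queues = list(by_server.values())
    while queues and len(picked) < budget:
        next_round = []
        for q in queues:                  # one pass = one URL per server
            if len(picked) >= budget:
                break
            picked.append(q.pop(0))
            if q:
                next_round.append(q)      # server still has URLs left
        queues = next_round
    return picked

urls = ["http://a.com/1", "http://a.com/2", "http://a.com/3",
        "http://b.com/1", "http://c.com/1"]
# With a budget of 3, each of the three servers gets one slot,
# even though a.com alone has three candidate pages.
chosen = select_for_index(urls, budget=3)
```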
17. WebCrawler: Real-Time Search
• Basic motivation:
Given a user’s query, try to find the documents that most closely match it.
• WebCrawler uses a different search algorithm here.
• Intuitive reasoning:
– If we follow the links from a document that is similar to what the user is looking for, they will most likely lead to relevant documents.
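That intuition suggests a best-first search: always expand the candidate document most similar to the query, since its links are the most likely to lead to further relevant documents. The sketch below assumes a toy word-overlap similarity and an in-memory set of pages with hypothetical names; WebCrawler's real scoring and fetching are more involved.

```python
import heapq

def similarity(query, text):
    # Crude stand-in for a relevance score: the fraction of
    # query words that appear in the document text.
    words = query.lower().split()
    return sum(w in text.lower() for w in words) / len(words)

def realtime_search(query, seeds, get_page, get_links, max_pages=10):
    """Best-first search: repeatedly expand the candidate document
    most similar to the query and enqueue its outgoing links."""
    heap = [(-similarity(query, get_page(u)), u) for u in seeds]
    heapq.heapify(heap)                  # max-heap via negated scores
    seen = set(seeds)
    results = []
    while heap and len(results) < max_pages:
        neg_score, url = heapq.heappop(heap)
        results.append((url, -neg_score))
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(heap, (-similarity(query, get_page(link)), link))
    return results

# Toy pages and links standing in for the live web (hypothetical names):
pages = {"s": "home page", "p1": "web crawler design",
         "p2": "cooking tips", "p3": "crawler index"}
links = {"s": ["p1", "p2"], "p1": ["p3"], "p2": [], "p3": []}
# Starting from "s", the relevant page p1 is expanded before the
# irrelevant p2, so p3 is reached early.
found = realtime_search("crawler", ["s"], pages.get, links.get)
```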
18. Applications
• Search Engine Indexing
• Statistical Analysis
• Maintenance of Hypertext Structure
(URL and link validation)
• Resource Discovery
• Attributer
– A service that mines the web for copyright violations