SlideShare ist ein Scribd-Unternehmen logo
1 von 4
Downloaden Sie, um offline zu lesen
IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661, p- ISSN: 2278-8727Volume 15, Issue 6 (Nov. - Dec. 2013), PP 34-37
www.iosrjournals.org
www.iosrjournals.org 34 | Page
Design Issues for Search Engines and Web Crawlers: A Review
Deepak Kumar, Aditya Kumar
WCTM, Gurgaon, Haryana,
Abstract: The World Wide Web is a huge source of hyperlinked information contained in hypertext documents.
Search engines use web crawlers to collect these web documents from web for storage and indexing. The prompt
growth of the World Wide Web has posed incomparable challenges for the designers of search engines and web
crawlers; that help users to retrieve web pages in a reasonable amount of time. In this paper, a review on need
and working of a search engine, and role of a web crawler is being presented.
Key words: Internet, www, search engine, types, design issues, web crawlers.
I. Introduction
Today Internet [1,5] has become the universal source of information. Within a span of few years, it has
changed the way we business and communicate. It has given a worldwide platform to us. The World Wide Web
(WWW or web) [1, 2, 4, 12, 13] is a hyperlinked repository of hypertext documents lying in different websites
distributed over far end distant geographical locations.
There is unbelievable growth in the number of web documents and it is estimated that currently there
are millions of web servers around the world hosting trillions of web pages. With such a huge web repository,
finding the right information at the right time is a very challenging task. Hence, there is a need for a more
effective way of retrieving information from the web.
Search engines [3,7,8,9,11,12,13,14] operate as a link between web users and web documents. Without
search engines, this vast source of information in web pages remain veiled for us. A search engine is a
searchable database which collects information from web pages on the Internet, indexes the information and then
stores the result in a huge database where from it can be searched quickly.
II. Search Engines
A general web search engine (Fig. 1) has three parts [3,10,11,12,13,14] i.e. Crawler, Indexer and Query
engine. The web crawler (also called robot, spider, worm, walker or wanderer) is a module that searches the web
pages from the web world. These are small programs that peruse the web on the search engine's behalf, and
follow links to reach different pages. Starting with a set of seed URLs, crawlers extract URLs appearing in the
retrieved pages, and store pages in a repository database.
The indexer extracts all the words from each page and records the URL where each word has occurred.
The result is stored in a large table containing URLs; pointing to pages in the repository where a given word
occurs.
Indexing methods used in web database creation are full text indexing, keyword indexing and human
indexing. Full text indexing is where every word on the page is put into a database for searching. It helps user to
find every example in response to a specific name or terminology. A general topic search will not be very useful
in the database and one has to dig through a lot of false drops.
In keyword indexing only important words or phrases are put into a database. In human indexing a
person examines the page and determines a very few key phrases that describes it. This allows the user to find a
good start of works on a topic, assuming that the topic was picked by the human as something that describes the
page.
The query engine is responsible for receiving and filling search requests from users. It relies on the
indexes and on the repository. Because of the web's size, and the fact that users typically only enter one or two
keywords, result sets are usually very large.
Fig. 1 : Architecture of a typical web search engine
Design Issues for Search Engines and Web Crawlers: A Review
www.iosrjournals.org 35 | Page
2.1 Types of Search Engines
Search engines are good at finding unique keywords, phrases, quotes, and information contained in the
full text of web pages. Search engines allow user to enter keywords and then search for them in its table
followed by database. Various types [9] of search engines available are Crawler based search engines, Human
powered directories, Meta search engines and Hybrid search engines.
i. Crawler based search engines :
Crawler based search engines create their listings (that make up the search engine's index or catalog)
automatically with the help of web crawlers. It uses a computer algorithm to rank all pages retrieved. Such
search engines are huge and often retrieve a lot of information. For complex searches, it allows to search within
the results of previous search and enables us to refine search results. Such types of search engines contain full
text of web pages they link to. Here, one can find pages by matching words in the pages user wants.
ii. Human powered directories :
Human powered directories are built by human selection i.e. they depend on humans to create
repository. They are organized into subject categories and classification of pages is done by subjects. Such
directories do not contain full text of the web page they link to, and are smaller than most search engines.
iii. Hybrid search engine :
Hybrid search engines differ from traditional text oriented and directory based search engines. These
search engines typically favor one type of listing over the other. Many search engines today combine a crawler
based search engine with a directory service.
iv. Meta search engines :
Meta search engines accumulate search and screen the results of multiple primary search engines.
Unlike search engines, meta crawlers don't crawl the web themselves to build listings. Instead, they allow
searches to be sent to several search engines all at once. The results are then blended together onto one page.
Based on the application for which search engines are used, they can be categorized as follows:-
i. Primary search engines scan entire sections of the www and produce their results from databases of web
page content, automatically created by computers.
ii. Business and Services search engines essentially National yellow page directories.
iii. Employment and Job search engines either provide potential employers access to resumes of people
interested in working for them or provide prospective employees with information on job availability.
iv. Finance-oriented search engines facilitate searches for specific information about companies (officers,
annual reports etc.).
v. Image search engines help us search the www for images of all kinds.
vi. News search engines search newspaper and news web site archives for the selected information.
vii. People search engines search for names, addresses, telephone numbers and e-mail addresses.
viii. Subject guides are like indexes in the back of a book. They involve human intervention in selecting and
organizing resources, so they cover fewer resources and topics but provide more focus and guidance.
ix. Specialized search engines search specialized databases, allow users to enter search terms in a particularly
easy way, look for low prices on items they are interested in purchasing, and even give users access to real,
live human beings to answer questions.
2.2 Design issues
Designing a search engine [7,8,9,10,11] is a tricky task and there are differences in the ways they work.
In common they all perform three basic tasks; search the Internet based on important words, keep an index of the
words they find and where they find them, and allow users to look for words (or combinations of words) found
in that index.
There are different ways to improve performance of search engines but three main characteristics are
improving algorithms to search the web, using filters towards the user’s results; and improving the user interface
for query input. Factors that determine the quality of a search engine are freshness of contents, index quality,
search features, retrieval system and user behaviour. Few more issues are :-
i. Primary goals:
The primary goal of a search engine is to provide high quality search results over a rapidly growing
world wide web. The most important appraise for a search engine is its search performance, quality of the
results, ability to crawl and indexing the web efficiently.
Design Issues for Search Engines and Web Crawlers: A Review
www.iosrjournals.org 36 | Page
ii. Diversity of documents:
On the web, documents are written in several different languages. Documents need to be indexed in a
way which allows it to search for documents written in diverse languages with just one query. Search engines
should be able to index documents written in multiple formats, as each file format provides certain difficulties
for the search engines.
iii. Behaviour of web users:
Web users are very heterogeneous and search engines are used by professionals as well as by laymen.
The search engine needs to be enough smart to cater for both types of users. Moreover, there is a tendency that
users often only look at the results set that can be seen without scrolling, and the results which are not on first
page are nearly invisible for the general user.
iv. Freshness of database :
Search engines find problems in keeping its database up-to-date with the entire web because of its
enormous size and the different update cycles of individual websites.
Other characteristics [10] that a large search engine is expected to have are scalability, high
performance, politeness, continuity, extensibility and portability.
III. Web Crawlers
Web crawler (Fig. 2) is a program that fetches information from www in an automated manner. The
objective of a web crawler is to maintain freshness [6] of pages in its collection as high as possible. To download
a document, a crawler starts by placing an initial set of seed URLs in a queue, where all URLs to be retrieved are
kept and prioritized. From this queue the crawler extracts a URL, downloads the page, extracts URLs from the
downloaded page, and places the new URLs in the queue. This process is repeated and the collected pages are
used by a search engine. The browser parses the document and makes it available to the user.
The general algorithm of a web crawler is given below :
Begin
Read a URL from the set of seed URLs;
Determine the IP address for the host name;
Download the Robot.txt file which carries downloading permissions and also specifies the files to be excluded
by the crawler;
Determine the protocol of underlying host like http, ftp, gopher etc.;
Based on the protocol of the host, download the document;
Identify the document format like doc, html, or pdf etc.;
Check whether the document has already been downloaded or not;
If the document is fresh one then
Read it and extract the links or references to the other cites from that documents;
else
Continue;
Convert the URL links into their absolute URL equivalents;
Add the URLs to set of seed URLs;
End.
A web crawler deals with two main issues, a good crawling strategy for deciding which pages to download next
and, to have a highly optimized system architecture that can download a large number of pages per second. On
other side it has to be robust against crashes, manageable, and considerate of resources and web servers.
Fig. 3.1 : General architecture of a web crawler
Design Issues for Search Engines and Web Crawlers: A Review
www.iosrjournals.org 37 | Page
Several available web crawling techniques are Parallel crawler, Distributed crawler, Focussed crawler, Hidden
crawler, Incremental crawler, that differs in their mechanism, implementation and objective.
IV. Conclusion and future research
In this paper we conclude that designing a search engine to crawl the complete web is a nontrivial task.
They have to be enough committed to perform the crawling process efficiently and reliably. Besides some design
issues of a search engine, we also discussed that a web crawler supports the search engine to update its database
to improve quality of contents retrieved by the user. From several available web crawling techniques,
appropriate one may be used to crawl the web while designing a search engine.
References
[1]. A.K. Sharma, J. P. Gupta, D. P. Agarwal, “Augment Hypertext Documents suitable for parallel crawlers” , accepted for presentation
and inclusion in the proceedings of WITSA-2003, a National workshop on Information Technology Services and Applications,
Feb’2003, New Delhi.
[2]. Alexandros Ntoulas, Junghoo Cho, Christopher Olston, "What's New on the Web ? The Evolution of the Web from a Search Engine
Perspective", In Proceedings of the World-Wide Web Conference (WWW), May 2004.
[3]. Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan, "Searching the Web", ACM Transactions
on Internet Technology, 1(1): August 2001.
[4]. Baldi, Pierre, “Modeling the Internet and the Web: Probabilistic Methods and Algorithms”, 2003.
[5]. Barry M. Leiner, Vinton G. Cerf, David D. Clark, Robert E. Kahn, Leonard, Kleinrock, Daniel C. Lynch, Jon
[6]. Postel, Larry G. Roberts, Stephen Wolff, “A Brief History of the Internet”, www.isoc.org/internet/his tory,
[7]. Brian E. Brewington, George Cybenko, “How dynamic i s the web.”, In Proceedings of the Ninth Internationa l World-Wide Web
Conference, Amsterdam, Netherlands, May 2000.
[8]. Dirk Lewandowski, “Web searching, search engines an d Information Retrieval, Information Services & Use”, 25 (2005) 137-147,
IOS Press, 2005.
[9]. Franklin, Curt, “How Internet Search Engines Work”, 2002, www.howstuffworks.com.
[10]. Grossan. B, “Search Engines : What they are, how th ey work, and practical suggestions for getting the most out of them”, Feb97,
webreference.com.
[11]. Heydon A., Najork M., “Mercator: A scalable, extens ible Web crawler.”, World Wide Web, vol. 2, no. 4, pp. 2 19-229, 1999.
[12]. Mark Najork, Allan Heydon, “High- Performance Web Crawling”, September 2001.
[13]. Mike, Burner, “Crawling towards Eternity : Building an archive of the World Wide Web”, Web Techniques Magazine, 2(5), May
1997.
[14]. Niraj Singhal, Ashutosh Dixit, “Retrieving Informat ion from the Web and Search Engine Application”, in proceedings of National
Conference on “Emerging Trends in Software and Network Techniques– ETSNT’09”, Amity University, Noida, India, Apr 2009 .
[15]. Sergey Brin, Lawrence Page, “The anatomy of a large - scale hyper textual Web search engine”, Proceedings of the Seventh
International World Wide Web Conference, pages 107-117, April 1998.

Weitere ähnliche Inhalte

Was ist angesagt?

Web mining
Web miningWeb mining
Web miningSilicon
 
Comparative Analysis of Collaborative Filtering Technique
Comparative Analysis of Collaborative Filtering TechniqueComparative Analysis of Collaborative Filtering Technique
Comparative Analysis of Collaborative Filtering TechniqueIOSR Journals
 
Literature Survey on Web Mining
Literature Survey on Web MiningLiterature Survey on Web Mining
Literature Survey on Web MiningIOSR Journals
 
IRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search ResultsIRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search ResultsIRJET Journal
 
TEXT ANALYZER
TEXT ANALYZER TEXT ANALYZER
TEXT ANALYZER ijcseit
 
Web Page Recommendation Using Web Mining
Web Page Recommendation Using Web MiningWeb Page Recommendation Using Web Mining
Web Page Recommendation Using Web MiningIJERA Editor
 
Classification of search_engine
Classification of search_engineClassification of search_engine
Classification of search_engineBookStoreLib
 
Data mining in web search engine optimization
Data mining in web search engine optimizationData mining in web search engine optimization
Data mining in web search engine optimizationBookStoreLib
 
Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey  Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey dannyijwest
 
Internet browsing techniques
Internet browsing techniquesInternet browsing techniques
Internet browsing techniquesTola Odugbesan
 
RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING ijcax
 

Was ist angesagt? (15)

E3602042044
E3602042044E3602042044
E3602042044
 
Pxc3893553
Pxc3893553Pxc3893553
Pxc3893553
 
Web mining
Web miningWeb mining
Web mining
 
Comparative Analysis of Collaborative Filtering Technique
Comparative Analysis of Collaborative Filtering TechniqueComparative Analysis of Collaborative Filtering Technique
Comparative Analysis of Collaborative Filtering Technique
 
Literature Survey on Web Mining
Literature Survey on Web MiningLiterature Survey on Web Mining
Literature Survey on Web Mining
 
Web mining
Web miningWeb mining
Web mining
 
IRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search ResultsIRJET - Re-Ranking of Google Search Results
IRJET - Re-Ranking of Google Search Results
 
TEXT ANALYZER
TEXT ANALYZER TEXT ANALYZER
TEXT ANALYZER
 
Aa03401490154
Aa03401490154Aa03401490154
Aa03401490154
 
Web Page Recommendation Using Web Mining
Web Page Recommendation Using Web MiningWeb Page Recommendation Using Web Mining
Web Page Recommendation Using Web Mining
 
Classification of search_engine
Classification of search_engineClassification of search_engine
Classification of search_engine
 
Data mining in web search engine optimization
Data mining in web search engine optimizationData mining in web search engine optimization
Data mining in web search engine optimization
 
Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey  Intelligent Semantic Web Search Engines: A Brief Survey
Intelligent Semantic Web Search Engines: A Brief Survey
 
Internet browsing techniques
Internet browsing techniquesInternet browsing techniques
Internet browsing techniques
 
RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING RESEARCH ISSUES IN WEB MINING
RESEARCH ISSUES IN WEB MINING
 

Andere mochten auch

Interfacing of Java 3D objects for Virtual Physics Lab (VPLab) Setup for enco...
Interfacing of Java 3D objects for Virtual Physics Lab (VPLab) Setup for enco...Interfacing of Java 3D objects for Virtual Physics Lab (VPLab) Setup for enco...
Interfacing of Java 3D objects for Virtual Physics Lab (VPLab) Setup for enco...IOSR Journals
 
Unification Scheme in Double Radio Sources: Bend Angle versus Arm Length Rati...
Unification Scheme in Double Radio Sources: Bend Angle versus Arm Length Rati...Unification Scheme in Double Radio Sources: Bend Angle versus Arm Length Rati...
Unification Scheme in Double Radio Sources: Bend Angle versus Arm Length Rati...IOSR Journals
 
Fluorescence quenching of 5-methyl-2-phenylindole (MPI) by carbon tetrachlori...
Fluorescence quenching of 5-methyl-2-phenylindole (MPI) by carbon tetrachlori...Fluorescence quenching of 5-methyl-2-phenylindole (MPI) by carbon tetrachlori...
Fluorescence quenching of 5-methyl-2-phenylindole (MPI) by carbon tetrachlori...IOSR Journals
 
Modeling and qualitative analysis of malaria epidemiology
Modeling and qualitative analysis of malaria epidemiologyModeling and qualitative analysis of malaria epidemiology
Modeling and qualitative analysis of malaria epidemiologyIOSR Journals
 
Significance of Solomon four group pretest-posttest method in True Experiment...
Significance of Solomon four group pretest-posttest method in True Experiment...Significance of Solomon four group pretest-posttest method in True Experiment...
Significance of Solomon four group pretest-posttest method in True Experiment...IOSR Journals
 
Design and Implementation of New Encryption algorithm to Enhance Performance...
Design and Implementation of New Encryption algorithm to  Enhance Performance...Design and Implementation of New Encryption algorithm to  Enhance Performance...
Design and Implementation of New Encryption algorithm to Enhance Performance...IOSR Journals
 
In Vitro Seed Germination Studies and Flowering in Micropropagated Plantlets ...
In Vitro Seed Germination Studies and Flowering in Micropropagated Plantlets ...In Vitro Seed Germination Studies and Flowering in Micropropagated Plantlets ...
In Vitro Seed Germination Studies and Flowering in Micropropagated Plantlets ...IOSR Journals
 
A comparison between M/M/1 and M/D/1 queuing models to vehicular traffic atKa...
A comparison between M/M/1 and M/D/1 queuing models to vehicular traffic atKa...A comparison between M/M/1 and M/D/1 queuing models to vehicular traffic atKa...
A comparison between M/M/1 and M/D/1 queuing models to vehicular traffic atKa...IOSR Journals
 

Andere mochten auch (20)

D010312127
D010312127D010312127
D010312127
 
F017112934
F017112934F017112934
F017112934
 
E0731929
E0731929E0731929
E0731929
 
Interfacing of Java 3D objects for Virtual Physics Lab (VPLab) Setup for enco...
Interfacing of Java 3D objects for Virtual Physics Lab (VPLab) Setup for enco...Interfacing of Java 3D objects for Virtual Physics Lab (VPLab) Setup for enco...
Interfacing of Java 3D objects for Virtual Physics Lab (VPLab) Setup for enco...
 
Unification Scheme in Double Radio Sources: Bend Angle versus Arm Length Rati...
Unification Scheme in Double Radio Sources: Bend Angle versus Arm Length Rati...Unification Scheme in Double Radio Sources: Bend Angle versus Arm Length Rati...
Unification Scheme in Double Radio Sources: Bend Angle versus Arm Length Rati...
 
Fluorescence quenching of 5-methyl-2-phenylindole (MPI) by carbon tetrachlori...
Fluorescence quenching of 5-methyl-2-phenylindole (MPI) by carbon tetrachlori...Fluorescence quenching of 5-methyl-2-phenylindole (MPI) by carbon tetrachlori...
Fluorescence quenching of 5-methyl-2-phenylindole (MPI) by carbon tetrachlori...
 
Modeling and qualitative analysis of malaria epidemiology
Modeling and qualitative analysis of malaria epidemiologyModeling and qualitative analysis of malaria epidemiology
Modeling and qualitative analysis of malaria epidemiology
 
Significance of Solomon four group pretest-posttest method in True Experiment...
Significance of Solomon four group pretest-posttest method in True Experiment...Significance of Solomon four group pretest-posttest method in True Experiment...
Significance of Solomon four group pretest-posttest method in True Experiment...
 
A010210109
A010210109A010210109
A010210109
 
A1103010108
A1103010108A1103010108
A1103010108
 
Design and Implementation of New Encryption algorithm to Enhance Performance...
Design and Implementation of New Encryption algorithm to  Enhance Performance...Design and Implementation of New Encryption algorithm to  Enhance Performance...
Design and Implementation of New Encryption algorithm to Enhance Performance...
 
In Vitro Seed Germination Studies and Flowering in Micropropagated Plantlets ...
In Vitro Seed Germination Studies and Flowering in Micropropagated Plantlets ...In Vitro Seed Germination Studies and Flowering in Micropropagated Plantlets ...
In Vitro Seed Germination Studies and Flowering in Micropropagated Plantlets ...
 
W01761157162
W01761157162W01761157162
W01761157162
 
I012116164
I012116164I012116164
I012116164
 
K017346572
K017346572K017346572
K017346572
 
J012265658
J012265658J012265658
J012265658
 
A comparison between M/M/1 and M/D/1 queuing models to vehicular traffic atKa...
A comparison between M/M/1 and M/D/1 queuing models to vehicular traffic atKa...A comparison between M/M/1 and M/D/1 queuing models to vehicular traffic atKa...
A comparison between M/M/1 and M/D/1 queuing models to vehicular traffic atKa...
 
F017264143
F017264143F017264143
F017264143
 
L017146571
L017146571L017146571
L017146571
 
D1803041822
D1803041822D1803041822
D1803041822
 

Ähnlich wie Design Issues for Search Engines and Web Crawlers

Effective Searching Policies for Web Crawler
Effective Searching Policies for Web CrawlerEffective Searching Policies for Web Crawler
Effective Searching Policies for Web CrawlerIJMER
 
A Two Stage Crawler on Web Search using Site Ranker for Adaptive Learning
A Two Stage Crawler on Web Search using Site Ranker for Adaptive LearningA Two Stage Crawler on Web Search using Site Ranker for Adaptive Learning
A Two Stage Crawler on Web Search using Site Ranker for Adaptive LearningIJMTST Journal
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawlerishmecse13
 
Notes for
Notes forNotes for
Notes for9pallen
 
How search engine works
How search engine worksHow search engine works
How search engine worksAshraf Ali
 
`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areasinventionjournals
 
Search Engines Other than Google
Search Engines Other than GoogleSearch Engines Other than Google
Search Engines Other than GoogleDr Trivedi
 
Using Exclusive Web Crawlers to Store Better Results in Search Engines' Database
Using Exclusive Web Crawlers to Store Better Results in Search Engines' DatabaseUsing Exclusive Web Crawlers to Store Better Results in Search Engines' Database
Using Exclusive Web Crawlers to Store Better Results in Search Engines' DatabaseIJwest
 
Search engines by Gulshan K Maheshwari(QAU)
Search engines by Gulshan  K Maheshwari(QAU)Search engines by Gulshan  K Maheshwari(QAU)
Search engines by Gulshan K Maheshwari(QAU)GulshanKumar368
 
HIGWGET-A Model for Crawling Secure Hidden WebPages
HIGWGET-A Model for Crawling Secure Hidden WebPagesHIGWGET-A Model for Crawling Secure Hidden WebPages
HIGWGET-A Model for Crawling Secure Hidden WebPagesijdkp
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...butest
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...butest
 
Focused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domainsFocused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domainseSAT Publishing House
 
Focused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domainsFocused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domainseSAT Journals
 

Ähnlich wie Design Issues for Search Engines and Web Crawlers (20)

G017254554
G017254554G017254554
G017254554
 
DWM-MODULE 6.pdf
DWM-MODULE 6.pdfDWM-MODULE 6.pdf
DWM-MODULE 6.pdf
 
Effective Searching Policies for Web Crawler
Effective Searching Policies for Web CrawlerEffective Searching Policies for Web Crawler
Effective Searching Policies for Web Crawler
 
Search engine
Search engineSearch engine
Search engine
 
A Two Stage Crawler on Web Search using Site Ranker for Adaptive Learning
A Two Stage Crawler on Web Search using Site Ranker for Adaptive LearningA Two Stage Crawler on Web Search using Site Ranker for Adaptive Learning
A Two Stage Crawler on Web Search using Site Ranker for Adaptive Learning
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
Notes for
Notes forNotes for
Notes for
 
How search engine works
How search engine worksHow search engine works
How search engine works
 
`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas`A Survey on approaches of Web Mining in Varied Areas
`A Survey on approaches of Web Mining in Varied Areas
 
SearchEngine.pptx
SearchEngine.pptxSearchEngine.pptx
SearchEngine.pptx
 
Search Engines Other than Google
Search Engines Other than GoogleSearch Engines Other than Google
Search Engines Other than Google
 
Using Exclusive Web Crawlers to Store Better Results in Search Engines' Database
Using Exclusive Web Crawlers to Store Better Results in Search Engines' DatabaseUsing Exclusive Web Crawlers to Store Better Results in Search Engines' Database
Using Exclusive Web Crawlers to Store Better Results in Search Engines' Database
 
Search engines by Gulshan K Maheshwari(QAU)
Search engines by Gulshan  K Maheshwari(QAU)Search engines by Gulshan  K Maheshwari(QAU)
Search engines by Gulshan K Maheshwari(QAU)
 
HIGWGET-A Model for Crawling Secure Hidden WebPages
HIGWGET-A Model for Crawling Secure Hidden WebPagesHIGWGET-A Model for Crawling Secure Hidden WebPages
HIGWGET-A Model for Crawling Secure Hidden WebPages
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...
 
Seo
SeoSeo
Seo
 
Focused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domainsFocused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domains
 
Focused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domainsFocused web crawling using named entity recognition for narrow domains
Focused web crawling using named entity recognition for narrow domains
 
Wp10
Wp10Wp10
Wp10
 

Mehr von IOSR Journals (20)

A011140104
A011140104A011140104
A011140104
 
M0111397100
M0111397100M0111397100
M0111397100
 
L011138596
L011138596L011138596
L011138596
 
K011138084
K011138084K011138084
K011138084
 
J011137479
J011137479J011137479
J011137479
 
I011136673
I011136673I011136673
I011136673
 
G011134454
G011134454G011134454
G011134454
 
H011135565
H011135565H011135565
H011135565
 
F011134043
F011134043F011134043
F011134043
 
E011133639
E011133639E011133639
E011133639
 
D011132635
D011132635D011132635
D011132635
 
C011131925
C011131925C011131925
C011131925
 
B011130918
B011130918B011130918
B011130918
 
A011130108
A011130108A011130108
A011130108
 
I011125160
I011125160I011125160
I011125160
 
H011124050
H011124050H011124050
H011124050
 
G011123539
G011123539G011123539
G011123539
 
F011123134
F011123134F011123134
F011123134
 
E011122530
E011122530E011122530
E011122530
 
D011121524
D011121524D011121524
D011121524
 

Kürzlich hochgeladen

AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesPrabhanshu Chaturvedi
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGSIVASHANKAR N
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 

Kürzlich hochgeladen (20)

AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 

Design Issues for Search Engines and Web Crawlers

  • 1. IOSR Journal of Computer Engineering (IOSR-JCE) e-ISSN: 2278-0661, p- ISSN: 2278-8727Volume 15, Issue 6 (Nov. - Dec. 2013), PP 34-37 www.iosrjournals.org www.iosrjournals.org 34 | Page Design Issues for Search Engines and Web Crawlers: A Review Deepak Kumar, Aditya Kumar WCTM, Gurgaon, Haryana, Abstract: The World Wide Web is a huge source of hyperlinked information contained in hypertext documents. Search engines use web crawlers to collect these web documents from web for storage and indexing. The prompt growth of the World Wide Web has posed incomparable challenges for the designers of search engines and web crawlers; that help users to retrieve web pages in a reasonable amount of time. In this paper, a review on need and working of a search engine, and role of a web crawler is being presented. Key words: Internet, www, search engine, types, design issues, web crawlers. I. Introduction Today Internet [1,5] has become the universal source of information. Within a span of few years, it has changed the way we business and communicate. It has given a worldwide platform to us. The World Wide Web (WWW or web) [1, 2, 4, 12, 13] is a hyperlinked repository of hypertext documents lying in different websites distributed over far end distant geographical locations. There is unbelievable growth in the number of web documents and it is estimated that currently there are millions of web servers around the world hosting trillions of web pages. With such a huge web repository, finding the right information at the right time is a very challenging task. Hence, there is a need for a more effective way of retrieving information from the web. Search engines [3,7,8,9,11,12,13,14] operate as a link between web users and web documents. Without search engines, this vast source of information in web pages remain veiled for us. A search engine is a searchable database which collects information from web pages on the Internet, indexes the information and then stores the result in a huge database where from it can be searched quickly. II. Search Engines A general web search engine (Fig. 1) has three parts [3,10,11,12,13,14] i.e. Crawler, Indexer and Query engine. The web crawler (also called robot, spider, worm, walker or wanderer) is a module that searches the web pages from the web world. These are small programs that peruse the web on the search engine's behalf, and follow links to reach different pages. Starting with a set of seed URLs, crawlers extract URLs appearing in the retrieved pages, and store pages in a repository database. The indexer extracts all the words from each page and records the URL where each word has occurred. The result is stored in a large table containing URLs; pointing to pages in the repository where a given word occurs. Indexing methods used in web database creation are full text indexing, keyword indexing and human indexing. Full text indexing is where every word on the page is put into a database for searching. It helps user to find every example in response to a specific name or terminology. A general topic search will not be very useful in the database and one has to dig through a lot of false drops. In keyword indexing only important words or phrases are put into a database. In human indexing a person examines the page and determines a very few key phrases that describes it. This allows the user to find a good start of works on a topic, assuming that the topic was picked by the human as something that describes the page. The query engine is responsible for receiving and filling search requests from users. It relies on the indexes and on the repository. Because of the web's size, and the fact that users typically only enter one or two keywords, result sets are usually very large. Fig. 1 : Architecture of a typical web search engine
  • 2. Design Issues for Search Engines and Web Crawlers: A Review www.iosrjournals.org 35 | Page 2.1 Types of Search Engines Search engines are good at finding unique keywords, phrases, quotes, and information contained in the full text of web pages. Search engines allow user to enter keywords and then search for them in its table followed by database. Various types [9] of search engines available are Crawler based search engines, Human powered directories, Meta search engines and Hybrid search engines. i. Crawler based search engines : Crawler based search engines create their listings (that make up the search engine's index or catalog) automatically with the help of web crawlers. It uses a computer algorithm to rank all pages retrieved. Such search engines are huge and often retrieve a lot of information. For complex searches, it allows to search within the results of previous search and enables us to refine search results. Such types of search engines contain full text of web pages they link to. Here, one can find pages by matching words in the pages user wants. ii. Human powered directories : Human powered directories are built by human selection i.e. they depend on humans to create repository. They are organized into subject categories and classification of pages is done by subjects. Such directories do not contain full text of the web page they link to, and are smaller than most search engines. iii. Hybrid search engine : Hybrid search engines differ from traditional text oriented and directory based search engines. These search engines typically favor one type of listing over the other. Many search engines today combine a crawler based search engine with a directory service. iv. Meta search engines : Meta search engines accumulate search and screen the results of multiple primary search engines. Unlike search engines, meta crawlers don't crawl the web themselves to build listings. Instead, they allow searches to be sent to several search engines all at once. The results are then blended together onto one page. Based on the application for which search engines are used, they can be categorized as follows:- i. Primary search engines scan entire sections of the www and produce their results from databases of web page content, automatically created by computers. ii. Business and Services search engines essentially National yellow page directories. iii. Employment and Job search engines either provide potential employers access to resumes of people interested in working for them or provide prospective employees with information on job availability. iv. Finance-oriented search engines facilitate searches for specific information about companies (officers, annual reports etc.). v. Image search engines help us search the www for images of all kinds. vi. News search engines search newspaper and news web site archives for the selected information. vii. People search engines search for names, addresses, telephone numbers and e-mail addresses. viii. Subject guides are like indexes in the back of a book. They involve human intervention in selecting and organizing resources, so they cover fewer resources and topics but provide more focus and guidance. ix. Specialized search engines search specialized databases, allow users to enter search terms in a particularly easy way, look for low prices on items they are interested in purchasing, and even give users access to real, live human beings to answer questions. 2.2 Design issues Designing a search engine [7,8,9,10,11] is a tricky task and there are differences in the ways they work. In common they all perform three basic tasks; search the Internet based on important words, keep an index of the words they find and where they find them, and allow users to look for words (or combinations of words) found in that index. There are different ways to improve performance of search engines but three main characteristics are improving algorithms to search the web, using filters towards the user’s results; and improving the user interface for query input. Factors that determine the quality of a search engine are freshness of contents, index quality, search features, retrieval system and user behaviour. Few more issues are :- i. Primary goals: The primary goal of a search engine is to provide high quality search results over a rapidly growing world wide web. The most important appraise for a search engine is its search performance, quality of the results, ability to crawl and indexing the web efficiently.
  • 3. Design Issues for Search Engines and Web Crawlers: A Review www.iosrjournals.org 36 | Page ii. Diversity of documents: On the web, documents are written in several different languages. Documents need to be indexed in a way which allows it to search for documents written in diverse languages with just one query. Search engines should be able to index documents written in multiple formats, as each file format provides certain difficulties for the search engines. iii. Behaviour of web users: Web users are very heterogeneous and search engines are used by professionals as well as by laymen. The search engine needs to be enough smart to cater for both types of users. Moreover, there is a tendency that users often only look at the results set that can be seen without scrolling, and the results which are not on first page are nearly invisible for the general user. iv. Freshness of database : Search engines find problems in keeping its database up-to-date with the entire web because of its enormous size and the different update cycles of individual websites. Other characteristics [10] that a large search engine is expected to have are scalability, high performance, politeness, continuity, extensibility and portability. III. Web Crawlers Web crawler (Fig. 2) is a program that fetches information from www in an automated manner. The objective of a web crawler is to maintain freshness [6] of pages in its collection as high as possible. To download a document, a crawler starts by placing an initial set of seed URLs in a queue, where all URLs to be retrieved are kept and prioritized. From this queue the crawler extracts a URL, downloads the page, extracts URLs from the downloaded page, and places the new URLs in the queue. This process is repeated and the collected pages are used by a search engine. The browser parses the document and makes it available to the user. The general algorithm of a web crawler is given below : Begin Read a URL from the set of seed URLs; Determine the IP address for the host name; Download the Robot.txt file which carries downloading permissions and also specifies the files to be excluded by the crawler; Determine the protocol of underlying host like http, ftp, gopher etc.; Based on the protocol of the host, download the document; Identify the document format like doc, html, or pdf etc.; Check whether the document has already been downloaded or not; If the document is fresh one then Read it and extract the links or references to the other cites from that documents; else Continue; Convert the URL links into their absolute URL equivalents; Add the URLs to set of seed URLs; End. A web crawler deals with two main issues, a good crawling strategy for deciding which pages to download next and, to have a highly optimized system architecture that can download a large number of pages per second. On other side it has to be robust against crashes, manageable, and considerate of resources and web servers. Fig. 3.1 : General architecture of a web crawler
  • 4. Design Issues for Search Engines and Web Crawlers: A Review www.iosrjournals.org 37 | Page Several available web crawling techniques are Parallel crawler, Distributed crawler, Focussed crawler, Hidden crawler, Incremental crawler, that differs in their mechanism, implementation and objective. IV. Conclusion and future research In this paper we conclude that designing a search engine to crawl the complete web is a nontrivial task. They have to be enough committed to perform the crawling process efficiently and reliably. Besides some design issues of a search engine, we also discussed that a web crawler supports the search engine to update its database to improve quality of contents retrieved by the user. From several available web crawling techniques, appropriate one may be used to crawl the web while designing a search engine. References [1]. A.K. Sharma, J. P. Gupta, D. P. Agarwal, “Augment Hypertext Documents suitable for parallel crawlers” , accepted for presentation and inclusion in the proceedings of WITSA-2003, a National workshop on Information Technology Services and Applications, Feb’2003, New Delhi. [2]. Alexandros Ntoulas, Junghoo Cho, Christopher Olston, "What's New on the Web ? The Evolution of the Web from a Search Engine Perspective", In Proceedings of the World-Wide Web Conference (WWW), May 2004. [3]. Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan, "Searching the Web", ACM Transactions on Internet Technology, 1(1): August 2001. [4]. Baldi, Pierre, “Modeling the Internet and the Web: Probabilistic Methods and Algorithms”, 2003. [5]. Barry M. Leiner, Vinton G. Cerf, David D. Clark, Robert E. Kahn, Leonard, Kleinrock, Daniel C. Lynch, Jon [6]. Postel, Larry G. Roberts, Stephen Wolff, “A Brief History of the Internet”, www.isoc.org/internet/his tory, [7]. Brian E. Brewington, George Cybenko, “How dynamic i s the web.”, In Proceedings of the Ninth Internationa l World-Wide Web Conference, Amsterdam, Netherlands, May 2000. [8]. Dirk Lewandowski, “Web searching, search engines an d Information Retrieval, Information Services & Use”, 25 (2005) 137-147, IOS Press, 2005. [9]. Franklin, Curt, “How Internet Search Engines Work”, 2002, www.howstuffworks.com. [10]. Grossan. B, “Search Engines : What they are, how th ey work, and practical suggestions for getting the most out of them”, Feb97, webreference.com. [11]. Heydon A., Najork M., “Mercator: A scalable, extens ible Web crawler.”, World Wide Web, vol. 2, no. 4, pp. 2 19-229, 1999. [12]. Mark Najork, Allan Heydon, “High- Performance Web Crawling”, September 2001. [13]. Mike, Burner, “Crawling towards Eternity : Building an archive of the World Wide Web”, Web Techniques Magazine, 2(5), May 1997. [14]. Niraj Singhal, Ashutosh Dixit, “Retrieving Informat ion from the Web and Search Engine Application”, in proceedings of National Conference on “Emerging Trends in Software and Network Techniques– ETSNT’09”, Amity University, Noida, India, Apr 2009 . [15]. Sergey Brin, Lawrence Page, “The anatomy of a large - scale hyper textual Web search engine”, Proceedings of the Seventh International World Wide Web Conference, pages 107-117, April 1998.