Term Paper Presentation titled
SEARCH ENGINE(s)
Submitted in partial fulfillment of the requirements for the award
of
BACHELOR’S IN COMPUTER APPLICATIONS (BCA)
of Integral University, Lucknow.
Session: 2004-07
Submitted by
PRASHANT MATHUR
Roll Number: 0400518017
Under the guidance of Mr. Pavan Srivastava
Name and Address of the Study Centre
UPTEC Computer Consultancy Limited, Kapoorthala, Lucknow.
Contents
i. Prelude
ii. History
iii. Challenges faced by search engines
iv. How search engines work
v. Storage costs and crawling time
vi. Geospatially enabled search engines
vii. Vertical Search engines
viii. Search Engine Optimization (SEO)
ix. PageRank
x. Google Architecture Overview
xi. Conclusions
WHAT IS A SEARCH ENGINE?
Prelude
With billions of items of information scattered around the World Wide Web, how
do you find what you are looking for? Someone might tell you the address of
an interesting site, or you might hear of an address on TV, on the radio or in a
magazine. Without search engines these would be your only ways of finding
things. Search engines use computer programs that spend all their time
trawling through the vast amount of information on the web, creating huge
indexes of it. You can go to the web page of a search engine and type in what
you are looking for, and the search engine software will look through its
indexes and give you a list of Web pages that contain the words you typed.
A search engine is computer software that compiles lists of documents, most
commonly those on the World Wide Web (WWW), and the contents of those
documents. Search engines respond to a user entry, or query, by searching
the lists and displaying a list of documents (called Web sites when on the
WWW) that match the search query. Some search engines include the
opening portion of the text of Web pages in their lists, but others include only
the titles or addresses (known as Uniform Resource Locators, or URLs) of
Web pages. Some search engines operate separately from the WWW, indexing
documents on a local area network or other system.
The major global general-purpose search engines include Google, Yahoo!,
MSN Search, AltaVista, Lycos, and HotBot. Yahoo!—one of the first available
search engines—differs from most other search sites because the content and
listings are manually compiled and organized by subject into a directory. As of
January 2005, Google ranked as the most comprehensive search engine
available, with several billion pages indexed.
These engines operate by building—and regularly updating—an enormous
index of Web pages and files. This is done with the help of a Web crawler, or
spider, a kind of automated browser that perpetually trolls the Web, retrieving
each page it finds. Pages are then indexed according to the words they
contain, with special treatment given to words in titles and other headers.
When a user inputs a query, the search engine then scans the index and
retrieves a list of pages that seem to best fit what the user is looking for.
Search engines often return results in fractions of a second.
Generally, when an engine displays a list of results, pages are ranked
according to how many other sites link to those pages. The assumption is that
the more useful a site is, the more often other sites will send users to it.
Google pioneered this technique in the late 1990s with a technology called
PageRank. But this is not the only way of ranking results. Dozens of other
criteria are used, and these will vary from engine to engine.
A Search Engine is an information retrieval system designed to help find
information stored on a computer system, such as on the World Wide Web,
inside a corporate or proprietary network, or in a personal computer. The
search engine allows one to ask for content meeting specific criteria (typically
those containing a given word or phrase) and retrieves a list of items that
match those criteria. This list is often sorted with respect to some measure of
relevance of the results. Search engines use regularly updated indexes to
operate quickly and efficiently.
Without further qualification, search engine usually refers to a Web search
engine, which searches for information on the public Web. Other kinds of
search engine are enterprise search engines, which search on intranets,
personal search engines, and mobile search engines. Different selection and
relevance criteria may apply in different environments, or for different uses.
Some search engines also mine data available in newsgroups, databases, or
open directories. Unlike Web directories, which are maintained by human
editors, search engines operate algorithmically or are a mixture of algorithmic
and human input.
History
The very first tool used for searching on the Internet was Archie.[1] The name
stands for "archive" without the "v". It was created in 1990 by Alan Emtage, a
student at McGill University in Montreal. The program downloaded the
directory listings of all the files located on public anonymous FTP (File
Transfer Protocol) sites, creating a searchable database of filenames;
however, Archie could not search by file contents.
While Archie indexed computer files, Gopher indexed plain text documents.
Gopher was created in 1991 by Mark McCahill at the University of Minnesota;
Gopher was named after the school's mascot. Because these were text files,
most of the Gopher sites became websites after the creation of the World
Wide Web.
Two other programs, Veronica and Jughead, searched the files stored in
Gopher index systems. Veronica (Very Easy Rodent-Oriented Net-wide Index
to Computerized Archives) provided a keyword search of most Gopher menu
titles in the entire Gopher listings. Jughead (Jonzy's Universal Gopher
Hierarchy Excavation And Display) was a tool for obtaining menu information
from various Gopher servers. While the name of the search engine "Archie"
was not a reference to the Archie comic book series, "Veronica" and
"Jughead" are characters in the series, thus referencing their predecessor.
The first Web search engine was Wandex, a now-defunct index collected by
the World Wide Web Wanderer, a web crawler developed by Matthew Gray at
MIT in 1993. Another very early search engine, Aliweb, also appeared in 1993,
and still runs today. The first "full text" crawler-based search engine was
WebCrawler, which came out in 1994. Unlike its predecessors, it let users
search for any word in any webpage, which has since become the standard for
all major search engines. It was also the first one to be widely known by the
public. Also in 1994 Lycos (which started at Carnegie Mellon University) came
out, and became a major commercial endeavor.
Soon after, many search engines appeared and vied for popularity. These
included Excite, Infoseek, Inktomi, Northern Light, and AltaVista. In some
ways, they competed with popular directories such as Yahoo!. Later, the
directories integrated or added on search engine technology for greater
functionality.
Search engines were also known as some of the brightest stars in the Internet
investing frenzy that occurred in the late 1990s. Several companies entered
the market spectacularly, receiving record gains during their initial public
offerings. Some have taken down their public search engine, and are
marketing enterprise-only editions, such as Northern Light.
Google's success was based in part on the concept of link popularity and PageRank.
The number of other websites and webpages that link to a given page is taken
into consideration with PageRank, on the premise that good or desirable
pages are linked to more than others. The PageRank of linking pages and the
number of links on these pages contribute to the PageRank of the linked page.
This makes it possible for Google to order its results by how many websites
link to each found page. Google's minimalist user interface is very popular with
users, and has since spawned a number of imitators.
Google and most other web engines utilize not only PageRank but more than
150 criteria to determine relevancy. The algorithm "remembers" where it has
been and indexes the number of cross-links and relates these into groupings.
PageRank is based on citation analysis that was developed in the 1950s by
Eugene Garfield at the University of Pennsylvania. Google's founders cite
Garfield's work in their original paper. In this way virtual communities of
webpages are found. Teoma's search technology uses a communities
approach in its ranking algorithm. NEC Research Institute has worked on
similar technology. Web link analysis was first developed by Jon Kleinberg and
his team while working on the CLEVER project at IBM's Almaden Research
Center. Google is currently the most popular search engine.
Yahoo! Search
The two founders of Yahoo!, David Filo and Jerry Yang, Ph.D. candidates in
Electrical Engineering at Stanford University, started their guide in a campus
trailer in February 1994 as a way to keep track of their personal interests on
the Internet. Before long they were spending more time on their home-brewed
lists of favorite links than on their doctoral dissertations. Eventually, Jerry and
David's lists became too long and unwieldy, and they broke them out into
categories. When the categories became too full, they developed
subcategories ... and the core concept behind Yahoo! was born. In 2002,
Yahoo! acquired Inktomi and in 2003, Yahoo! acquired Overture, which owned
AlltheWeb and AltaVista. Despite owning its own search engine, Yahoo!
initially kept using Google to provide its users with search results on its main
website Yahoo.com. However, in 2004, Yahoo! launched its own search
engine based on the combined technologies of its acquisitions and providing a
service that gave pre-eminence to the Web search engine over the directory.
Microsoft
The most recent major search engine is MSN Search (which later evolved into Windows
Live Search), owned by Microsoft, which previously relied on others for its
search engine listings. In 2004 it debuted a beta version of its own results,
powered by its own web crawler (called msnbot). In early 2005 it started
showing its own results live. This was barely noticed by average users
unaware of where results come from, but was a huge development for many
webmasters, who seek inclusion in the major search engines. At the same
time, Microsoft ceased using results from Inktomi, now owned by Yahoo!. In
2006, Microsoft migrated to a new search platform, Windows Live Search,
retiring the "MSN Search" name in the process.
Challenges faced by search engines
a. The Web is growing much faster than any present-technology search
engine can possibly index (see distributed web crawling). In 2006, some
users found that major search engines had become slower to index new web pages.
b. Many web pages are updated frequently, which forces the search engine to
revisit them periodically.
c. The queries one can make are currently limited to searching for key words,
which may result in many false positives, especially using the default
whole-page search. Better results might be achieved by using a proximity-
search option with a search-bracket to limit matches within a paragraph or
phrase, rather than matching random words scattered across large pages.
Another alternative is using human operators to do the researching for the
user with organic search engines.
d. Dynamically generated sites may be slow or difficult to index, or may
produce an excessive number of results, perhaps generating 500 times more
pages than average. Example: for a dynamic web page whose content
changes based on entries drawn from a database, a search engine might be
asked to index 50,000 static pages for 50,000 different parameter values
passed to that dynamic page.
e. Many dynamically generated websites are not indexable by search
engines; this phenomenon is known as the invisible web. There are
search engines that specialize in crawling the invisible web by crawling
sites that have dynamic content, require forms to be filled out, or are
password protected.
f. Relevancy: sometimes the engine can't get what the person is looking for.
g. Some search-engines do not rank results by relevance, but by the amount
of money the matching websites pay.
h. In 2006, hundreds of generated websites used tricks to manipulate search
engines into displaying them higher in the results for numerous keywords.
This can lead to some search results being polluted with linkspam or bait-
and-switch pages which contain little or no information about the matching
phrases. The more relevant web pages are pushed further down in the
results list, perhaps by 500 entries or more.
i. Secure pages (content hosted on HTTPS URLs) pose a challenge for
crawlers which either can't browse the content for technical reasons or
won't index it for privacy reasons.
How search engines work
A search engine operates in the following order:
a. Web crawling
b. Indexing
c. Searching
Web search engines work by storing information about a large number of web
pages, which they retrieve from the WWW itself. These pages are retrieved by
a Web crawler (sometimes also known as a spider) — an automated Web
browser which follows every link it sees. Exclusions can be made by the use of
robots.txt. The contents of each page are then analyzed to determine how it
should be indexed (for example, words are extracted from the titles, headings,
or special fields called meta tags). Data about web pages are stored in an
index database for use in later queries. Some search engines, such as
Google, store all or part of the source page (referred to as a cache) as well as
information about the web pages, whereas others, such as AltaVista, store
every word of every page they find. This cached page always holds the text
that was actually indexed, so it can be very useful when the content of the live
page has since been updated and the search terms no longer appear in it. This
problem might be considered a mild form of linkrot, and Google's handling of it
increases usability by satisfying the user's expectation that the search terms
will appear on the returned page (the principle of least astonishment).
Increased search relevance makes these cached pages very useful, even
beyond the fact that they may contain data that may no longer be available
elsewhere.
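The crawl-then-index cycle described above can be sketched in a few lines of
Python using only the standard library. This is a minimal, single-threaded
illustration rather than how a production engine works: the seed URL, the
"example-crawler" user agent and the page limit are arbitrary assumptions, and
real crawlers add politeness delays, many parallel connections, and far more
error handling.

    # Minimal single-threaded crawler sketch: fetch pages, honour robots.txt
    # exclusions, and follow links. Seed URL, user agent, and page limit are
    # illustrative assumptions.
    import urllib.request
    import urllib.robotparser
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkExtractor(HTMLParser):
        """Collect href targets from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def allowed(url, agent="example-crawler"):
        """Check the site's robots.txt before fetching."""
        robots_url = "{0.scheme}://{0.netloc}/robots.txt".format(urlparse(url))
        parser = urllib.robotparser.RobotFileParser(robots_url)
        try:
            parser.read()
        except OSError:
            return True                  # robots.txt unreachable: assume allowed
        return parser.can_fetch(agent, url)

    def crawl(seed, max_pages=10):
        """Breadth-first crawl returning {url: html} for later indexing."""
        seen, queue, pages = {seed}, deque([seed]), {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            if not allowed(url):
                continue
            try:
                with urllib.request.urlopen(url, timeout=10) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except OSError:
                continue                 # skip pages that fail to download
            pages[url] = html
            extractor = LinkExtractor()
            extractor.feed(html)
            for link in extractor.links:
                absolute = urljoin(url, link)
                if urlparse(absolute).scheme in ("http", "https") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return pages

    pages = crawl("http://example.com/", max_pages=5)
    print(len(pages), "pages fetched")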
When a user comes to the search engine and makes a query, typically by
giving key words, the engine looks up the index and provides a listing of best-
matching web pages according to its criteria, usually with a short summary
containing the document's title and sometimes parts of the text. Most search
engines support the use of the Boolean operators AND, OR and NOT to further
specify the search query. An advanced feature is proximity search, which
allows users to define the distance between keywords.
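As a small illustration of the index lookup just described, the sketch below
builds a toy in-memory index and answers Boolean AND/NOT queries over it.
The documents and queries are made-up examples, and real engines use far
more elaborate data structures (see the barrels and hit lists discussed later).

    # Toy inverted index with Boolean AND / NOT lookup. Documents and
    # queries are made-up examples.
    def build_index(docs):
        """Map each word to the set of document ids containing it."""
        index = {}
        for doc_id, text in docs.items():
            for word in text.lower().split():
                index.setdefault(word, set()).add(doc_id)
        return index

    def search(index, include, exclude=(), all_docs=()):
        """Return ids of documents containing every include word and no exclude word."""
        result = set(all_docs)
        for word in include:             # Boolean AND
            result &= index.get(word, set())
        for word in exclude:             # Boolean NOT
            result -= index.get(word, set())
        return result

    docs = {
        1: "search engines index the web",
        2: "web crawlers retrieve pages",
        3: "the index answers keyword queries",
    }
    index = build_index(docs)
    print(search(index, ["index", "web"], all_docs=docs))             # -> {1}
    print(search(index, ["index"], exclude=["web"], all_docs=docs))   # -> {3}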
The usefulness of a search engine depends on the relevance of the result set
it gives back. While there may be millions of webpages that include a
particular word or phrase, some pages may be more relevant, popular, or
authoritative than others. Most search engines employ methods to rank the
results to provide the "best" results first. How a search engine decides which
pages are the best matches, and what order the results should be shown in,
varies widely from one engine to another. The methods also change over time
as Internet usage changes and new techniques evolve.
Most Web search engines are commercial ventures supported by advertising
revenue and, as a result, some employ the controversial practice of allowing
advertisers to pay money to have their listings ranked higher in search results.
Those search engines which do not accept money for their search engine
results make money by running search related ads alongside the regular
search engine results. The search engines make money every time someone
clicks on one of these ads.
The vast majority of search engines are run by private companies using
proprietary algorithms and closed databases, though some are open source.
Storage Costs and Crawling Time
Storage costs are not the limiting resource in search engine implementation.
Simply storing 10 billion pages of 10 KB each (compressed) requires 100TB
and another 100TB or so for indexes, giving a total hardware cost of under
$200k: 100 cheap PCs each with four 500GB disk drives.
However, a public search engine requires considerably more resources than
this to calculate query results and to provide high availability. Also, the costs of
operating a large server farm are not trivial.
Crawling 10B pages with 100 machines crawling at 100 pages/second would
take 1M seconds, or 11.6 days on a very high capacity Internet connection.
Most search engines crawl a small fraction of the Web (10-20% pages) at
around this frequency or better, but also crawl dynamic websites (e.g. news
sites and blogs) at a much higher frequency.
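The back-of-the-envelope figures above can be reproduced with a few lines of
arithmetic; the short Python calculation below simply re-uses the numbers
quoted in the text.

    # Reproducing the back-of-the-envelope figures quoted above.
    pages         = 10_000_000_000      # 10 billion pages
    page_size_kb  = 10                  # ~10 KB per page, compressed
    machines      = 100
    pages_per_sec = 100                 # per machine

    storage_tb = pages * page_size_kb / 1_000_000_000   # KB -> TB (decimal units)
    seconds    = pages / (machines * pages_per_sec)
    days       = seconds / 86_400

    print(f"page storage ~ {storage_tb:.0f} TB (plus roughly as much again for indexes)")
    print(f"crawl time   ~ {seconds:,.0f} s ~ {days:.1f} days")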
Geospatially enabled Search Engines
A recent enhancement to search engine technology is the addition of
geocoding and geoparsing to the processing of the ingested documents being
indexed, to enable searching within a specified locality (or region). Geoparsing
attempts to match any found references to locations and places to a
geospatial frame of reference, such as a street address, gazetteer locations,
or to an area (such as a polygonal boundary for a municipality). Through this
geoparsing process, latitudes and longitudes are assigned to the found places,
and these latitudes and longitudes are indexed for later spatial query and
retrieval. This can enhance the search process tremendously by allowing a
user to search for documents within a given map extent, or conversely, plot
the location of documents matching a given keyword to analyze incidence and
clustering, or any combination of the two. A number of search providers now
offer this feature.
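As a rough illustration of the idea, once documents carry geocoded
coordinates, a "search within this map extent" query reduces to a simple
spatial filter. The sketch below is illustrative only: the documents, coordinates
and bounding box are invented, and real systems use spatial indexes rather
than a linear scan.

    # Sketch of "search within a map extent": once geoparsing has assigned a
    # latitude/longitude to each document, a spatial query is just a filter on
    # those coordinates. Documents and the bounding box are invented examples.
    docs = [
        {"title": "City council minutes", "lat": 26.85, "lon": 80.95},
        {"title": "Port authority notice", "lat": 18.96, "lon": 72.82},
        {"title": "Museum opening hours",  "lat": 26.88, "lon": 80.91},
    ]

    def within_extent(doc, south, west, north, east):
        """True if the document's geocoded point falls inside the bounding box."""
        return south <= doc["lat"] <= north and west <= doc["lon"] <= east

    # An illustrative map extent roughly covering one city.
    hits = [d for d in docs if within_extent(d, south=26.7, west=80.8, north=27.0, east=81.1)]
    print([d["title"] for d in hits])   # -> ['City council minutes', 'Museum opening hours']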
Vertical Search Engines
Vertical search engines or specialized search engines are search engines
which specialize in specific content categories or that search within a specific
media. Popular Search engines, like Google or Yahoo!, are very effective
when the user searches for web sites, web pages or general information.
Vertical search engines enable the user to find specific types of listings thus
making the search more customized to the user's needs.
Search Engine Optimization (SEO)
SEO, a subset of search engine marketing, is the process of improving the volume
and quality of traffic to a web site from search engines via "natural" ("organic"
or "algorithmic") search results. SEO can also target specialized searches
such as image search, local search, and industry-specific vertical search
engines.
SEO is marketing by understanding how search algorithms work and what
human visitors might search for, to help match those visitors with sites offering
what they are interested in finding. Some SEO efforts may involve optimizing a
site's coding, presentation, and structure, without making very noticeable
changes to human visitors, such as incorporating a clear hierarchical structure
to a site, and avoiding or fixing problems that might keep search engine
indexing programs from fully spidering a site. Other, more noticeable efforts,
involve including unique content on pages that can be easily indexed and
extracted from those pages by search engines while also appealing to human
visitors.
A typical Search Engine Results Page (SERP)
The term SEO can also refer to "search engine optimizers," a term adopted by
an industry of consultants who carry out optimization projects on behalf of
clients, and by employees of site owners who may perform SEO services in-
house. Search engine optimizers often offer SEO as a stand-alone service or
as a part of a larger marketing campaign. Because effective SEO can require
making changes to the source code of a site, it is often very helpful when
incorporated into the initial development and design of a site, leading to the
use of the term "Search Engine Friendly" to describe designs, menus, content
management systems and shopping carts that can be optimized easily and
effectively.
Optimizing for Traffic Quality
In addition to seeking better rankings, search engine optimization is also
concerned with traffic quality. Traffic quality is measured by how often a visitor
using a specific keyword phrase leads to a desired conversion action, such as
making a purchase, viewing or downloading a certain page, requesting further
information, signing up for a newsletter, or taking some other specific action.
By improving the quality of a page's search listings, more searchers may
select that page and those searchers may be more likely to convert. Examples
of SEO tactics to improve traffic quality include writing attention-grabbing titles,
adding accurate meta descriptions, and choosing a domain and URL that
improve the site's branding.
Relationship between SEO and Search Engines
By 1997 search engines recognized that some webmasters were making
efforts to rank well in their search engines, and even manipulating the page
rankings in search results. In some early search engines, such as Infoseek,
ranking first was as easy as grabbing the source code of the top-ranked page,
placing it on your website, and submitting a URL to instantly index and rank
that page. Due to the high value and targeting of search results, there is
potential for an adversarial relationship between search engines and SEOs. In
2005, an annual conference named AirWeb was created to discuss bridging
the gap and minimizing the sometimes damaging effects of aggressive web
content providers.
Some more aggressive site owners and SEOs generate automated sites or
employ techniques that eventually get domains banned from the search
engines. Many search engine optimization companies, which sell services,
employ long-term, low-risk strategies, and most SEO firms that do employ
high-risk strategies do so on their own affiliate, lead-generation, or content
sites, instead of risking client websites.
Some SEO companies employ aggressive techniques that get their client
websites banned from the search results. The Wall Street Journal profiled a
company, Traffic Power, that allegedly used high-risk techniques and failed to
disclose those risks to its clients. Wired reported the same company sued a
blogger for mentioning that they were banned. Google’s Matt Cutts later
confirmed that Google did in fact ban Traffic Power and some of its clients.
Some search engines have also reached out to the SEO industry, and are
frequent sponsors and guests at SEO conferences and seminars. In fact, with
the advent of paid inclusion, some search engines now have a vested interest
in the health of the optimization community. All of the main search engines
provide information/guidelines to help with site optimization: Google's,
Yahoo!'s, MSN's and Ask.com's. Google has a Sitemaps program to help
webmasters learn if Google is having any problems indexing their website and
also provides data on Google traffic to the website. Yahoo! has Site Explorer
that provides a way to submit your URLs for free (like MSN/Google),
determine how many pages are in the Yahoo! index and drill down on inlinks
to deep pages. Yahoo! has an Ambassador Program and Google has a
program for qualifying Google Advertising Professionals.
Types of SEO
SEO techniques are classified by some into two broad categories: techniques
that search engines recommend as part of good design and those techniques
that search engines do not approve of and attempt to minimize the effect of,
referred to as spamdexing. Most professional SEO consultants do not offer
spamming and spamdexing techniques amongst the services that they provide
to clients. Some industry commentators classify these methods, and the
practitioners who utilize them, as either "white hat SEO", or "black hat SEO".
Many SEO consultants reject the black and white hat dichotomy as a
convenient but unfortunate and misleading over-simplification that makes the
industry look bad as a whole.
White Hat
An SEO tactic, technique or method is considered "White hat" if it conforms to
the search engines' guidelines and/or involves no deception. As the search
engine guidelines are not written as a series of rules or commandments, this is
an important distinction to note. White Hat SEO is not just about following
guidelines, but is about ensuring that the content a search engine indexes and
subsequently ranks is the same content a user will see.
White Hat advice is generally summed up as creating content for users, not for
search engines, and then making that content easily accessible to search engine
spiders, rather than gaming the system. White hat SEO is in many ways similar
to web development that promotes accessibility, although the two are not
identical.
Black hat /Spamdexing
"Black hat" SEO are methods to try to improve rankings that are disapproved
of by the search engines and/or involve deception. This can range from text
that is "hidden", either as text colored similar to the background or in an
invisible or left of visible div, or by redirecting users from a page that is built for
search engines to one that is more human friendly. A method that sends a
user to a page that was different from the page the search engine ranked is
Black hat as a rule. One well known example is Cloaking, the practice of
serving one version of a page to search engine spiders/bots and another
version to human visitors.
Search engines may penalize sites they discover using black hat methods,
either by reducing their rankings or eliminating their listings from their
databases altogether. Such penalties can be applied either automatically by
the search engines' algorithms or by a manual review of a site.
‘Archie’ Search Engine
Archie is a tool for indexing FTP archives, allowing people to find specific
files. It is considered to be the first Internet search engine. The original
implementation was written in 1990 by Alan Emtage, Bill Heelan, and Peter J.
Deutsch, then students at McGill University in Montreal. The earliest versions
of archie simply contacted a list of FTP archives on a regular basis (contacting
each roughly once a month, so as not to waste too many resources on the
remote servers) and requested a listing. These listings were stored in local
files to be searched using the UNIX grep command. Later, more efficient front-
and back-ends were developed, and the system spread from a local tool, to a
network-wide resource, to a popular service available from multiple sites
around the Internet. Such archie servers could be accessed in multiple ways:
using a local client (such as archie or xarchie); telneting to a server directly;
sending queries by electronic mail; and later via World Wide Web interfaces.
The name derives from the word "archive", but is also associated with the
comic book series of the same name. This was not originally intended, but it
certainly acted as the inspiration for the names of Jughead and Veronica, both
search systems for the Gopher protocol, named after other characters from
the same comics.
The World Wide Web made searching for files much easier, and there are
currently very few archie servers in operation. One gateway can be found in
Poland.
System Features
The Google search engine has two important features that help it produce high
precision results. First, it makes use of the link structure of the Web to
calculate a quality ranking for each web page. This ranking is called
PageRank. Second, Google utilizes link text to improve search results.
PageRank: Bringing Order to the Web
The citation (link) graph of the web is an important resource that has largely
gone unused in existing web search engines. We have created maps
containing as many as 518 million of these hyperlinks, a significant sample of
the total. These maps allow rapid calculation of a web page's "PageRank", an
objective measure of its citation importance that corresponds well with
people's subjective idea of importance. Because of this correspondence,
PageRank is an excellent way to prioritize the results of web keyword
searches. For most popular subjects, a simple text matching search that is
restricted to web page titles performs admirably when PageRank prioritizes
the results. For the type of full text searches in the main Google system,
PageRank also helps a great deal.
Description of PageRank Calculation
Academic citation literature has been applied to the web, largely by counting
citations or backlinks to a given page. This gives some approximation of a
page's importance or quality. PageRank extends this idea by not counting
links from all pages equally, and by normalizing by the number of links on a
page. PageRank is defined as follows:
We assume page A has pages T1...Tn which point to it (i.e., are citations). The
parameter d is a damping factor which can be set between 0 and 1; we usually
set d to 0.85. Also, C(A) is defined as the number of links going out of page A.
The PageRank of page A is then given as follows:

PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the
sum of all web pages' PageRanks will be one.
PageRank, or PR(A), can be calculated using a simple iterative algorithm, and
corresponds to the principal eigenvector of the normalized link matrix of the
web. A PageRank for 26 million web pages can be computed in a few hours on
a medium-sized workstation. There are many other details which are beyond
the scope of this paper.
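A minimal sketch of that iterative calculation is shown below, applying the
formula above to a tiny, made-up link graph. Real computations run over
hundreds of millions of pages with sparse-matrix techniques rather than
Python dictionaries.

    # Iterative PageRank over a tiny, invented link graph, following the formula
    # above: PR(A) = (1 - d) + d * sum(PR(T)/C(T)) over the pages T linking to A.
    def pagerank(links, d=0.85, iterations=50):
        """links maps each page to the list of pages it links out to."""
        pages = list(links)
        pr = {page: 1.0 for page in pages}          # initial guess
        for _ in range(iterations):
            pr = {
                page: (1 - d) + d * sum(pr[t] / len(links[t])
                                        for t in pages if page in links[t])
                for page in pages
            }
        return pr

    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(links))   # pages with more (and better-ranked) in-links score higher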
Anchor Text
The text of links is treated in a special way in our search engine. Most search
engines associate the text of a link with the page that the link is on. In addition,
we associate it with the page the link points to. This has several advantages.
First, anchors often provide more accurate descriptions of web pages than the
pages themselves. Second, anchors may exist for documents which cannot be
indexed by a text based search engine, such as images, programs, and
databases. This makes it possible to return web pages which have not actually
been crawled. Note that pages that have not been crawled can cause
problems, since they are never checked for validity before being returned to
the user. In this case, the search engine can even return a page that never
actually existed, but had hyperlinks pointing to it. However, it is possible to sort
the results, so that this particular problem rarely happens. This idea of
propagating anchor text to the page it refers to was implemented in the World
Wide Web Worm especially because it helps search non-text information, and
expands the search coverage with fewer downloaded documents. We use
anchor propagation mostly because anchor text can help provide better quality
results. Using anchor text efficiently is technically difficult because of the large
amounts of data which must be processed. In our current crawl of 24 million
pages, we had over 259 million anchors which we indexed.
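The core bookkeeping behind anchor propagation can be illustrated in a few
lines: each (source, target, anchor text) triple parsed from a crawled page is
indexed under the target URL rather than only under the source. The URLs
and anchor strings below are invented examples.

    # Sketch of anchor-text propagation: the text of each link is indexed under
    # the page the link points TO, not only the page it appears on.
    from collections import defaultdict

    # (source page, target page, anchor text) triples parsed out of crawled pages
    anchors = [
        ("http://a.example/", "http://b.example/photos", "image gallery"),
        ("http://c.example/", "http://b.example/photos", "photo archive"),
        ("http://a.example/", "http://d.example/pub",    "ftp mirror"),
    ]

    anchor_index = defaultdict(list)
    for source, target, text in anchors:
        anchor_index[target].append(text)   # propagate the text to the target page

    # The target can now match queries like "photo" even if its own content
    # (say, an image) was never indexed as text, as described above.
    print(anchor_index["http://b.example/photos"])  # -> ['image gallery', 'photo archive']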
Other Features
Aside from PageRank and the use of anchor text, Google has several other
features. First, it has location information for all hits and so it makes extensive
use of proximity in search. Second, Google keeps track of some visual
presentation details such as font size of words. Words in a larger or bolder font
are weighted higher than other words. Third, full raw HTML of pages is
available in a repository.
System Anatomy
This section gives a high-level overview of the Google architecture, followed by
descriptions of its important data structures. Finally, the major applications
(crawling, indexing, and searching) are examined in more depth.
Google Architecture Overview
This is a high-level overview of how the whole system works; the sections that
follow discuss the applications and data structures in more detail. Most of
Google is implemented in C or C++ for efficiency and can run on either Solaris
or Linux. In Google, the web crawling (downloading of web pages) is done by
several distributed crawlers. There is a URLserver
that sends lists of URLs to be fetched to the crawlers. The web pages that are
fetched are then sent to the storeserver. The storeserver then compresses
and stores the web pages into a repository. Every web page has an
associated ID number called a docID which is assigned whenever a new URL
is parsed out of a web page. The indexing function is performed by the indexer
and the sorter. The indexer performs a number of functions. It reads the
repository, uncompresses the documents, and parses them. Each document is
converted into a set of word occurrences called hits. The hits
record the word, position in document, an approximation of font size, and
capitalization. The indexer distributes these hits into a set of "barrels", creating
a partially sorted forward index. The indexer performs another important
function. It parses out all the links in every web page and stores important
information about them in an anchors file. This file contains enough
information to determine where each link points from and to, and the text of
the link. The URLresolver reads the anchors file and converts relative URLs
into absolute URLs and in turn into docIDs. It puts the anchor text into the
forward index, associated with the docID that the anchor points to. It also
generates a database of links which are pairs of docIDs. The links database is
used to compute PageRanks for all the documents. The sorter takes the
barrels, which are sorted by docID and resorts them by wordID to generate the
inverted index. This is done in place so that little temporary space is needed
for this operation. The sorter also produces a list of wordIDs and offsets into
the inverted index. A program called DumpLexicon takes this list together with
the lexicon produced by the indexer and generates a new lexicon to be used
by the searcher. The searcher is run by a web server and uses the lexicon
built by DumpLexicon together with the inverted index and the PageRanks to
answer queries.
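Two small pieces of bookkeeping from this pipeline, assigning docIDs to newly
seen URLs and resolving anchors into (from, to) docID pairs for the links
database, can be sketched as follows. The URLs are invented, and the real
system keeps these structures on disk rather than in Python dictionaries.

    # Assigning a docID the first time a URL is parsed out, and resolving
    # anchors into (from_docID, to_docID) pairs for the links database.
    doc_ids = {}

    def doc_id(url):
        """Assign a new docID whenever a previously unseen URL appears."""
        if url not in doc_ids:
            doc_ids[url] = len(doc_ids)
        return doc_ids[url]

    # Anchors-file entries: (page the link is on, page it points to, anchor text).
    anchors = [
        ("http://a.example/", "http://b.example/results", "results page"),
        ("http://b.example/results", "http://a.example/", "home"),
    ]

    links_db = [(doc_id(src), doc_id(dst)) for src, dst, _ in anchors]
    print(links_db)   # -> [(0, 1), (1, 0)]  pairs used later to compute PageRanks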
Major Data Structures
Google's data structures are optimized so that a large document collection can
be crawled, indexed, and searched with little cost. Although CPUs and bulk
input/output rates have improved dramatically over the years, a disk seek still
requires about 10 ms to complete. Google is designed to avoid disk seeks
whenever possible, and this has had a considerable influence on the design of
the data structures.
Hit Lists
A hit list corresponds to a list of occurrences of a particular word in a particular
document including position, font, and capitalization information. Hit lists
account for most of the space used in both the forward and the inverted
indices. Because of this, it is important to represent them as efficiently as
possible. We considered several alternatives for encoding position, font, and
capitalization: a simple encoding (a triple of integers), a compact encoding (a
hand-optimized allocation of bits), and Huffman coding.
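As an illustration of what a hand-optimized compact encoding might look like,
the sketch below packs one hit into 16 bits: one capitalization bit, three
font-size bits and twelve position bits. This particular split is an assumption
made for the example, not a statement of the exact layout Google used.

    # A compact, hand-optimized hit encoding packed into 16 bits. The split
    # (1 capitalization bit, 3 font-size bits, 12 position bits) is illustrative.
    def encode_hit(capitalized, font_size, position):
        """Pack one hit into a single 16-bit integer."""
        assert 0 <= font_size < 8 and 0 <= position < 4096
        return (int(capitalized) << 15) | (font_size << 12) | position

    def decode_hit(hit):
        """Unpack a 16-bit hit back into its three fields."""
        return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

    hit = encode_hit(capitalized=True, font_size=5, position=1234)
    print(f"{hit:#06x}", decode_hit(hit))   # -> 0xd4d2 (True, 5, 1234)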
Crawling the Web
Running a web crawler is a challenging task. There are tricky performance and
reliability issues and even more importantly, there are social issues. Crawling
is the most fragile application since it involves interacting with hundreds of
thousands of web servers and various name servers which are all beyond the
control of the system. In order to scale to hundreds of millions of web pages,
Google has a fast distributed crawling system. A single URLserver serves lists
of URLs to a number of crawlers. Both the URLserver and the crawlers are
implemented in Python. Each crawler keeps roughly 300 connections open at
once. This is necessary to retrieve web pages at a fast enough pace. At peak
speeds, the system can crawl over 100 web pages per second using four
crawlers. This amounts to roughly 600K per second of data. A major
performance stress is DNS lookup. Each crawler maintains its own DNS
cache so it does not need to do a DNS lookup before crawling each document.
Each of the hundreds of connections can be in a number of different states:
looking up DNS, connecting to host, sending request, and receiving response.
These factors make the crawler a complex component of the system. It uses
asynchronous IO to manage events, and a number of queues to move page
fetches from state to state. It turns out that running a crawler which connects
to more than half a million servers, and generates tens of millions of log
entries generates a fair amount of email and phone calls. Because of the vast
number of people coming on line, there are always those who do not know
what a crawler is, because this is the first one they have seen. Almost daily,
we receive an email something like, "Wow, you looked at a lot of pages from
my web site. How did you like it?" There are also some people who do not
know about the robots exclusion protocol, and think their page should be
protected from indexing by a statement like, "This page is copyrighted and
should not be indexed", which needless to say is difficult for web crawlers to
understand. Also, because of the huge amount of data involved, unexpected
things will happen. For example, our system tried to crawl an online game.
This resulted in lots of garbage messages in the middle of their game! It turns
out this was an easy problem to fix. But this problem had not come up until we
had downloaded tens of millions of pages. Because of the immense variation
in web pages and servers, it is virtually impossible to test a crawler without
running it on a large part of the Internet. Invariably, there are hundreds of
obscure problems which may only occur on one page out of the whole web
and cause the crawler to crash, or worse, cause unpredictable or incorrect
behavior. Systems which access large parts of the Internet need to be
designed to be very robust and carefully tested. Since large complex systems
such as crawlers will invariably cause problems, there needs to be significant
resources devoted to reading the email and solving these problems as they
come up.
Indexing the Web
Parsing
Any parser which is designed to run on the entire Web must handle a huge
array of possible errors. These range from typos in HTML tags to kilobytes of
zeros in the middle of a tag, non-ASCII characters, HTML tags nested
hundreds deep, and a great variety of other errors that challenge anyone's
imagination to come up with equally creative ones. For maximum speed,
instead of using YACC to generate a CFG parser, we use flex to generate a
lexical analyzer which we outfit with its own stack. Developing this parser
which runs at a reasonable speed and is very robust involved a fair amount of
work.
Indexing Documents into Barrels
After each document is parsed, it is encoded into a number of barrels. Every
word is converted into a wordID using an in-memory hash table, the lexicon.
New additions to the lexicon hash table are logged to a file. Once the words
are converted into wordIDs, their occurrences in the current document are
translated into hit lists and are written into the forward barrels. The main
difficulty with parallelization of the indexing phase is that the lexicon needs to
be shared. Instead of sharing the lexicon, we took the approach of writing a
log of all the extra words that were not in a base lexicon, which we fixed at 14
million words. That way multiple indexers can run in parallel and then the small
log file of extra words can be processed by one final indexer.
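A toy version of that wordID lookup might look like the following. The base
lexicon and document text are invented, and the provisional-id step is an
illustrative simplification; the logging of unseen words for a final pass mirrors
the approach described above.

    # Toy wordID lookup: words in the base lexicon get their existing id,
    # unseen words are appended to a log for a final indexing pass.
    base_lexicon = {"search": 0, "engine": 1, "web": 2}
    extra_words_log = []

    def word_id(word, lexicon=base_lexicon, log=extra_words_log):
        """Return the wordID for word, logging words missing from the base lexicon."""
        if word not in lexicon:
            log.append(word)
            lexicon[word] = len(lexicon)   # provisional id until the final pass
        return lexicon[word]

    document = "search the web with a crawler"
    hits = [(word_id(w), position) for position, w in enumerate(document.split())]
    print(hits)               # (wordID, position) pairs destined for a forward barrel
    print(extra_words_log)    # -> ['the', 'with', 'a', 'crawler']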
Sorting
In order to generate the inverted index, the sorter takes each of the forward
barrels and sorts it by wordID to produce an inverted barrel for title and anchor
hits and a full text inverted barrel. This process happens one barrel at a time,
thus requiring little temporary storage. Also, we parallelize the sorting phase to
use as many machines as we have simply by running multiple sorters, which
can process different buckets at the same time. Since the barrels don't fit into
main memory, the sorter further subdivides them into baskets which do fit into
memory based on wordID and docID. Then the sorter loads each basket into
memory, sorts it and writes its contents into the short inverted barrel and the
full inverted barrel.
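The essence of this re-sorting step can be shown in miniature: a forward barrel
holding (docID, wordID, position) records is re-sorted by wordID to give, for
each word, the documents and positions containing it. The records below are
invented, and the real sorter works on disk-resident barrels split into
memory-sized baskets.

    # Miniature version of the sorting step: re-sort a forward barrel of
    # (docID, wordID, position) records by wordID to build an inverted barrel.
    from itertools import groupby

    forward_barrel = [
        (1, 7, 0), (1, 3, 1), (1, 7, 5),   # records for docID 1
        (2, 3, 2), (2, 9, 4),              # records for docID 2
    ]

    # Re-sort by wordID (then docID), as the sorter does barrel by barrel.
    records = sorted(forward_barrel, key=lambda rec: (rec[1], rec[0]))

    inverted_barrel = {
        word: [(doc, pos) for doc, _, pos in group]
        for word, group in groupby(records, key=lambda rec: rec[1])
    }
    print(inverted_barrel)
    # -> {3: [(1, 1), (2, 2)], 7: [(1, 0), (1, 5)], 9: [(2, 4)]}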
Searching
The goal of searching is to provide quality search results efficiently. Many of
the large commercial search engines seemed to have made great progress in
terms of efficiency. Therefore, we have focused more on quality of search in
our research, although we believe our solutions are scalable to commercial
volumes with a bit more effort. We are currently investigating other ways to
solve this problem. In the past, we sorted the hits according to
PageRank, which seemed to improve the situation.
A Research Tool
In addition to being a high quality search engine, Google is a research tool.
The data Google has collected has already resulted in many other papers
submitted to conferences and many more on the way. Recent research has
shown a number of limitations to queries about the Web that may be
answered without having the Web available locally. This means that Google
(or a similar system) is not only a valuable research tool but a necessary one
for a wide range of applications. We hope Google will be a resource for
searchers and researchers all around the world and will spark the next
generation of search engine technology.
Conclusions
Google is designed to be a scalable search engine. The primary goal is to
provide high quality search results over a rapidly growing World Wide Web.
Google employs a number of techniques to improve search quality including
PageRank, anchor text, and proximity information. Furthermore, Google is a
complete architecture for gathering web pages, indexing them, and performing
search queries over them.
Future Work
A large-scale web search engine is a complex system and much remains to be
done. Our immediate goals are to improve search efficiency and to scale to
approximately 100 million web pages. Some simple improvements to
efficiency include query caching, smart disk allocation, and subindices.
Another area which requires much research is updates. We must have smart
algorithms to decide what old web pages should be recrawled and what new
ones should be crawled. Work toward this goal has been done. We are
planning to add simple features supported by commercial search engines like
boolean operators, negation, and stemming. However, other features are just
starting to be explored such as relevance feedback and clustering (Google
currently supports a simple hostname based clustering). We are also working
to extend the use of link structure and link text. Simple experiments indicate
PageRank can be personalized by increasing the weight of a user's home
page or bookmarks. A web search engine is a very rich environment for
research ideas. We have far too many to list here so we do not expect this
Future Work section to become much shorter in the near future.
High Quality Search
The biggest problem facing users of web search engines today is the quality of
the results they get back. While the results are often amusing and expand
users' horizons, they are often frustrating and consume precious time. For
example, the top result for a search for "Bill Clinton" on one of the most
popular commercial search engines was the Bill Clinton Joke of the Day: April
14, 1997. Google is designed to provide higher quality search so that, as the
Web continues to grow rapidly, information can still be found easily. In order to
accomplish this Google makes heavy use of hypertextual information
consisting of link structure and link (anchor) text. Google also uses proximity
and font information. While evaluation of a search engine is difficult, we have
subjectively found that Google returns higher quality search results than
current commercial search engines. The analysis of link structure via
PageRank allows Google to evaluate the quality of web pages. The use of link
text as a description of what the link points to helps the search engine return
relevant (and to some degree high quality) results. Finally, the use of proximity
information helps increase relevance a great deal for many queries.