1[http://en.wikipedia.org/wiki/Search_engine_optimization]

Search engine optimization (SEO) is the process of improving the visibility of a website or
a webpage in search engines via the "natural" or un-paid ("organic" or "algorithmic") search
results. Other search engine marketing (SEM) methods, including paid listings, may
achieve greater effectiveness. In general, the earlier (or higher on the page), and more frequently
a site appears in the search results list, the more visitors it will receive from the search
engine's users. SEO may target different kinds of search, including image search, local
search, video search, academic search, news search and industry-specific vertical search
engines. This gives a website a web presence.

As an Internet marketing strategy, SEO considers how search engines work, what people
search for, the actual search terms typed into search engines and which search engines are
preferred by their targeted audience. Optimizing a website may involve editing its content
and HTML and associated coding to both increase its relevance to specific keywords and to
remove barriers to the indexing activities of search engines. Promoting a site to increase the
number of backlinks, or inbound links, is another SEO tactic.

The acronym "SEOs" can refer to "search engine optimizers," a term adopted by an industry
of consultants who carry out optimization projects on behalf of clients, and by employees
who perform SEO services in-house. Search engine optimizers may offer SEO as a stand-
alone service or as a part of a broader marketing campaign. Because effective SEO may
require changes to the HTML source code of a site and site content, SEO tactics may be
incorporated into website development and design. The term "search engine friendly" may be
used to describe website designs, menus, content management systems, images, videos,
shopping carts, and other elements that have been optimized for the purpose of search engine
exposure.

Another class of techniques, known as black hat SEO, search engine poisoning, or
spamdexing, uses methods such as link farms, keyword stuffing and article spinning that
degrade both the relevance of search results and the quality of user-experience with search
engines. Search engines look for sites that employ these techniques in order to remove them
from their indices.

Search engine optimization methods are techniques used by webmasters to get more
visibility for their sites in search engine results pages.

Getting indexed
The leading search engines, such as Google, Bing and Yahoo!, use crawlers to find pages for
their algorithmic search results. Pages that are linked from other search engine indexed pages
do not need to be submitted because they are found automatically. Some search engines,
notably Yahoo!, operate a paid submission service that guarantees crawling for either a set
fee or cost per click. Such programs usually guarantee inclusion in the database, but do not
guarantee specific ranking within the search results. Two major directories, the Yahoo
Directory and the Open Directory Project both require manual submission and human
editorial review.[2] Google offers Google Webmaster Tools, for which an XML Sitemap feed
can be created and submitted for free to ensure that all pages are found, especially pages that
are not discoverable by automatically following links.[3]

Search engine crawlers may look at a number of different factors when crawling a site. Not
every page is indexed by the search engines. Distance of pages from the root directory of a
site may also be a factor in whether or not pages get crawled.

Other methods
A variety of other methods are employed to get a webpage indexed and shown higher in the
results, and often a combination of these methods is used as part of a search engine
optimization campaign.

   •   Cross linking between pages of the same website to give more links to the main pages
       of the site, in order to increase the PageRank used by search engines; linking from
       other websites, including link farming and comment spam.
   •   Keyword-rich text in the webpage and key phrases, so as to match all search queries.[7]
       Adding relevant keywords to a web page's meta tags, including keyword stuffing.
   •   URL normalization for webpages with multiple URLs, using the "canonical" meta tag.[8]
   •   A backlink from a Web directory.
   •   SEO trending based on recent search behaviour, using tools like Google Insights for
       Search.
   •   Media content creation, such as press releases and online newsletters, to generate a
       number of incoming links.

Content Creation and Linking
Content creation is one of the primary focuses of any SEO's job. Without unique, relevant,
and easily scannable content, users tend to spend little to no time paying attention to a
website. Almost all SEOs that provide organic search improvement focus heavily on creating
this type of content, or "linkbait". Linkbait is a term used to describe content that is designed
to be shared and replicated virally in an effort to gain backlinks.

Often, webmasters and content administrators create blogs to easily provide this information
through a method that is intrinsically viral. However, most forget that traffic generated to
blog accounts doesn't point back to their respective domains, so they lose "link juice". Link
juice is jargon for links that provide a boost to PageRank and TrustRank. Changing the
domain of the blog to a subdomain of the respective domain is a quick way to combat this
siphoning of link juice.

Other commonly implemented methodologies for creating and disseminating content include
YouTube videos, Google Places accounts, as well as Picasa and Flickr photos indexed in
Google Image Search. These additional forms of content allow webmasters to produce
content that ranks well in the world's second most popular search engine - YouTube - in
addition to appearing in organic search results.

Gray hat techniques
Gray hat techniques are those that are neither really white nor black hat. Some of these gray
hat techniques may be argued either way. These techniques might have some risk associated
with them. A very good example of such a technique is purchasing links. The average price
for a text link depends on the perceived authority of the linking page. The authority is
sometimes measured by Google's PageRank, although this is not necessarily an accurate way
of determining the importance of a page.

While Google is against the sale and purchase of links, there are people who subscribe to
online magazines, memberships and other resources for the sole purpose of getting a link
back to their website.

Another widely used gray hat technique is a webmaster creating multiple 'micro-sites' which
he or she controls for the sole purpose of cross linking to the target site. Since all of the
micro-sites have the same owner, this self-linking violates the principles behind the search
engines' algorithms, but because ownership of sites is hard for search engines to trace, the
micro-sites are difficult to detect and can appear to be independent sites, especially when they
are hosted on separate Class-C IPs.

In computing, spamdexing (also known as search spam, search engine spam, web spam or
search engine poisoning) is the deliberate manipulation of search engine indexes. It involves
a number of methods, such as repeating unrelated phrases, to manipulate the relevance or
prominence of resources indexed in a manner inconsistent with the purpose of the indexing
system. Some consider it to be a part of search engine optimization, though there are many
search engine optimization methods that improve the quality and appearance of the content of
web sites and serve content useful to many users. Search engines use a variety of algorithms
to determine relevancy ranking. Some of these include determining whether the search term
appears in the META keywords tag, others whether the search term appears in the body text
or URL of a web page. Many search engines check for instances of spamdexing and will
remove suspect pages from their indexes. Also, people working for a search-engine
organization can quickly block the results-listing from entire websites that use spamdexing,
perhaps alerted by user complaints of false matches. The rise of spamdexing in the mid-1990s
made the leading search engines of the time less useful.

Common spamdexing techniques can be classified into two broad classes: content spam (or
term spam) and link spam.




Content spam
These techniques involve altering the logical view that a search engine has over the page's
contents. They all aim at variants of the vector space model for information retrieval on text
collections.

Keyword stuffing

Keyword stuffing involves the calculated placement of keywords within a page to raise the
keyword count, variety, and density of the page. This is useful to make a page appear to be
relevant for a web crawler in a way that makes it more likely to be found. Example: A
promoter of a Ponzi scheme wants to attract web surfers to a site where he advertises his
scam. He places hidden text appropriate for a fan page of a popular music group on his page,
hoping that the page will be listed as a fan site and receive many visits from music lovers.
Older versions of indexing programs simply counted how often a keyword appeared, and
used that to determine relevance levels. Most modern search engines have the ability to
analyze a page for keyword stuffing and determine whether the frequency is consistent with
other sites created specifically to attract search engine traffic. Also, large webpages are
truncated, so that massive dictionary lists cannot be indexed on a single webpage.

Hidden or invisible text

Unrelated hidden text is disguised by making it the same color as the background, using a
tiny font size, or hiding it within HTML code such as "no frame" sections, alt attributes, zero-
sized DIVs, and "no script" sections. People screening websites for a search-engine company
might temporarily or permanently block an entire website for having invisible text on some of
its pages. However, hidden text is not always spamdexing: it can also be used to enhance
accessibility.

Meta-tag stuffing

This involves repeating keywords in the Meta tags, and using meta keywords that are
unrelated to the site's content. This tactic has been ineffective since 2005.

Doorway pages

"Gateway" or doorway pages are low-quality web pages created with very little content but
are instead stuffed with very similar keywords and phrases. They are designed to rank highly
within the search results, but serve no purpose to visitors looking for information. A doorway
page will generally have "click here to enter" on the page.

Scraper sites

Scraper sites are created using various programs designed to "scrape" search-engine
results pages or other sources of content and create "content" for a website. The specific
presentation of content on these sites is unique, but is merely an amalgamation of content
taken from other sources, often without permission. Such websites are generally full of
advertising (such as pay-per-click ads), or they redirect the user to other sites. It is even
feasible for scraper sites to outrank original websites for their own information and
organization names.

Article spinning

Article spinning involves rewriting existing articles, as opposed to merely scraping content
from other sites, to avoid penalties imposed by search engines for duplicate content. This
process is undertaken by hired writers or automated using a thesaurus database or a neural
network.

Link spam
Link spam is defined as links between pages that are present for reasons other than merit.
Link spam takes advantage of link-based ranking algorithms, which give a website a higher
ranking the more other highly ranked websites link to it. These techniques also aim at
influencing other link-based ranking techniques such as the HITS algorithm.

Link-building software

A common form of link spam is the use of link-building software to automate the search
engine optimization process.

Link farms

Link farms are tightly-knit communities of pages referencing each other, also known
humorously as mutual admiration societies.

Hidden links

Putting hyperlinks where visitors will not see them to increase link popularity. Highlighted
link text can help rank a webpage higher for matching that phrase.

Sybil attack

A Sybil attack is the forging of multiple identities for malicious intent, named after the
famous multiple personality disorder patient "Sybil" (Shirley Ardell Mason). A spammer may
create multiple web sites at different domain names that all link to each other, such as fake
blogs (known as spam blogs).

Spam blogs

Spam blogs are blogs created solely for commercial promotion and the passage of link
authority to target sites. Often these "splogs" are designed in a misleading manner that gives
the impression of a legitimate website, but upon close inspection they are often written using
spinning software, or are very poorly written with barely readable content. They are similar
in nature to link farms.

Page hijacking

Page hijacking is achieved by creating a rogue copy of a popular website which shows
contents similar to the original to a web crawler but redirects web surfers to unrelated or
malicious websites.

Buying expired domains

Some link spammers monitor DNS records for domains that will expire soon, then buy them
when they expire and replace the pages with links to their pages. See Domaining. However,
Google resets the link data on expired domains. Some of these techniques may be applied for
creating a Google bomb, that is, cooperating with other users to boost the ranking of a
particular page for a particular query.

Cookie stuffing

Cookie stuffing involves placing an affiliate tracking cookie on a website visitor's computer
without their knowledge, which will then generate revenue for the person doing the cookie
stuffing. This not only generates fraudulent affiliate sales, but also has the potential to
overwrite other affiliates' cookies, essentially stealing their legitimately earned commissions.

Using world-writable pages
Main article: forum spam

Web sites that can be edited by users can be used by spamdexers to insert links to spam sites
if the appropriate anti-spam measures are not taken.

Automated spam bots can rapidly make the user-editable portion of a site unusable.
Programmers have developed a variety of automated spam prevention techniques to block or
at least slow down spam bots.

Spam in blogs

Spam in blogs is the placing or solicitation of links randomly on other sites, placing a desired
keyword into the hyperlinked text of the inbound link. Guest books, forums, blogs, and any
site that accepts visitors' comments are particular targets and are often victims of drive-by
spamming where automated software creates nonsense posts with links that are usually
irrelevant and unwanted.

Comment spam

Comment spam is a form of link spam that has arisen in web pages that allow dynamic user
editing such as wikis, blogs, and guest books. It can be problematic because agents can be
written that automatically randomly select a user edited web page, such as a Wikipedia
article, and add spamming links.

Wiki spam

Wiki spam is a form of link spam on wiki pages. The spammer uses the open edit ability of
wiki systems to place links from the wiki site to the spam site. The subject of the spam site is
often unrelated to the wiki page where the link is added. In early 2005, Wikipedia
implemented a default "nofollow" value for the "rel" HTML attribute. Links with this
attribute are ignored by Google's PageRank algorithm. Forum and wiki admins can use these
to discourage wiki spam.

Referrer log spamming

Referrer spam takes place when a spam perpetrator or facilitator accesses a web page (the
referee), by following a link from another web page (the referrer), so that the referee is given
the address of the referrer by the person's Internet browser. Some websites have a referrer log
which shows which pages link to that site. By having a robot randomly access many sites
enough times, with a message or specific address given as the referrer, that message or
Internet address then appears in the referrer log of those sites that have referrer logs. Since
some Web search engines base the importance of sites on the number of different sites
linking to them, referrer-log spam may increase the search engine rankings of the spammer's
sites. Also, site administrators who notice the referrer log entries in their logs may follow the
link back to the spammer's referrer page.



2[http://www.webconfs.com/seo-tutorial/introduction-to-seo.php]

Whenever you enter a query in a search engine and hit 'enter' you get a list of web results that
contain that query term. Users normally tend to visit websites that are at the top of this list as
they perceive those to be more relevant to the query. If you have ever wondered why some of
these websites rank better than the others then you must know that it is because of a powerful
web marketing technique called Search Engine Optimization (SEO).

SEO is a technique which helps search engines find and rank your site higher than the
millions of other sites in response to a search query. SEO thus helps you get traffic from
search engines.

This SEO tutorial covers all the necessary information you need to know about Search
Engine Optimization - what it is, how it works, and the differences in the ranking criteria of
major search engines.

1. How Search Engines Work




The first basic truth you need to know to learn SEO is that search engines are not humans.
While this might be obvious for everybody, the differences between how humans and search
engines view web pages aren't. Unlike humans, search engines are text-driven. Although
technology advances rapidly, search engines are far from intelligent creatures that can feel the
beauty of a cool design or enjoy the sounds and movement in movies. Instead, search engines
crawl the Web, looking at particular site items (mainly text) to get an idea what a site is
about. This brief explanation is not the most precise because as we will see next, search
engines perform several activities in order to deliver search results – crawling, indexing,
processing, calculating relevancy, and retrieving.
First, search engines crawl the Web to see what is there. This task is performed by a piece of
software, called a crawler or a spider (or Googlebot, as is the case with Google). Spiders
follow links from one page to another and index everything they find on their way. Given
the number of pages on the Web (over 20 billion), it is impossible for a spider to visit a site
daily just to see if a new page has appeared or if an existing page has been modified;
sometimes crawlers may not visit your site for a month or two.

What you can do is to check what a crawler sees from your site. As already mentioned,
crawlers are not humans and they do not see images, Flash movies, JavaScript, frames,
password-protected pages and directories, so if you have tons of these on your site, you'd
better run a spider simulator tool to see whether these elements are viewable by the spider. If
they are not viewable, they will not be spidered, not indexed, not processed, etc. - in a word,
they will be non-existent for search engines.
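
Since crawlers are text-driven, a toy fetch-and-follow loop illustrates the idea. This is only an
illustrative sketch; the seed URL, crawl budget and href regex are invented for illustration, and
real spiders use proper HTML parsers, robots.txt handling and politeness rules.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy spider: fetches pages over HTTP and follows the links it finds in the raw HTML. */
public class ToySpider {
    // Naive pattern for href="..." links; real crawlers use a proper HTML parser.
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add("http://example.com/");                   // hypothetical seed URL

        while (!frontier.isEmpty() && visited.size() < 10) {   // small crawl budget
            String url = frontier.poll();
            if (!visited.add(url)) continue;                   // skip already-visited pages

            StringBuilder html = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                String line;
                while ((line = in.readLine()) != null) html.append(line).append('\n');
            } catch (Exception e) {
                continue;                                      // unreachable page: ignore it
            }

            System.out.println("Crawled " + url + " (" + html.length() + " chars of HTML)");

            // The spider only sees text: extract outgoing links and queue them for later visits.
            Matcher m = LINK.matcher(html);
            while (m.find()) frontier.add(m.group(1));
        }
    }
}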

After a page is crawled, the next step is to index its content. The indexed page is stored in a
giant database, from where it can later be retrieved. Essentially, the process of indexing is
identifying the words and expressions that best describe the page and assigning the page to
particular keywords. For a human it will not be possible to process such amounts of
information but generally search engines deal just fine with this task. Sometimes they might
not get the meaning of a page right but if you help them by optimizing it, it will be easier for
them to classify your pages correctly and for you – to get higher rankings.
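
Conceptually, the "giant database" described above is built around an inverted index: a map from
each word to the pages that contain it. Below is a minimal sketch of that data structure; the sample
pages are invented for illustration and the real structures used by search engines are far richer.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Minimal inverted index: maps each word to the IDs of the pages containing it. */
public class TinyIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private final List<String> pages = new ArrayList<>();

    /** Index a page: split it into lowercase words and record the page ID under each word. */
    public void addPage(String text) {
        int id = pages.size();
        pages.add(text);
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, w -> new TreeSet<>()).add(id);
            }
        }
    }

    /** Look up the pages assigned to a keyword. */
    public TreeSet<Integer> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.addPage("Search engine optimization improves visibility");   // page 0
        index.addPage("Search engines crawl and index the Web");           // page 1
        System.out.println(index.lookup("search"));   // prints [0, 1]
        System.out.println(index.lookup("crawl"));    // prints [1]
    }
}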

When a search request comes, the search engine processes it – i.e. it compares the search
string in the search request with the indexed pages in the database. Since it is likely that more
than one page (practically it is millions of pages) contains the search string, the search engine
starts calculating the relevancy of each of the pages in its index with the search string.

There are various algorithms to calculate relevancy. Each of these algorithms has different
relative weights for common factors like keyword density, links, or metatags. That is why
different search engines give different search results pages for the same search string. What is
more, it is a known fact that all major search engines, like Yahoo!, Google, Bing, etc.,
periodically change their algorithms, and if you want to stay at the top, you also need to adapt
your pages to the latest changes. This is one reason (the other is your competitors) to devote
ongoing effort to SEO if you'd like to be at the top.
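
As a toy illustration of "different relative weights for common factors", the sketch below
combines three such factors into a single score. The weights, factors and numbers are invented
purely for illustration; no real engine's algorithm is this simple or publicly known.

/** Illustrative only: a weighted-sum relevance score over a few common ranking factors. */
public class ToyRelevance {
    // Hypothetical weights; every engine uses its own values and many more factors.
    static final double W_DENSITY = 0.5, W_LINKS = 0.3, W_META = 0.2;

    static double score(double keywordDensity, int inboundLinks, boolean termInMetaTags) {
        double linkSignal = Math.log(1 + inboundLinks);      // diminishing returns on links
        return W_DENSITY * keywordDensity
             + W_LINKS   * linkSignal
             + W_META    * (termInMetaTags ? 1.0 : 0.0);
    }

    public static void main(String[] args) {
        // Two hypothetical pages matching the same query:
        System.out.println(score(0.04, 120, true));   // keyword-rich, well-linked page
        System.out.println(score(0.01, 3, false));    // weaker match, ranked lower
    }
}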

The last step in search engines' activity is retrieving the results. Basically, it is nothing more
than simply displaying them in the browser – i.e. the endless pages of search results that are
sorted from the most relevant to the least relevant sites.



Indexing

Search engine indexing collects, parses, and stores data to facilitate fast and accurate
information retrieval. Index design incorporates interdisciplinary concepts from linguistics,
cognitive psychology, mathematics, informatics, physics and computer science. An alternate
name for the process in the context of search engines designed to find web pages on the
Internet is Web indexing.
Popular engines focus on the full-text indexing of online, natural language documents. Media
types such as video, audio, and graphics are also searchable.

Meta search engines reuse the indices of other services and do not store a local index,
whereas cache-based search engines permanently store the index along with the corpus.
Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size.
Larger services typically perform indexing at a predetermined time interval due to the
required time and processing costs, while agent-based search engines index in real time.



1 Search engine architecture

[http://www.ibm.com/]

Architecture overview

The architecture of a common Web search engine contains a front-end process and a back-
end process, as shown in Figure 1. In the front-end process, the user enters the search words
into the search engine interface, which is usually a Web page with an input box. The
application then parses the search request into a form that the search engine can understand,
and then the search engine executes the search operation on the index files. After ranking, the
search engine interface returns the search results to the user. In the back-end process, a spider
or robot fetches the Web pages from the Internet, and then the indexing subsystem parses the
Web pages and stores them into the index files. If you want to use Lucene to build a Web
search application, the final architecture will be similar to that shown in Figure 1.


Figure 1. Web search engine architecture

Implement advanced search with Lucene

Lucene supports several kinds of advanced searches, which I'll discuss in this section. I'll then
demonstrate how to implement these searches with Lucene's Application Programming
Interfaces (APIs).

Boolean operators

Most search engines provide Boolean operators so users can compose queries. Typical
Boolean operators are AND, OR, and NOT. Lucene provides five Boolean operators: AND,
OR, NOT, plus (+), and minus (-). I'll describe each of these operators.

   •   OR: If you want to search for documents that contain the words "A" or "B," use the
       OR operator. Keep in mind that if you don't put any Boolean operator between two
       search words, the OR operator will be added between them automatically. For
       example, "Java OR Lucene" and "Java Lucene" both search for the terms "Java" or
       "Lucene."
   •   AND: If you want to search for documents that contain more than one word, use the
       AND operator. For example, "Java AND Lucene" returns all documents that contain
       both "Java" and "Lucene."
   •   NOT: Documents that contain the search word immediately after the NOT operator
       won't be retrieved. For example, if you want to search for documents that contain
       "Java" but not "Lucene," you may use the query "Java NOT Lucene." You cannot use
       this operator with only one term. For example, the query "NOT Java" returns no
       results.
   •   +: The function of this operator is similar to the AND operator, but it only applies to
       the word immediately following it. For example, if you want to search documents that
       must contain "Java" and may contain "Lucene," you can use the query "+Java
       Lucene."
   •   -: The function of this operator is the same as the NOT operator. The query "Java
       -Lucene" returns all of the documents that contain "Java" but not "Lucene."

Now look at how to implement a query with Boolean operators using Lucene's API. Listing 1
shows the process of doing searches with Boolean operators.
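
The original Listing 1 is not reproduced in this document. The snippet below is only a rough
stand-in sketch written against the Lucene 3.x API (package and class names differ in other
Lucene versions); the index contents and the "content" field name are invented for illustration.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class BooleanSearchDemo {
    public static void main(String[] args) throws Exception {
        // Build a tiny in-memory index with a single "content" field.
        RAMDirectory dir = new RAMDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_36, analyzer));
        for (String text : new String[] {
                "Java and Lucene make a search engine",
                "Java without the search library",
                "Lucene indexing internals" }) {
            Document doc = new Document();
            doc.add(new Field("content", text, Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();

        // Parse a Boolean query: documents that contain "Java" but not "Lucene".
        QueryParser parser = new QueryParser(Version.LUCENE_36, "content", analyzer);
        Query query = parser.parse("Java NOT Lucene");

        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("content"));
        }
        searcher.close();
    }
}

The same query can also be assembled programmatically with BooleanQuery and
BooleanClause.Occur flags, which the field-search sketch in the next section uses.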




Field search

Lucene supports field search. You can specify the fields that a query will be executed on. For
example, if your document contains two fields, Title and Content, you can use the query
"Title: Lucene AND Content: Java" to search for documents that contain the term "Lucene"
in the Title field and "Java" in the Content field. Listing 2 shows how to use Lucene's API to
do a field search.
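
Listing 2 is likewise not included here. A minimal sketch of the same idea, assuming Lucene 3.x,
an already-built IndexSearcher, documents with "title" and "content" fields, and terms stored in
lowercase (as StandardAnalyzer produces); all of these names are assumptions for illustration.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class FieldSearchDemo {
    /** Documents whose "title" field contains "lucene" AND whose "content" field contains "java". */
    static TopDocs searchByField(IndexSearcher searcher) throws Exception {
        BooleanQuery query = new BooleanQuery();
        // Terms are lowercased to match what StandardAnalyzer stores at indexing time.
        query.add(new TermQuery(new Term("title", "lucene")), BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("content", "java")), BooleanClause.Occur.MUST);
        return searcher.search(query, 10);
    }
}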

Wildcard search

Lucene supports two wildcard symbols: the question mark (?) and the asterisk (*). You can
use ? to perform a single-character wildcard search, and you can use * to perform a multiple-
character wildcard search. For example, if you want to search for "tiny" or "tony," you can
use the query "t?ny," and if you want to search for "Teach," "Teacher," and "Teaching," you
can use the query "Teach*." Listing 3 demonstrates the process of doing a wildcard search.
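
Listing 3 is not included here either; the same wildcard query can be built programmatically, as
in this hedged sketch (again assuming Lucene 3.x, a "content" field, and an existing IndexSearcher).

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;

public class WildcardSearchDemo {
    /** "t?ny" matches one unknown character ("tiny", "tony"); "teach*" would match any suffix. */
    static TopDocs wildcardSearch(IndexSearcher searcher) throws Exception {
        WildcardQuery query = new WildcardQuery(new Term("content", "t?ny"));
        return searcher.search(query, 10);
    }
}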




Fuzzy search

Lucene provides a fuzzy search that's based on an edit distance algorithm. You can use the
tilde character (~) at the end of a single search word to do a fuzzy search. For example, the
query "think~" searches for the terms similar in spelling to the term "think." Listing 4
features sample code that conducts a fuzzy search with Lucene's API.
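
Listing 4 is also missing from this document; below is a minimal sketch of a programmatic fuzzy
query under the same assumptions (Lucene 3.x, a "content" field, an existing IndexSearcher).

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;

public class FuzzySearchDemo {
    /** Matches terms within a small edit distance of "think", equivalent to the query "think~". */
    static TopDocs fuzzySearch(IndexSearcher searcher) throws Exception {
        FuzzyQuery query = new FuzzyQuery(new Term("content", "think"));
        return searcher.search(query, 10);
    }
}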




Range search

A range search matches the documents whose field values are in a range. For example, the
query "age:[18 TO 35]" returns all of the documents with the value of the "age" field between
18 and 35. Listing 5 shows the process of doing a range search with Lucene's API.
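
Listing 5 is not reproduced here. The sketch below assumes Lucene 3.x, an existing IndexSearcher,
and that the "age" field was indexed as a numeric field; a plain QueryParser range on a string
field would instead compare values lexicographically.

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.TopDocs;

public class RangeSearchDemo {
    /** Equivalent of "age:[18 TO 35]" on a numeric "age" field; both endpoints are inclusive. */
    static TopDocs rangeSearch(IndexSearcher searcher) throws Exception {
        NumericRangeQuery<Integer> query = NumericRangeQuery.newIntRange("age", 18, 35, true, true);
        return searcher.search(query, 10);
    }
}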

2     Searching a small national domain--Preliminary report
           András A. Benczúr Károly Csalogány Dániel Fogaras Eszter Friedman
                          Tamás Sarlós Máté Uher Eszter Windhager
    Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI)
                          11 Lagymanyosi u., H--1111 Budapest, Hungary
       Eötvös University, Budapest, and Budapest University of Technology and Economics
              {benczur,cskaresz,fd,feszter,stamas,umate,hexapoda}@ilab.sztaki.hu
                               http://www.ilab.sztaki.hu/websearch

ABSTRACT

Small languages represent a non-negligible portion of the Web with interest for a large
population with less literacy in English. Existing search engine solutions however vary in
quality mostly because a few of these languages have a particularly complicated syntax that
requires communication between linguistic tools and "classical" Web search techniques. In
this paper we present development stage experiments of a search engine for the .hu or other
similar national domains. Such an engine differs in several design issues from large-scale
engines; as an example we apply efficient crawling and indexing policies that may enable
breaking news search.

Keywords

Web crawling, search engines, database freshness, refresh policies.
1. INTRODUCTION

While search engines face the challenge of billions of Web pages with content in English or
some other major language, small languages such as Hungarian represent a quantity orders of
magnitude smaller. A few of the small languages have a particularly complicated syntax--
most notably, Hungarian is one of the languages that troubles a search engine designer the
most by requiring complex interaction between linguistic tools and ranking methods.

Searching such a small national domain is of commercial interest for a non-negligible
population; existing solutions vary in quality. Despite its importance, the design, experiments
or benchmark tests of small-range national search engine development only rarely appear in
the literature.

A national domain such as the .hu covers a moderate size of not much more than ten million
HTML pages and concentrates most of the national language documents. The size,
concentration and network locality allow an engine to run on an inexpensive architecture
while it may outperform large-scale engines by orders of magnitude in keeping the index up
to date. In our experiments we use a refresh policy that may fit the purpose of a news
agent. Our architecture will hence necessarily differ from those appearing in published
experiments or benchmark tests at several points.

In this paper we present development stage experiments of a search engine for the .hu or
other similar domain that may form the base of a regional or a distributed engine. Peculiar to
search over small domains are our refresh policy and the extensive use of clustering methods.
We revisit documents in a more sophisticated way than in [13]; the procedure is based on our
experiments that show a behavior different from [7] over the Hungarian web. Our method for
detecting domain boundaries improves on [8] by using hyperlink graph clustering to aid
boundary detection. We give a few examples that show the difficulty of Web information
retrieval in agglutinative and multi-language environments; natural language processing in
more detail is however not addressed in this work.

2. ARCHITECTURE

The high-level architecture of a search engine is described in various surveys [2, 4, 13]. Since
we only crawl among a few tens of millions of pages, our engine also contains a mixture of a
news agent and a focused crawler. Similar to the extended crawl model of [5], we include
long-term and short-term crawl managers. The manager determines the long-term schedule
refresh policy based on PageRank [2] information. Our harvester (modified Larbin 2.6.2 [1])
provides short-term schedule and also serves as a seeder by identifying new URLs.
Eventually all new URLs are visited or banned by site-specific rules. We apply a two-level
indexer (BerkeleyDB 3.3.11 [12]) based on [11] that updates a temporary index in frequent
batches; temporary index items are merged into the permanent index at off-peak times and at
certain intervals the entire index is recompiled.

Due to space limitations, a number of implementation issues are described in the full paper,
such as the modifications of the crawler, its interface to the long-term refresh scheduler, and
the indexing architecture and measurements.

Figure 1: Search engine architecture

3. INFORMATION RETRIEVAL IN AN AGGLUTINATIVE AND MULTI-LANGUAGE
ENVIRONMENT

Hungarian is one of the languages that troubles a search engine designer with lost hits or
severe topic drifts when the natural language issues are not properly handled and integrated
into all levels of the engine itself. We found single words with over 300 different word form
occurrences in Web documents (Figure 2); in theory this number may be as large as ten
thousand, while compound rules are nearly as permissive as in German. Word forms of stop
words may or may not be stop words themselves. Technical documents are often filled with
English words and documents may be bi- or multilingual. Stemming is expensive and thus,
unlike in classical IR, we stem in large batches instead of at each word occurrence.

Figure 2: Number of word stems with a given number of word form occurrences in 5 million
Hungarian language documents

Polysemy often confuses search results and makes top hits irrelevant to the query. "Java", the
most frequently cited example of a word with several meanings, in Hungarian also means
"majority of", "its belongings", "its goods", "its best portion", a type of pork, and may also be
incorrectly identified as an agglutination of a frequent abbreviation in mailing lists.
Synonyms "dog" and "hound" ("kutya" and "eb") occur nonexchangeably in compounds such
as "hound breeders" and "dog shows" while one of them also stands for the abbreviation of
European Championship. Or the name of a widely used data mining software "sas" translates
back to English as "eagle" that in addition frequently occurs in Hungarian name compounds.

In order to achieve acceptable precision in Web information retrieval for multilingual
environments or agglutinative languages with a very large number of word forms we need to
consider a large number of mixed linguistic, ranking and IR issues.

   •   We index stems at appropriate levels; for example, in
       (((áll)am)((((ad)ó)s)ság)) = ((stand)state)((((give)tax)debtor)debt)
       we might not even want to stem at all. Too deep stemming causes topic drift; weak
       stemming suffices, since relevant hits are expected to contain the query word several
       times and the stem is likely to occur among its forms.
   •   Efficient phrase search is nontrivial on its own [3, 13]; we also need the original word
       form in the index for this task.
   •   In order to rank the relevance of an index entry we may need to know the syntax of the
       sentence that contains the word, or use document word frequency clustering results.
   •   Document language(s) and possible missing accents (typical in mailing list archives) must
       also be taken into consideration, or else the index term easily changes meaning (beer, for
       example, becomes queue with no accents--"sör" and "sor").
   •   Translations of the stems and forms between Hungarian and English help ranking
       algorithms that use anchor text information.
   •   All of the above issues tend to increase index granularity and the amount of additional
       information stored, and thus the space requirement must be carefully optimized.

4. RANKING AND DOMAIN CLUSTERING

A unique possibility in searching a moderate size domain is the extensive and relatively
inexpensive application of clustering methods. As a key application we determine coherent
domains or multi-part documents. As noticed by Davison [8], hyperlink analysis yields
accurate quality measures only if applied to a higher level link structure of inter-domain
links. The URL text analysis method [8] however often fails for the .hu domain, resulting in
unfair PageRank values for a large number of sites unless our clustering method is used.

5. REFRESH POLICY

Our refresh policy is based on the observed refresh time of the page and its PageRank [2]. We
extend the argument of Cho et al. [6] by weighting freshness measures by functions of the
PageRank pr: given the refresh time refr, we are looking for sync, the synchronization time
function over pages that maximize PageRank times expected freshness. The optimum
solution of the system can be obtained by solving

                         pr * (refr - (sync + refr) exp (-sync / refr)) = u


where we let the Lagrange multiplier u be the maximum value such that the download capacity
constraint is not exceeded. We compute an approximate solution by increasing the refresh rate in
discrete steps for documents with the minimum current u value. The number of equations is
reduced by discretizing PageRank and frequency values.
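
Restated in standard notation, and assuming the usual freshness model in which a page with mean
inter-change time refr changes according to a Poisson process and is re-downloaded every sync
time units, the optimization behind this condition can be sketched as follows; this is a
reconstruction under those assumptions, not the paper's own derivation.

% Reconstruction: the scheduling problem whose Lagrangian stationarity condition is the
% equation above. pr is the page's PageRank and B the total download capacity.
\begin{align*}
\max_{\{\mathrm{sync}_p\}}\;\;
  & \sum_p \mathrm{pr}_p \cdot
    \frac{\mathrm{refr}_p\bigl(1 - e^{-\mathrm{sync}_p/\mathrm{refr}_p}\bigr)}{\mathrm{sync}_p}
  && \text{(PageRank-weighted expected freshness)} \\
\text{subject to}\;\;
  & \sum_p \frac{1}{\mathrm{sync}_p} \le B
  && \text{(download capacity)}
\end{align*}
% Setting the derivative of the Lagrangian with respect to sync_p to zero gives, for every page p,
%   pr_p ( refr_p - (sync_p + refr_p) e^{-sync_p/refr_p} ) = u,
% which is the condition quoted above, with u the multiplier of the capacity constraint.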

We propose another fine tuning for efficiently crawling breaking news. Documents that need
not be revisited every day may be safely scheduled for off-peak hours, thus during daytime
we concentrate on news sites, sites with high PageRank and quick changes. At off-peak hours
however, frequent visits to news portals are of little use and priorities may be modified
accordingly.

6. EXPERIMENTS

We estimate not much more than ten million "interesting" pages of the .hu domain residing at
an approximate 300,000 Web sites. In the current experiment we crawled five million; in
order to obtain a larger collection of valuable documents we need refined site-specific control
over in-site depth and following links that pass arguments.
Among 4.7 million files of size over 30 bytes, depending on its settings the language guesser
finds Hungarian 70-90% of the time and English 27-34% of the time (pages may be
multilingual, or different incorrect guesses may be made at different settings). Outside the .hu
domain we found an additional 280,000 pages, mostly in Hungarian, by simple stop word
heuristics; more refined language guessing turned out to be too expensive for this task.

We conducted a preliminary measurement of the lifetime of HTML documents. Contrary to
what is suggested by [7], our results in Figure 3 do not show any relation between refresh
rates and PageRank. Hence we use the computed optimal schedule based on the observed
refresh rate and the PageRank; in this schedule, documents of a given rank are reloaded more
frequently as their lifetime decreases, up to a certain point where freshness updates are
quickly given up.




Figure 3: Scatterplot of PageRank and average lifetime (in minutes) of 3 million pages of
the .hu domain. In order to amplify low rank pages we show 1/PageRank on the horizontal
axis.

We found that pre-parsed document text as in [10] provides good indication of a content
update. Portals and news sites should in fact be revisited by more sophisticated rules: for
example only a few index.html's need to be recrawled and links must be followed in order to
keep the entire database fresh. Methods for indexing only content and not the automatically
generated navigational and advertisement blocks are given in [9].

7. ACKNOWLEDGMENTS

We thank MorphoLogic Inc. for providing us with their language guesser (LangWitch) and
their Hungarian stemming module (HelyesLem).

Research was supported from NKFP-2/0017/2002 project Data Riddle and various ETIK,
OTKA and AKP grants.

3 Architecture of a Search Engine - Components and Process

 [http://www.beatgoogleusa.com/]


Any internet search engine consists of two major parts - a Back End Database (server side)
and a GUI (client side) to facilitate the user to type the search term.


                                  Basic Search Engine (architecture)


On the server side, the process involves the creation of a database and its periodic updating,
done by a piece of software called a spider. The spider "crawls" the URL pages periodically and
indexes the crawled pages in the database. The hyperlinked nature of the Internet makes it
possible for the spider to traverse the web. The interface between the client and server side
consists of matching the posted query with the entries in the database and retrieving the
matched URLs to the user's machine.
The spider crawls the web pages through the hyperlinks. In this process it extracts the 'title',
'keywords', and any other related information needed for the database from the HTML document.
Sometimes the entire content of the HTML document (except for stop words - very common words
such as "for", "is", etc.) is extracted and indexed in the database. This is based on the idea that a
page dealing with a particular issue will have relevant words throughout its page. Thus indexing all
the words in a document increases the probability of getting the relevant URLs for a query. One
point is worth noting here: before the query words are searched for in the database, their
morphological inflections are removed. The spider is also referred to by other names: "robot",
"crawler", "indexer", etc.

The database consists of a number of tables arranged to aid in quick retrieval of the data. With the
number of sites increasing, it is common for search engines to maintain more than one database
server. When the user queries for term(s), these term(s) are searched for in the database. The sites
in which these term(s) are present are identified. Then these sites are ranked on the basis of the
relevancy they have to the user query. The ranked sites are then displayed, with links to these sites
and a small description taken from the site itself, so as to give the user an idea about the site.
Five key building blocks of a crawling search engine:

   •   CRAWLER (or ROBOT) - a specialised automated program that follows links found on web
       pages and directs the spider where to go next by finding new sites for it to visit. When you
       add your URL to a search engine, it is the crawler you are requesting to visit your site.
   •   SPIDER (or ROBOT) - an automatic browser-like program that downloads documents found
       on the web by the crawler. It works very much as a browser does when it connects to a
       website and downloads pages. Most spiders aren't interested in images, though, and don't
       ask for them to be sent.
   •   INDEXER - a program that "reads" the pages that are downloaded by spiders. This does most
       of the work of deciding what your site is about. The words in the site are "read". Some are
       thrown away, as they are so common ("and", "it", "the", etc.). It will also examine the HTML
       code which makes up your site, looking for other clues as to which words you consider to be
       important. Words in bold, italic or header tags will be given more weight. This is also where
       the meta information (the keywords and description tags) for your site will be analysed.
   •   DATABASE - an index for storage of the pages downloaded and processed. It is where the
       information gathered by the indexer is stored.
   •   RESULTS ENGINE - generates search results out of the database, according to your query.
       This is the most important part of any search engine. The results engine is the customer
       facing (UI) portion of a search engine, and as such is the focus of most optimisation efforts.
       It is the results engine's function to return the pages most relevant to a user's query. When a
       user types in a keyword or phrase, the results engine must decide which pages are most
       likely to be useful to the user. The method it uses to decide that is called its "algorithm". You
       may hear Search Engine Optimisation (SEO) experts discuss "algos" or "breaking the algo"
       for a particular search engine. After all, if you know what the criteria being used are, you can
       write pages to take advantage of them.
Human Factors of XR: Using Human Factors to Design XR Systems
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

Search engine

The distance of a page from the site's root directory may also be a factor in whether or not it gets crawled.

Other methods
A variety of other methods are employed to get a webpage indexed and ranked higher in the results, and often a combination of these methods is used as part of a search engine optimization campaign.

• Cross linking between pages of the same website, and giving more links to the main pages of the website, to increase the PageRank used by search engines. Linking from other websites, including link farming and comment spam.
• Keyword-rich text and key phrases in the webpage, so as to match as many search queries as possible.[7] Adding relevant keywords to a web page's meta tags, up to and including keyword stuffing.
• URL normalization for webpages with multiple URLs, using the "canonical" meta tag[8] (a small normalization sketch follows at the end of this section).
• A backlink from a Web directory.
• SEO trending based on recent search behaviour, using tools like Google Insights for Search.
• Media content creation such as press releases and online newsletters to generate incoming links.

Content Creation and Linking
Content creation is one of the primary focuses of any SEO's job. Without unique, relevant, and easily scannable content, users tend to spend little to no time paying attention to a website. Almost all SEOs that provide organic search improvement focus heavily on creating this type of content, or "linkbait". Linkbait is a term used to describe content that is designed to be shared and replicated virally in an effort to gain backlinks. Often, webmasters and content administrators create blogs to easily provide this information through a method that is intrinsically viral. However, many forget that traffic generated on a hosted blog account doesn't point back to their respective domains, so they lose "link juice". Link juice is jargon for links that provide a boost to PageRank and TrustRank. Moving the blog to a subdomain of the respective domain is a quick way to combat this siphoning of link juice.

Other commonly implemented methodologies for creating and disseminating content include YouTube videos, Google Places accounts, and Picasa and Flickr photos indexed in Google Images searches. These additional forms of content allow webmasters to produce material that ranks well in the world's second most popular search engine, YouTube, in addition to appearing in organic search results.
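The URL normalization bullet above can be made concrete with a small sketch. This is only an illustration using java.net.URI; the sample URLs and the specific normalization rules chosen here (lowercasing the scheme and host, dropping default ports and fragments, resolving dot segments) are assumptions for the example, not rules prescribed by the text.

```java
import java.net.URI;

// Minimal sketch: collapse several equivalent-looking URLs to one canonical form.
// The normalization rules chosen here are illustrative assumptions, not a standard.
public class UrlNormalizer {

    static String normalize(String raw) throws Exception {
        URI uri = new URI(raw).normalize();            // resolves "." and ".." segments
        String scheme = uri.getScheme().toLowerCase();
        String host = uri.getHost().toLowerCase();
        int port = uri.getPort();
        // Drop default ports (80 for http, 443 for https).
        if ((port == 80 && scheme.equals("http")) || (port == 443 && scheme.equals("https"))) {
            port = -1;
        }
        String path = uri.getPath() == null || uri.getPath().isEmpty() ? "/" : uri.getPath();
        // Rebuild without the fragment; keep the query string if present.
        URI canonical = new URI(scheme, null, host, port, path, uri.getQuery(), null);
        return canonical.toString();
    }

    public static void main(String[] args) throws Exception {
        String[] variants = {
            "HTTP://Example.com:80/a/./b/../index.html#top",
            "http://example.com/a/index.html"
        };
        for (String v : variants) {
            System.out.println(v + "  ->  " + normalize(v));
        }
    }
}
```

In practice the chosen canonical form would also be announced to search engines with the "canonical" meta tag mentioned above.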
Gray hat techniques
Gray hat techniques are those that are neither really white hat nor black hat; some of them may be argued either way, and they carry some risk. A very good example of such a technique is purchasing links. The average price for a text link depends on the perceived authority of the linking page. This authority is sometimes measured by Google's PageRank, although that is not necessarily an accurate way of determining the importance of a page. While Google is against the sale and purchase of links, there are people who subscribe to online magazines, memberships and other resources for the purpose of getting a link back to their website.

Another widely used gray hat technique is a webmaster creating multiple 'micro-sites' which he or she controls for the sole purpose of cross linking to the target site. Since all of the micro-sites have the same owner, this self-linking violates the principles behind the search engines' algorithms, but because ownership of sites is not traceable by search engines it is hard to detect, and the micro-sites can appear as different sites, especially when hosted on separate Class-C IP ranges.

In computing, spamdexing (also known as search spam, search engine spam, web spam, or search engine poisoning) is the deliberate manipulation of search engine indexes. It involves a number of methods, such as repeating unrelated phrases, to manipulate the relevance or prominence of resources indexed in a manner inconsistent with the purpose of the indexing system. Some consider it to be a part of search engine optimization, though there are many search engine optimization methods that improve the quality and appearance of the content of web sites and serve content useful to many users.

Search engines use a variety of algorithms to determine relevancy ranking. Some of these include determining whether the search term appears in the META keywords tag, others whether the search term appears in the body text or URL of a web page. Many search engines check for instances of spamdexing and will remove suspect pages from their indexes. Also, people working for a search-engine organization can quickly block the results listing for entire websites that use spamdexing, perhaps alerted by user complaints of false matches. The rise of spamdexing in the mid-1990s made the leading search engines of the time less useful.

Common spamdexing techniques can be classified into two broad classes: content spam (or term spam) and link spam.

Content spam
These techniques involve altering the logical view that a search engine has of the page's contents. They all target variants of the vector space model for information retrieval on text collections.

Keyword stuffing
Keyword stuffing involves the calculated placement of keywords within a page to raise the keyword count, variety, and density of the page. This makes a page appear more relevant to a web crawler, and therefore more likely to be found.
For example, a promoter of a Ponzi scheme wants to attract web surfers to a site where he advertises his scam. He places hidden text appropriate for a fan page of a popular music group on his page, hoping that the page will be listed as a fan site and receive many visits from music lovers. Older versions of indexing programs simply counted how often a keyword appeared and used that to determine relevance levels. Most modern search engines have the ability to analyze a page for keyword stuffing and determine whether the frequency is consistent with sites created specifically to attract search engine traffic. Also, large webpages are truncated, so that massive dictionary lists cannot be indexed on a single webpage.

Hidden or invisible text
Unrelated hidden text is disguised by making it the same color as the background, using a tiny font size, or hiding it within HTML code such as "noframes" sections, alt attributes, zero-sized DIVs, and "noscript" sections. People screening websites for a search-engine company might temporarily or permanently block an entire website for having invisible text on some of its pages. However, hidden text is not always spamdexing: it can also be used to enhance accessibility.

Meta-tag stuffing
This involves repeating keywords in the meta tags and using meta keywords that are unrelated to the site's content. This tactic has been ineffective since 2005.

Doorway pages
"Gateway" or doorway pages are low-quality web pages created with very little content, stuffed instead with very similar keywords and phrases. They are designed to rank highly within the search results but serve no purpose to visitors looking for information. A doorway page will generally have "click here to enter" on the page.

Scraper sites
Scraper sites are created using various programs designed to "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission. Such websites are generally full of advertising (such as pay-per-click ads), or they redirect the user to other sites. It is even feasible for scraper sites to outrank the original websites for their own information and organization names.

Article spinning
Article spinning involves rewriting existing articles, as opposed to merely scraping content from other sites, to avoid penalties imposed by search engines for duplicate content. This process is undertaken by hired writers or automated using a thesaurus database or a neural network.
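The detection idea mentioned under keyword stuffing above, checking whether keyword frequency looks consistent with normal pages, can be illustrated with a toy density check. The 5% threshold, whitespace tokenization, and sample text below are assumptions made purely for the example; real engines combine many more signals.

```java
import java.util.HashMap;
import java.util.Map;

// Toy keyword-density check: flag terms whose share of all tokens exceeds a threshold.
// The 5% threshold and whitespace tokenization are illustrative assumptions only.
public class KeywordDensityCheck {

    public static void main(String[] args) {
        String pageText = "cheap tickets cheap tickets cheap tickets buy cheap tickets "
                        + "online now cheap tickets best cheap tickets deal";
        double threshold = 0.05; // 5% of all tokens

        String[] tokens = pageText.toLowerCase().split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double density = (double) e.getValue() / tokens.length;
            if (density > threshold) {
                System.out.printf("suspicious term '%s': %.1f%% of tokens%n",
                        e.getKey(), density * 100);
            }
        }
    }
}
```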
Link spam
Link spam is defined as links between pages that are present for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more highly ranked websites link to it. These techniques also aim at influencing other link-based ranking techniques such as the HITS algorithm.

Link-building software
A common form of link spam is the use of link-building software to automate the search engine optimization process.

Link farms
Link farms are tightly knit communities of pages referencing each other, also known humorously as mutual admiration societies.

Hidden links
Hidden links are hyperlinks placed where visitors will not see them, in order to increase link popularity. Highlighted link text can help rank a webpage higher for queries matching that phrase.

Sybil attack
A Sybil attack is the forging of multiple identities for malicious intent, named after the famous multiple personality disorder patient "Sybil" (Shirley Ardell Mason). A spammer may create multiple web sites at different domain names that all link to each other, such as fake blogs (known as spam blogs).

Spam blogs
Spam blogs are blogs created solely for commercial promotion and the passage of link authority to target sites. Often these "splogs" are designed in a misleading manner that gives the impression of a legitimate website, but upon close inspection they will often turn out to be written using spinning software, or to be very poorly written and barely readable. They are similar in nature to link farms.

Page hijacking
Page hijacking is achieved by creating a rogue copy of a popular website which shows content similar to the original to a web crawler but redirects web surfers to unrelated or malicious websites.

Buying expired domains
Some link spammers monitor DNS records for domains that will expire soon, then buy them when they expire and replace the pages with links to their own pages. See Domaining. However, Google resets the link data on expired domains.

Some of these techniques may be applied for creating a Google bomb, that is, cooperating with other users to boost the ranking of a particular page for a particular query.
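All of the link-spam techniques above try to game link-based ranking of the PageRank family, in which a page inherits importance from the pages that link to it. As a point of reference, here is a minimal power-iteration sketch of that idea over a tiny invented link graph; the graph, the damping factor of 0.85, and the iteration count are assumptions for the example, not the actual algorithm used by any particular engine.

```java
import java.util.Arrays;

// Minimal PageRank power iteration over a tiny adjacency list.
// The graph, damping factor, and iteration count are illustrative assumptions.
public class TinyPageRank {

    public static void main(String[] args) {
        int[][] outLinks = {
            {1, 2},   // page 0 links to 1 and 2
            {2},      // page 1 links to 2
            {0},      // page 2 links back to 0
            {2}       // page 3 links to 2
        };
        int n = outLinks.length;
        double damping = 0.85;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);

        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);
            for (int page = 0; page < n; page++) {
                double share = damping * rank[page] / outLinks[page].length;
                for (int target : outLinks[page]) {
                    next[target] += share;   // each page passes rank to the pages it links to
                }
            }
            rank = next;
        }
        for (int page = 0; page < n; page++) {
            System.out.printf("page %d: %.4f%n", page, rank[page]);
        }
    }
}
```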
Cookie stuffing
Cookie stuffing involves placing an affiliate tracking cookie on a website visitor's computer without their knowledge, which then generates revenue for the person doing the cookie stuffing. This not only generates fraudulent affiliate sales but also has the potential to overwrite other affiliates' cookies, essentially stealing their legitimately earned commissions.

Using world-writable pages
Main article: forum spam
Web sites that can be edited by users can be used by spamdexers to insert links to spam sites if the appropriate anti-spam measures are not taken. Automated spam bots can rapidly make the user-editable portion of a site unusable. Programmers have developed a variety of automated spam prevention techniques to block or at least slow down spam bots.

Spam in blogs
Spam in blogs is the placing or solicitation of links randomly on other sites, placing a desired keyword into the hyperlinked text of the inbound link. Guest books, forums, blogs, and any site that accepts visitors' comments are particular targets and are often victims of drive-by spamming, where automated software creates nonsense posts with links that are usually irrelevant and unwanted.

Comment spam
Comment spam is a form of link spam that has arisen on web pages that allow dynamic user editing such as wikis, blogs, and guest books. It can be problematic because agents can be written that automatically select a random user-edited web page, such as a Wikipedia article, and add spamming links.

Wiki spam
Wiki spam is a form of link spam on wiki pages. The spammer uses the open editability of wiki systems to place links from the wiki site to the spam site. The subject of the spam site is often unrelated to the wiki page where the link is added. In early 2005, Wikipedia implemented a default "nofollow" value for the "rel" HTML attribute. Links with this attribute are ignored by Google's PageRank algorithm. Forum and wiki admins can use it to discourage wiki spam.
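As a sketch of how a forum or wiki operator might apply the rel="nofollow" countermeasure described above to user-submitted content, the snippet below rewrites every anchor tag in a comment. It assumes the jsoup HTML parser is available on the classpath; the sample comment HTML is invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch: mark every outbound link in a user-submitted comment as rel="nofollow".
// Assumes the jsoup HTML parser is on the classpath; the sample HTML is made up.
public class NofollowRewriter {

    static String addNofollow(String userHtml) {
        Document doc = Jsoup.parseBodyFragment(userHtml);
        for (Element link : doc.select("a[href]")) {
            link.attr("rel", "nofollow");   // tells link-based ranking to ignore this link
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String comment = "Great post! Visit <a href=\"http://spam.example\">my site</a>.";
        System.out.println(addNofollow(comment));
    }
}
```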
Referrer log spamming
Referrer spam takes place when a spam perpetrator or facilitator accesses a web page (the referee) by following a link from another web page (the referrer), so that the referee is given the address of the referrer by the person's Internet browser. Some websites have a referrer log which shows which pages link to that site. By having a robot randomly access many sites enough times, with a message or specific address given as the referrer, that message or Internet address then appears in the referrer logs of those sites that keep them. Since some Web search engines base the importance of sites on the number of different sites linking to them, referrer-log spam may increase the search engine rankings of the spammer's sites. Also, site administrators who notice the referrer log entries in their logs may follow the link back to the spammer's referrer page.

2[http://www.webconfs.com/seo-tutorial/introduction-to-seo.php]

Whenever you enter a query in a search engine and hit 'enter' you get a list of web results that contain that query term. Users normally tend to visit the websites at the top of this list, as they perceive those to be more relevant to the query. If you have ever wondered why some of these websites rank better than others, you should know that it is because of a powerful web marketing technique called Search Engine Optimization (SEO).

SEO is a technique which helps search engines find and rank your site higher than the millions of other sites in response to a search query. SEO thus helps you get traffic from search engines. This SEO tutorial covers the necessary information you need to know about Search Engine Optimization: what it is, how it works, and the differences in the ranking criteria of the major search engines.

1. How Search Engines Work
The first basic truth you need to know to learn SEO is that search engines are not humans. While this might be obvious to everybody, the differences between how humans and search engines view web pages are not. Unlike humans, search engines are text-driven. Although technology advances rapidly, search engines are far from intelligent creatures that can feel the beauty of a cool design or enjoy the sounds and movement in movies. Instead, search engines crawl the Web, looking at particular site items (mainly text) to get an idea of what a site is about. This brief explanation is not the most precise because, as we will see next, search engines perform several activities in order to deliver search results – crawling, indexing, processing, calculating relevancy, and retrieving.
First, search engines crawl the Web to see what is there. This task is performed by a piece of software called a crawler or a spider (or Googlebot, in the case of Google). Spiders follow links from one page to another and index everything they find on their way. Given the number of pages on the Web (over 20 billion), it is impossible for a spider to visit a site daily just to see whether a new page has appeared or an existing page has been modified; crawlers may not end up visiting your site for a month or two.

What you can do is check what a crawler sees of your site. As already mentioned, crawlers are not humans, and they do not see images, Flash movies, JavaScript, frames, password-protected pages or directories, so if you have tons of these on your site you should run a spider simulator to see whether they are viewable by the spider. If they are not viewable, they will not be spidered, indexed, or processed – in a word, they will be non-existent for search engines.

After a page is crawled, the next step is to index its content. The indexed page is stored in a giant database, from where it can later be retrieved. Essentially, the process of indexing is identifying the words and expressions that best describe the page and assigning the page to particular keywords. A human could not process such amounts of information, but search engines generally deal with this task just fine. Sometimes they might not get the meaning of a page right, but if you help them by optimizing it, it will be easier for them to classify your pages correctly and for you to get higher rankings.

When a search request comes, the search engine processes it – that is, it compares the search string in the request with the indexed pages in the database. Since it is likely that more than one page (in practice, millions of pages) contains the search string, the search engine starts calculating the relevancy of each of the pages in its index to the search string. There are various algorithms to calculate relevancy, and each of them assigns different relative weights to common factors like keyword density, links, or meta tags. That is why different search engines give different results pages for the same search string. What is more, all major search engines, such as Yahoo!, Google, and Bing, periodically change their algorithms, and if you want to stay at the top you also need to adapt your pages to the latest changes. This is one reason (the other being your competitors) to devote permanent effort to SEO.

The last step in a search engine's activity is retrieving the results. Basically, it is nothing more than displaying them in the browser – the endless pages of search results, sorted from the most relevant to the least relevant sites.

Indexing
Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is Web indexing.
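To make the indexing and retrieval steps above concrete, here is a minimal in-memory inverted index: each term maps to the set of documents containing it, and a single-term query is answered by a lookup. The three sample "pages" and the whitespace tokenizer are assumptions for the example; a real index also stores positions, weights, and much more.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Minimal inverted index: term -> set of document ids containing the term.
// The three sample "pages" and the whitespace tokenizer are illustrative assumptions.
public class MiniInvertedIndex {

    public static void main(String[] args) {
        String[] docs = {
            "search engine optimization improves visibility",   // doc 0
            "search engines crawl and index the web",           // doc 1
            "crawlers follow links from page to page"            // doc 2
        };

        Map<String, Set<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId].toLowerCase().split("\\s+")) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }

        // "Retrieval": look up which documents contain a query term.
        String query = "search";
        System.out.println("documents containing '" + query + "': "
                + index.getOrDefault(query, new TreeSet<>()));
    }
}
```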
Popular engines focus on the full-text indexing of online, natural language documents. Media types such as video, audio, and graphics are also searchable. Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval because of the time and processing costs required, while agent-based search engines index in real time.

1 Search engine architecture [http://www.ibm.com/]

Architecture overview
The architecture of a common Web search engine contains a front-end process and a back-end process, as shown in Figure 1. In the front-end process, the user enters the search words into the search engine interface, which is usually a Web page with an input box. The application then parses the search request into a form that the search engine can understand, and the search engine executes the search operation on the index files. After ranking, the search engine interface returns the search results to the user. In the back-end process, a spider or robot fetches the Web pages from the Internet, and the indexing subsystem parses the Web pages and stores them into the index files. If you want to use Lucene to build a Web search application, the final architecture will be similar to that shown in Figure 1.

Figure 1. Web search engine architecture
Implement advanced search with Lucene
Lucene supports several kinds of advanced searches, which I'll discuss in this section. I'll then demonstrate how to implement these searches with Lucene's Application Programming Interfaces (APIs).

Boolean operators
Most search engines provide Boolean operators so users can compose queries. Typical Boolean operators are AND, OR, and NOT. Lucene provides five Boolean operators: AND, OR, NOT, plus (+), and minus (-). I'll describe each of these operators.

• OR: If you want to search for documents that contain the words "A" or "B," use the OR operator. Keep in mind that if you don't put any Boolean operator between two search words, the OR operator will be added between them automatically. For example, "Java OR Lucene" and "Java Lucene" both search for the terms "Java" or "Lucene."
• AND: If you want to search for documents that contain more than one word, use the AND operator. For example, "Java AND Lucene" returns all documents that contain both "Java" and "Lucene."
• NOT: Documents that contain the search word immediately after the NOT operator won't be retrieved. For example, if you want to search for documents that contain "Java" but not "Lucene," you may use the query "Java NOT Lucene." You cannot use this operator with only one term. For example, the query "NOT Java" returns no results.
• +: The function of this operator is similar to the AND operator, but it only applies to the word immediately following it. For example, if you want to search documents that must contain "Java" and may contain "Lucene," you can use the query "+Java Lucene."
• -: The function of this operator is the same as the NOT operator. The query "Java -Lucene" returns all of the documents that contain "Java" but not "Lucene."

Now look at how to implement a query with Boolean operators using Lucene's API. Listing 1 shows the process of doing searches with Boolean operators.

Field search
Lucene supports field search. You can specify the fields that a query will be executed on. For example, if your document contains two fields, Title and Content, you can use the query "Title: Lucene AND Content: Java" to search for documents that contain the term "Lucene" in the Title field and "Java" in the Content field. Listing 2 shows how to use Lucene's API to do a field search.
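Listings 1 and 2 are not reproduced in this copy, so here is a hedged sketch in their spirit: it parses Boolean and fielded query strings with Lucene's classic QueryParser and prints the resulting Query objects. It assumes a recent Lucene release (package names and constructors changed between versions) and an index with Title and Content fields; running the queries would then be a call such as IndexSearcher.search(query, 10) against that index.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

// Sketch in the spirit of Listings 1 and 2: Boolean and field queries via QueryParser.
// Assumes a recent Lucene version; older releases take an extra Version argument and
// use a different QueryParser package name.
public class BooleanAndFieldQueries {

    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // "Content" is the default field used when a term has no explicit field prefix.
        QueryParser parser = new QueryParser("Content", analyzer);

        String[] samples = {
            "Java OR Lucene",          // same as "Java Lucene" with the default OR operator
            "Java AND Lucene",
            "Java NOT Lucene",
            "+Java Lucene",
            "Title:Lucene AND Content:Java"   // field search
        };
        for (String s : samples) {
            Query q = parser.parse(s);
            System.out.println(s + "  ->  " + q);
            // Against a real index: TopDocs hits = indexSearcher.search(q, 10);
        }
    }
}
```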
Wildcard search
Lucene supports two wildcard symbols: the question mark (?) and the asterisk (*). You can use ? to perform a single-character wildcard search, and you can use * to perform a multiple-character wildcard search. For example, if you want to search for "tiny" or "tony," you can use the query "t?ny," and if you want to search for "Teach," "Teacher," and "Teaching," you can use the query "Teach*." Listing 3 demonstrates the process of doing a wildcard search.

Fuzzy search
Lucene provides a fuzzy search that's based on an edit distance algorithm. You can use the tilde character (~) at the end of a single search word to do a fuzzy search. For example, the query "think~" searches for terms similar in spelling to the term "think." Listing 4 features sample code that conducts a fuzzy search with Lucene's API.

Range search
A range search matches the documents whose field values fall within a range. For example, the query "age:[18 TO 35]" returns all of the documents with the value of the "age" field between 18 and 35. Listing 5 shows the process of doing a range search with Lucene's API.
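Listings 3, 4, and 5 are likewise missing from this copy; the sketch below covers the same ground by parsing wildcard, fuzzy, and range query strings with the classic QueryParser syntax. The field names, and the assumption that "age" is indexed as a plain string field, are illustrative; numeric ranges would normally use a dedicated numeric field type.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

// Sketch in the spirit of Listings 3-5: wildcard (?, *), fuzzy (~) and range ([x TO y])
// queries expressed in the classic QueryParser syntax. Field names are assumptions.
public class WildcardFuzzyRangeQueries {

    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("Content", new StandardAnalyzer());

        String[] samples = {
            "t?ny",             // single-character wildcard: tiny, tony
            "Teach*",           // multi-character wildcard: Teach, Teacher, Teaching
            "think~",           // fuzzy search based on edit distance
            "age:[18 TO 35]"    // range search on an "age" field indexed as text
        };
        for (String s : samples) {
            Query q = parser.parse(s);
            System.out.println(s + "  ->  " + q.getClass().getSimpleName() + ": " + q);
        }
    }
}
```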
2 Searching a small national domain--Preliminary report
András A. Benczúr, Károly Csalogány, Dániel Fogaras, Eszter Friedman, Tamás Sarlós, Máté Uher, Eszter Windhager
Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI), 11 Lagymanyosi u., H-1111 Budapest, Hungary
Eötvös University, Budapest, and Budapest University of Technology and Economics
{benczur,cskaresz,fd,feszter,stamas,umate,hexapoda}@ilab.sztaki.hu
http://www.ilab.sztaki.hu/websearch

ABSTRACT
Small languages represent a non-negligible portion of the Web, of interest to a large population with less literacy in English. Existing search engine solutions, however, vary in quality, mostly because a few of these languages have a particularly complicated syntax that requires communication between linguistic tools and "classical" Web search techniques. In this paper we present development-stage experiments with a search engine for the .hu or other similar national domains. Such an engine differs in several design issues from large-scale engines; as an example, we apply efficient crawling and indexing policies that may enable breaking-news search.

Keywords
Web crawling, search engines, database freshness, refresh policies.

1. INTRODUCTION
While search engines face the challenge of billions of Web pages with content in English or some other major language, small languages such as Hungarian represent a quantity orders of magnitude smaller. A few of the small languages have a particularly complicated syntax; most notably, Hungarian is one of the languages that troubles a search engine designer the most by requiring complex interaction between linguistic tools and ranking methods. Searching such a small national domain is of commercial interest to a non-negligible population; existing solutions vary in quality. Despite its importance, the design, experiments or benchmark tests of small-range national search engine development only exceptionally appear in the literature.

A national domain such as .hu covers a moderate size of not much more than ten million HTML pages and concentrates most of the national-language documents. The size, concentration and network locality allow an engine to run on an inexpensive architecture while outperforming large-scale engines by orders of magnitude in keeping the index up to date. In our experiments we use a refresh policy that may fit the purposes of a news agent. Our architecture will hence necessarily differ at several points from those appearing in published experiments or benchmark tests.

In this paper we present development-stage experiments with a search engine for the .hu or a similar domain that may form the base of a regional or a distributed engine. Peculiar to search over small domains are our refresh policy and the extensive use of clustering methods. We revisit documents in a more sophisticated way than in [13]; the procedure is based on our experiments, which show a behavior over the Hungarian web different from [7]. Our method for detecting domain boundaries improves on [8] by using hyperlink graph clustering to aid boundary detection. We give a few examples that show the difficulty of Web information retrieval in agglutinative and multi-language environments; natural language processing in more detail is, however, not addressed in this work.

2. ARCHITECTURE
The high-level architecture of a search engine is described in various surveys [2, 4, 13]. Since we only crawl a few tens of millions of pages, our engine also contains a mixture of a news agent and a focused crawler. Similar to the extended crawl model of [5], we include long-term and short-term crawl managers. The manager determines the long-term refresh schedule based on PageRank [2] information. Our harvester (a modified Larbin 2.6.2 [1]) provides the short-term schedule and also serves as a seeder by identifying new URLs. Eventually all new URLs are either visited or banned by site-specific rules. We apply a two-level indexer (BerkeleyDB 3.3.11 [12]) based on [11] that updates a temporary index in frequent batches; temporary index items are merged into the permanent index at off-peak times, and at certain intervals the entire index is recompiled. Due to space limitations, a number of implementation issues are described in the full paper, such as the modifications to the crawler, its interface to the long-term refresh scheduler, and the indexing architecture and measurements.
Figure 1: Search engine architecture

3. INFORMATION RETRIEVAL IN AN AGGLUTINATIVE AND MULTI-LANGUAGE ENVIRONMENT
Hungarian is one of the languages that troubles a search engine designer with lost hits or severe topic drift when the natural language issues are not properly handled and integrated into all levels of the engine itself. We found single words with over 300 different word form occurrences in Web documents (Figure 2); in theory this number may be as large as ten thousand, while compound rules are nearly as permissive as in German. Word forms of stop words may or may not be stop words themselves. Technical documents are often filled with English words, and documents may be bi- or multilingual. Stemming is expensive and thus, unlike in classical IR, we stem in large batches instead of at each word occurrence.
Figure 2: Number of word stems with a given number of word form occurrences in 5 million Hungarian language documents

Polysemy often confuses search results and makes top hits irrelevant to the query. "Java", the most frequently cited example of a word with several meanings, in Hungarian also means "majority of", "its belongings", "its goods", "its best portion", a type of pork, and may also be incorrectly identified as an agglutination of a frequent abbreviation in mailing lists. The synonyms "dog" and "hound" ("kutya" and "eb") occur non-exchangeably in compounds such as "hound breeders" and "dog shows", while one of them also stands for the abbreviation of European Championship. And the name of a widely used data mining software, "sas", translates back to English as "eagle", which in addition frequently occurs in Hungarian name compounds.

In order to achieve acceptable precision in Web information retrieval for multilingual environments or agglutinative languages with a very large number of word forms, we need to consider a large number of mixed linguistic, ranking and IR issues.

• We index stems at appropriate levels; for example, in (((áll)am)((((ad)ó)s)ság)) = ((stand)state)((((give)tax)debtor)debt) we might not even want to stem at all. Too-deep stemming causes topic drift; weak stemming suffices, since relevant hits are expected to contain the query word several times, and among its forms the stem is likely to occur.
Efficient phrase search is nontrivial on its own [3, 13]; we also need the original word form in the index for this task.
• In order to rank the relevance of an index entry we may need to know the syntax of the sentence that contains the word, or use document word frequency clustering results.
• Document language(s) and possible missing accents (typical in mailing list archives) must also be taken into consideration, or else the index term easily changes meaning (beer, for example, becomes queue with no accents: "sör" and "sor").
• Translations of the stems and forms between Hungarian and English help ranking algorithms that use anchor text information.
• All of the above issues tend to increase index granularity and the amount of additional information stored, and thus space requirements must be carefully optimized.

4. RANKING AND DOMAIN CLUSTERING
A unique possibility in searching a moderate-size domain is the extensive and relatively inexpensive application of clustering methods. As a key application we determine coherent domains or multi-part documents. As noticed by Davison [8], hyperlink analysis yields accurate quality measures only if applied to a higher-level link structure of inter-domain links. The URL text analysis method [8], however, often fails for the .hu domain, resulting in unfair PageRank values for a large number of sites unless our clustering method is used.

5. REFRESH POLICY
Our refresh policy is based on the observed refresh time of the page and its PageRank [2]. We extend the argument of Cho et al. [6] by weighting freshness measures by functions of the PageRank pr: given the refresh time refr, we are looking for sync, the synchronization time function over pages that maximizes PageRank times expected freshness. The optimum solution of the system can be obtained by solving

pr * (refr - (sync + refr) * exp(-sync / refr)) = u

where we let the Lagrange multiplier u be maximum such that the download capacity constraint is not exceeded. We compute an approximate solution by increasing the refresh rate in discrete steps for the documents with the minimum current u value. The number of equations is reduced by discretizing the PageRank and frequency values.

We propose another fine-tuning for efficiently crawling breaking news. Documents that need not be revisited every day may be safely scheduled for off-peak hours; thus during daytime we concentrate on news sites and sites with high PageRank and quick changes. At off-peak hours, however, frequent visits to news portals are of little use and priorities may be modified accordingly.
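To make the quantity in the refresh-policy condition above more tangible, the small program below simply evaluates pr * (refr - (sync + refr) * exp(-sync / refr)) for one page at several synchronization intervals, showing how the value grows toward pr * refr as crawling becomes less frequent. The PageRank and refresh-time numbers are illustrative assumptions; only the formula itself comes from the text.

```java
// Numerically evaluate the left-hand side of the condition quoted above,
// u = pr * (refr - (sync + refr) * exp(-sync / refr)),
// for one toy page at several synchronization intervals. The PageRank value and the
// refresh time are illustrative assumptions; the formula itself is taken from the text.
public class RefreshConditionDemo {

    static double u(double pr, double refr, double sync) {
        return pr * (refr - (sync + refr) * Math.exp(-sync / refr));
    }

    public static void main(String[] args) {
        double pr = 0.01;    // assumed PageRank of the page
        double refr = 2.0;   // assumed observed refresh time: the page changes every 2 days

        for (double sync : new double[] {0.5, 1, 2, 4, 8, 16}) {
            System.out.printf("sync = %5.1f days  ->  u = %.6f%n", sync, u(pr, refr, sync));
        }
        // At the optimum, the paper's condition requires this value to be equal (to the
        // Lagrange multiplier) across pages, subject to the download capacity limit.
    }
}
```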
6. EXPERIMENTS
We estimate that not much more than ten million "interesting" pages of the .hu domain reside at approximately 300,000 Web sites. In the current experiment we crawled five million; in order to obtain a larger collection of valuable documents we need refined site-specific control over in-site depth and over following links that pass arguments.

Among 4.7 million files larger than 30 bytes, depending on its settings the language guesser identifies Hungarian 70-90% of the time and English 27-34% of the time (pages may be multilingual, or different incorrect guesses may be made at different settings). Outside .hu we found an additional 280,000 pages, mostly in Hungarian, by simple stop word heuristics; more refined language guessing turned out to be too expensive for this task.

We conducted a preliminary measurement of the lifetime of HTML documents. Contrary to what is suggested by [7], our results in Figure 3 do not confirm any relation between refresh rates and PageRank. Hence we use the computed optimal schedule based on the observed refresh rate and the PageRank; in this schedule, documents of a given rank are reloaded more frequently as their lifetime decreases, until a certain point where freshness updates are quickly given up.

Figure 3: Scatterplot of PageRank and average lifetime (in minutes) of 3 million pages of the .hu domain. In order to amplify low-rank pages we show 1/PageRank on the horizontal axis.

We found that pre-parsed document text as in [10] provides a good indication of a content update. Portals and news sites should in fact be revisited by more sophisticated rules: for example, only a few index.html's need to be recrawled, and links must be followed in order to
keep the entire database fresh. Methods for indexing only the content, and not the automatically generated navigational and advertisement blocks, are given in [9].

7. ACKNOWLEDGMENTS
Thanks to MorphoLogic Inc. for providing us with their language guesser (LangWitch) and their Hungarian stemming module (HelyesLem). Research was supported by the NKFP-2/0017/2002 project Data Riddle and various ETIK, OTKA and AKP grants.

3 Architecture of a Search Engine - Components and Process [http://www.beatgoogleusa.com/]

Any internet search engine consists of two major parts: a back-end database (server side) and a GUI (client side) to facilitate the user typing in the search term.

Basic Search Engine (architecture)

On the server side, the process involves the creation of a database and its periodic updating, done by a piece of software called a spider. The spider "crawls" the URL pages periodically and indexes the crawled pages in the database. The hyperlinked nature of the Internet makes it possible for the spider to traverse the web. The interface between the client and server sides consists of matching the posted query with the entries in the database and returning the matched URLs to the user's machine.

The spider crawls the web pages through the hyperlinks. In this process it extracts the 'title',
'keywords', and any other related information needed for the database from the HTML document. Sometimes the entire content of the HTML document (except for the stop words, very common words such as "for", "is", etc.) is extracted and indexed in the database. This is based on the idea that a page dealing with a particular issue will have relevant words throughout the page, so indexing all the words in a document increases the probability of returning the relevant URLs for a query. One point is worth noting here: the query words are stripped of morphological inflections before they are searched for in the database. The spider is also referred to by other names: "robot", "crawler", "indexer", etc.

The database consists of a number of tables arranged to aid quick retrieval of the data. With the number of sites increasing, it is common for search engines to maintain more than one database server.

When the user queries for one or more terms, those terms are searched in the database. The sites in which the terms are present are identified, and these sites are then ranked on the basis of their relevancy to the user query. The ranked sites are then displayed, with links to the sites and a small description taken from each site itself, so as to give the user an idea of what the site is about.

Five key building blocks of a crawling search engine:

CRAWLER (or ROBOT) - a specialised automated program that follows links found on web pages, and directs the spider where to go next by finding new sites for it to visit. When you add your URL to a search engine, it is the crawler you are requesting to visit your site.

SPIDER (or ROBOT) - an automatic browser-like program that downloads documents found on the web by the crawler. It works very much as a browser does when it connects to a website and downloads pages. Most spiders aren't interested in images, though, and don't ask for them to be sent.

INDEXER - a program that "reads" the pages that are downloaded by spiders. This does most of the work of deciding what your site is about. The words in the site are "read"; some are thrown away because they are so common (and, it, the, etc.). The indexer also examines the HTML code which makes up your site, looking for other clues as to which words you consider to be important: words in bold, italic or header tags will be given more weight. This is also where the meta information (the keywords and description tags) for your site is analysed.

DATABASE - an index for storage of the pages downloaded and processed. It is where the information gathered by the indexer is stored.

RESULTS ENGINE - generates search results out of the database according to your query. This is the most important part of any search engine. The results engine is the customer-facing (UI) portion of a search engine, and as such is the focus of most optimisation efforts. It is the results engine's function to return the pages most relevant to a user's query. When a user types in a keyword or phrase, the results engine must decide which pages are most likely to be useful to the user. The method it uses to decide that is called its "algorithm". You may hear Search Engine Optimisation (SEO) experts discuss "algos" or "breaking the algo" for a particular search engine. After all, if you know what criteria are being used, you can write pages to take advantage of them.
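To tie the five building blocks together, here is a hedged single-file sketch: it fetches one page, extracts its title, indexes its words (minus a few stop words), and collects outgoing links for the crawler's frontier. It assumes the jsoup HTML parser is available and uses an example URL; a real spider would add politeness (robots.txt, crawl delays), deduplication, ranking, and persistent storage.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hedged sketch of the building blocks described above, for a single page:
// spider (download), indexer (read the words), database (in-memory index),
// crawler (collect links to visit next). Assumes jsoup; the URL is only an example.
public class OnePageSearchEngine {

    public static void main(String[] args) throws Exception {
        String url = "https://example.com/";                 // example start URL
        Set<String> stopWords = Set.of("and", "it", "the", "is", "for", "a", "of", "to");

        // SPIDER: download the document.
        Document page = Jsoup.connect(url).get();

        // INDEXER: read the words, drop stop words, record them in the "database".
        Map<String, Integer> termCounts = new HashMap<>();
        for (String word : page.body().text().toLowerCase().split("\\W+")) {
            if (word.isEmpty() || stopWords.contains(word)) continue;
            termCounts.merge(word, 1, Integer::sum);
        }

        // CRAWLER: collect outgoing links for the frontier of pages to visit next.
        List<String> frontier = new ArrayList<>();
        for (Element link : page.select("a[href]")) {
            frontier.add(link.absUrl("href"));
        }

        System.out.println("title: " + page.title());
        System.out.println("indexed terms: " + new TreeSet<>(termCounts.keySet()));
        System.out.println("links found:   " + frontier);
    }
}
```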