1[http://en.wikipedia.org/wiki/Search_engine_optimization]

Search engine optimization (SEO) is the process of improving the visibility of a website or
a webpage in search engines via the "natural" or un-paid ("organic" or "algorithmic") search
results. Other search engine marketing (SEM) methods, including paid listings, may
achieve greater effectiveness. In general, the earlier (or higher on the page), and more frequently
a site appears in the search results list, the more visitors it will receive from the search
engine's users. SEO may target different kinds of search, including image search, local
search, video search, academic search, news search and industry-specific vertical search
engines. This gives a website a web presence.

As an Internet marketing strategy, SEO considers how search engines work, what people
search for, the actual search terms typed into search engines and which search engines are
preferred by their targeted audience. Optimizing a website may involve editing its content
and HTML and associated coding to both increase its relevance to specific keywords and to
remove barriers to the indexing activities of search engines. Promoting a site to increase the
number of backlinks, or inbound links, is another SEO tactic.

The acronym "SEOs" can refer to "search engine optimizers," a term adopted by an industry
of consultants who carry out optimization projects on behalf of clients, and by employees
who perform SEO services in-house. Search engine optimizers may offer SEO as a stand-
alone service or as a part of a broader marketing campaign. Because effective SEO may
require changes to the HTML source code of a site and site content, SEO tactics may be
incorporated into website development and design. The term "search engine friendly" may be
used to describe website designs, menus, content management systems, images, videos,
shopping carts, and other elements that have been optimized for the purpose of search engine
exposure.

Another class of techniques, known as black hat SEO, search engine poisoning, or
spamdexing, uses methods such as link farms, keyword stuffing and article spinning that
degrade both the relevance of search results and the quality of user-experience with search
engines. Search engines look for sites that employ these techniques in order to remove them
from their indices.

Search engine optimization methods are techniques used by webmasters to get more
visibility for their sites in search engine results pages.

Getting indexed
The leading search engines, such as Google, Bing and Yahoo!, use crawlers to find pages for
their algorithmic search results. Pages that are linked from other search engine indexed pages
do not need to be submitted because they are found automatically. Some search engines,
notably Yahoo!, operate a paid submission service that guarantees crawling for either a set
fee or cost per click. Such programs usually guarantee inclusion in the database, but do not
guarantee specific ranking within the search results. Two major directories, the Yahoo
Directory and the Open Directory Project both require manual submission and human
editorial review.[2] Google offers Google Webmaster Tools, for which an XML Sitemap feed
can be created and submitted for free to ensure that all pages are found, especially pages that
are not discoverable by automatically following links.[3]

Search engine crawlers may look at a number of different factors when crawling a site. Not
every page is indexed by the search engines. Distance of pages from the root directory of a
site may also be a factor in whether or not pages get crawled.

Other methods
A variety of other methods are employed to get a webpage indexed and shown higher in the
results, and often a combination of these methods is used as part of a search engine
optimization campaign.

   •   Cross linking between pages of the same website to give more links to the main pages
       of the site, in order to increase the PageRank used by search engines; linking from
       other websites, including link farming and comment spam.
   •   Keyword-rich text in the webpage and key phrases, so as to match all search queries.[7]
       Adding relevant keywords to a web page's meta tags, including keyword stuffing.
   •   URL normalization for webpages with multiple URLs, using the "canonical" meta tag.[8]
   •   A backlink from a Web directory.
   •   SEO trending based on recent search behaviour, using tools like Google Insights for
       Search.
   •   Media content creation, such as press releases and online newsletters, to generate a
       number of incoming links.

Content Creation and Linking
Content creation is one of the primary focuses of any SEO's job. Without unique, relevant,
and easily scannable content, users tend to spend little to no time paying attention to a
website. Almost all SEOs that provide organic search improvement focus heavily on creating
this type of content, or "linkbait". Linkbait is a term used to describe content that is designed
to be shared and replicated virally in an effort to gain backlinks.

Often, webmasters and content administrators create blogs to easily provide this information
through a method that is intrinsically viral. However, most forget that traffic generated to
blog accounts doesn't point back to their respective domains, so they lose "link juice". Link
juice is jargon for links that provide a boost to PageRank and TrustRank. Changing the
domain of the blog to a subdomain of the respective domain is a quick way to combat this
siphoning of link juice.

Other commonly implemented methodologies for creating and disseminating content include
YouTube videos, Google Places accounts, as well as Picasa and Flickr photos indexed in
Google Image Search. These additional forms of content allow webmasters to produce
content that ranks well in the world's second most popular search engine - YouTube - in
addition to appearing in organic search results.

Gray hat techniques
Gray hat techniques are those that are neither really white nor black hat. Some of these gray
hat techniques may be argued either way. These techniques might have some risk associated
with them. A very good example of such a technique is purchasing links. The average price
for a text link depends on the perceived authority of the linking page. The authority is
sometimes measured by Google's PageRank, although this is not necessarily an accurate way
of determining the importance of a page.

While Google is against the sale and purchase of links, there are people who subscribe to
online magazines, memberships and other resources for the sole purpose of getting a link
back to their website.

Another widely used gray hat technique is a webmaster creating multiple 'micro-sites' which
he or she controls for the sole purpose of cross linking to the target site. Since all of the
micro-sites have the same owner, this self-linking violates the principles behind the search
engines' algorithms, but because ownership of sites is hard for search engines to trace, the
micro-sites are difficult to detect and can appear to be independent sites, especially when they
are hosted on separate Class-C IPs.

In computing, spamdexing (also known as search spam, search engine spam, web spam or
search engine poisoning) is the deliberate manipulation of search engine indexes. It involves
a number of methods, such as repeating unrelated phrases, to manipulate the relevance or
prominence of resources indexed in a manner inconsistent with the purpose of the indexing
system. Some consider it to be a part of search engine optimization, though there are many
search engine optimization methods that improve the quality and appearance of the content of
web sites and serve content useful to many users. Search engines use a variety of algorithms
to determine relevancy ranking. Some of these include determining whether the search term
appears in the META keywords tag, others whether the search term appears in the body text
or URL of a web page. Many search engines check for instances of spamdexing and will
remove suspect pages from their indexes. Also, people working for a search-engine
organization can quickly block the results-listing from entire websites that use spamdexing,
perhaps alerted by user complaints of false matches. The rise of spamdexing in the mid-1990s
made the leading search engines of the time less useful.

Common spamdexing techniques can be classified into two broad classes: content spam (or
term spam) and link spam.




Content spam
These techniques involve altering the logical view that a search engine has over the page's
contents. They all aim at variants of the vector space model for information retrieval on text
collections.

Keyword stuffing

Keyword stuffing involves the calculated placement of keywords within a page to raise the
keyword count, variety, and density of the page. This is useful to make a page appear to be
relevant for a web crawler in a way that makes it more likely to be found. Example: A
promoter of a Ponzi scheme wants to attract web surfers to a site where he advertises his
scam. He places hidden text appropriate for a fan page of a popular music group on his page,
hoping that the page will be listed as a fan site and receive many visits from music lovers.
Older versions of indexing programs simply counted how often a keyword appeared, and
used that to determine relevance levels. Most modern search engines have the ability to
analyze a page for keyword stuffing and determine whether the frequency is consistent with
other sites created specifically to attract search engine traffic. Also, large webpages are
truncated, so that massive dictionary lists cannot be indexed on a single webpage.

Hidden or invisible text

Unrelated hidden text is disguised by making it the same color as the background, using a
tiny font size, or hiding it within HTML code such as "no frame" sections, alt attributes, zero-
sized DIVs, and "no script" sections. People screening websites for a search-engine company
might temporarily or permanently block an entire website for having invisible text on some of
its pages. However, hidden text is not always spamdexing: it can also be used to enhance
accessibility.

Meta-tag stuffing

This involves repeating keywords in the Meta tags, and using meta keywords that are
unrelated to the site's content. This tactic has been ineffective since 2005.

Doorway pages

"Gateway" or doorway pages are low-quality web pages created with very little content but
are instead stuffed with very similar keywords and phrases. They are designed to rank highly
within the search results, but serve no purpose to visitors looking for information. A doorway
page will generally have "click here to enter" on the page.

Scraper sites

Scraper sites are created using various programs designed to "scrape" search-engine
results pages or other sources of content and create "content" for a website. The specific
presentation of content on these sites is unique, but is merely an amalgamation of content
taken from other sources, often without permission. Such websites are generally full of
advertising (such as pay-per-click ads), or they redirect the user to other sites. It is even
feasible for scraper sites to outrank original websites for their own information and
organization names.

Article spinning

Article spinning involves rewriting existing articles, as opposed to merely scraping content
from other sites, to avoid penalties imposed by search engines for duplicate content. This
process is undertaken by hired writers or automated using a thesaurus database or a neural
network.

Link spam
Link spam is defined as links between pages that are present for reasons other than merit.
Link spam takes advantage of link-based ranking algorithms, which give a website a higher
ranking the more other highly ranked websites link to it. These techniques also aim at
influencing other link-based ranking techniques such as the HITS algorithm.

Link-building software

A common form of link spam is the use of link-building software to automate the search
engine optimization process.

Link farms

Link farms are tightly-knit communities of pages referencing each other, also known
humorously as mutual admiration societies.

Hidden links

Putting hyperlinks where visitors will not see them to increase link popularity. Highlighted
link text can help rank a webpage higher for matching that phrase.

Sybil attack

A Sybil attack is the forging of multiple identities for malicious intent, named after the
famous multiple personality disorder patient "Sybil" (Shirley Ardell Mason). A spammer may
create multiple web sites at different domain names that all link to each other, such as fake
blogs (known as spam blogs).

Spam blogs

Spam blogs are blogs created solely for commercial promotion and the passage of link
authority to target sites. Often these "splogs" are designed in a misleading manner that gives
the impression of a legitimate website, but upon close inspection they are often written using
spinning software, or are very poorly written with barely readable content. They are similar
in nature to link farms.

Page hijacking

Page hijacking is achieved by creating a rogue copy of a popular website which shows
contents similar to the original to a web crawler but redirects web surfers to unrelated or
malicious websites.

Buying expired domains

Some link spammers monitor DNS records for domains that will expire soon, then buy them
when they expire and replace the pages with links to their pages. See Domaining. However,
Google resets the link data on expired domains. Some of these techniques may be applied for
creating a Google bomb, that is, cooperating with other users to boost the ranking of a
particular page for a particular query.

Cookie stuffing

Cookie stuffing involves placing an affiliate tracking cookie on a website visitor's computer
without their knowledge, which will then generate revenue for the person doing the cookie
stuffing. This not only generates fraudulent affiliate sales, but also has the potential to
overwrite other affiliates' cookies, essentially stealing their legitimately earned commissions.

Using world-writable pages
Main article: forum spam

Web sites that can be edited by users can be used by spamdexers to insert links to spam sites
if the appropriate anti-spam measures are not taken.

Automated spam bots can rapidly make the user-editable portion of a site unusable.
Programmers have developed a variety of automated spam prevention techniques to block or
at least slow down spam bots.

Spam in blogs

Spam in blogs is the placing or solicitation of links randomly on other sites, placing a desired
keyword into the hyperlinked text of the inbound link. Guest books, forums, blogs, and any
site that accepts visitors' comments are particular targets and are often victims of drive-by
spamming where automated software creates nonsense posts with links that are usually
irrelevant and unwanted.

Comment spam

Comment spam is a form of link spam that has arisen in web pages that allow dynamic user
editing such as wikis, blogs, and guest books. It can be problematic because agents can be
written that automatically randomly select a user edited web page, such as a Wikipedia
article, and add spamming links.

Wiki spam

Wiki spam is a form of link spam on wiki pages. The spammer uses the open edit ability of
wiki systems to place links from the wiki site to the spam site. The subject of the spam site is
often unrelated to the wiki page where the link is added. In early 2005, Wikipedia
implemented a default "nofollow" value for the "rel" HTML attribute. Links with this
attribute are ignored by Google's PageRank algorithm. Forum and wiki admins can use these
to discourage wiki spam.

Referrer log spamming

Referrer spam takes place when a spam perpetrator or facilitator accesses a web page (the
referee), by following a link from another web page (the referrer), so that the referee is given
the address of the referrer by the person's Internet browser. Some websites have a referrer log
which shows which pages link to that site. By having a robot randomly access many sites
enough times, with a message or specific address given as the referrer, that message or
Internet address then appears in the referrer log of those sites that have referrer logs. Since
some Web search engines base the importance of sites on the number of different sites
linking to them, referrer-log spam may increase the search engine rankings of the spammer's
sites. Also, site administrators who notice the referrer log entries in their logs may follow the
link back to the spammer's referrer page.



2[http://www.webconfs.com/seo-tutorial/introduction-to-seo.php]

Whenever you enter a query in a search engine and hit 'enter' you get a list of web results that
contain that query term. Users normally tend to visit websites that are at the top of this list as
they perceive those to be more relevant to the query. If you have ever wondered why some of
these websites rank better than the others then you must know that it is because of a powerful
web marketing technique called Search Engine Optimization (SEO).

SEO is a technique which helps search engines find and rank your site higher than the
millions of other sites in response to a search query. SEO thus helps you get traffic from
search engines.

This SEO tutorial covers all the necessary information you need to know about Search
Engine Optimization - what it is, how it works, and the differences in the ranking criteria of
major search engines.

1. How Search Engines Work




The first basic truth you need to know to learn SEO is that search engines are not humans.
While this might be obvious for everybody, the differences between how humans and search
engines view web pages aren't. Unlike humans, search engines are text-driven. Although
technology advances rapidly, search engines are far from intelligent creatures that can feel the
beauty of a cool design or enjoy the sounds and movement in movies. Instead, search engines
crawl the Web, looking at particular site items (mainly text) to get an idea what a site is
about. This brief explanation is not the most precise because as we will see next, search
engines perform several activities in order to deliver search results – crawling, indexing,
processing, calculating relevancy, and retrieving.
First, search engines crawl the Web to see what is there. This task is performed by a piece of
software, called a crawler or a spider (or Googlebot, as is the case with Google). Spiders
follow links from one page to another and index everything they find on their way. Given
the number of pages on the Web (over 20 billion), it is impossible for a spider to visit a site
daily just to see if a new page has appeared or if an existing page has been modified;
sometimes crawlers may not visit your site for a month or two.

What you can do is to check what a crawler sees from your site. As already mentioned,
crawlers are not humans and they do not see images, Flash movies, JavaScript, frames,
password-protected pages and directories, so if you have tons of these on your site, you'd
better run a spider simulator tool to see whether these elements are viewable by the spider. If
they are not viewable, they will not be spidered, not indexed, not processed, etc. - in a word,
they will be non-existent for search engines.
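
Since crawlers are text-driven, a toy fetch-and-follow loop illustrates the idea. This is only an
illustrative sketch; the seed URL, crawl budget and href regex are invented for illustration, and
real spiders use proper HTML parsers, robots.txt handling and politeness rules.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy spider: fetches pages over HTTP and follows the links it finds in the raw HTML. */
public class ToySpider {
    // Naive pattern for href="..." links; real crawlers use a proper HTML parser.
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add("http://example.com/");                   // hypothetical seed URL

        while (!frontier.isEmpty() && visited.size() < 10) {   // small crawl budget
            String url = frontier.poll();
            if (!visited.add(url)) continue;                   // skip already-visited pages

            StringBuilder html = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                String line;
                while ((line = in.readLine()) != null) html.append(line).append('\n');
            } catch (Exception e) {
                continue;                                      // unreachable page: ignore it
            }

            System.out.println("Crawled " + url + " (" + html.length() + " chars of HTML)");

            // The spider only sees text: extract outgoing links and queue them for later visits.
            Matcher m = LINK.matcher(html);
            while (m.find()) frontier.add(m.group(1));
        }
    }
}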

After a page is crawled, the next step is to index its content. The indexed page is stored in a
giant database, from where it can later be retrieved. Essentially, the process of indexing is
identifying the words and expressions that best describe the page and assigning the page to
particular keywords. For a human it will not be possible to process such amounts of
information but generally search engines deal just fine with this task. Sometimes they might
not get the meaning of a page right but if you help them by optimizing it, it will be easier for
them to classify your pages correctly and for you – to get higher rankings.
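
Conceptually, the "giant database" described above is built around an inverted index: a map from
each word to the pages that contain it. Below is a minimal sketch of that data structure; the sample
pages are invented for illustration and the real structures used by search engines are far richer.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Minimal inverted index: maps each word to the IDs of the pages containing it. */
public class TinyIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private final List<String> pages = new ArrayList<>();

    /** Index a page: split it into lowercase words and record the page ID under each word. */
    public void addPage(String text) {
        int id = pages.size();
        pages.add(text);
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, w -> new TreeSet<>()).add(id);
            }
        }
    }

    /** Look up the pages assigned to a keyword. */
    public TreeSet<Integer> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.addPage("Search engine optimization improves visibility");   // page 0
        index.addPage("Search engines crawl and index the Web");           // page 1
        System.out.println(index.lookup("search"));   // prints [0, 1]
        System.out.println(index.lookup("crawl"));    // prints [1]
    }
}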

When a search request comes, the search engine processes it – i.e. it compares the search
string in the search request with the indexed pages in the database. Since it is likely that more
than one page (practically it is millions of pages) contains the search string, the search engine
starts calculating the relevancy of each of the pages in its index with the search string.

There are various algorithms to calculate relevancy. Each of these algorithms has different
relative weights for common factors like keyword density, links, or metatags. That is why
different search engines give different search results pages for the same search string. What is
more, it is a known fact that all major search engines, like Yahoo!, Google, Bing, etc.,
periodically change their algorithms, and if you want to stay at the top, you also need to adapt
your pages to the latest changes. This is one reason (the other is your competitors) to devote
ongoing effort to SEO if you'd like to be at the top.
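
As a toy illustration of "different relative weights for common factors", the sketch below
combines three such factors into a single score. The weights, factors and numbers are invented
purely for illustration; no real engine's algorithm is this simple or publicly known.

/** Illustrative only: a weighted-sum relevance score over a few common ranking factors. */
public class ToyRelevance {
    // Hypothetical weights; every engine uses its own values and many more factors.
    static final double W_DENSITY = 0.5, W_LINKS = 0.3, W_META = 0.2;

    static double score(double keywordDensity, int inboundLinks, boolean termInMetaTags) {
        double linkSignal = Math.log(1 + inboundLinks);      // diminishing returns on links
        return W_DENSITY * keywordDensity
             + W_LINKS   * linkSignal
             + W_META    * (termInMetaTags ? 1.0 : 0.0);
    }

    public static void main(String[] args) {
        // Two hypothetical pages matching the same query:
        System.out.println(score(0.04, 120, true));   // keyword-rich, well-linked page
        System.out.println(score(0.01, 3, false));    // weaker match, ranked lower
    }
}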

The last step in search engines' activity is retrieving the results. Basically, it is nothing more
than simply displaying them in the browser – i.e. the endless pages of search results that are
sorted from the most relevant to the least relevant sites.



Indexing

Search engine indexing collects, parses, and stores data to facilitate fast and accurate
information retrieval. Index design incorporates interdisciplinary concepts from linguistics,
cognitive psychology, mathematics, informatics, physics and computer science. An alternate
name for the process in the context of search engines designed to find web pages on the
Internet is Web indexing.
Popular engines focus on the full-text indexing of online, natural language documents. Media
types such as video, audio, and graphics are also searchable.

Meta search engines reuse the indices of other services and do not store a local index,
whereas cache-based search engines permanently store the index along with the corpus.
Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size.
Larger services typically perform indexing at a predetermined time interval due to the
required time and processing costs, while agent-based search engines index in real time.



1 Search engine architecture

[http://www.ibm.com/]

Architecture overview

The architecture of a common Web search engine contains a front-end process and a back-
end process, as shown in Figure 1. In the front-end process, the user enters the search words
into the search engine interface, which is usually a Web page with an input box. The
application then parses the search request into a form that the search engine can understand,
and then the search engine executes the search operation on the index files. After ranking, the
search engine interface returns the search results to the user. In the back-end process, a spider
or robot fetches the Web pages from the Internet, and then the indexing subsystem parses the
Web pages and stores them into the index files. If you want to use Lucene to build a Web
search application, the final architecture will be similar to that shown in Figure 1.


Figure 1. Web search engine architecture

Implement advanced search with Lucene

Lucene supports several kinds of advanced searches, which I'll discuss in this section. I'll then
demonstrate how to implement these searches with Lucene's Application Programming
Interfaces (APIs).

Boolean operators

Most search engines provide Boolean operators so users can compose queries. Typical
Boolean operators are AND, OR, and NOT. Lucene provides five Boolean operators: AND,
OR, NOT, plus (+), and minus (-). I'll describe each of these operators.

   •   OR: If you want to search for documents that contain the words "A" or "B," use the
       OR operator. Keep in mind that if you don't put any Boolean operator between two
       search words, the OR operator will be added between them automatically. For
       example, "Java OR Lucene" and "Java Lucene" both search for the terms "Java" or
       "Lucene."
   •   AND: If you want to search for documents that contain more than one word, use the
       AND operator. For example, "Java AND Lucene" returns all documents that contain
       both "Java" and "Lucene."
   •   NOT: Documents that contain the search word immediately after the NOT operator
       won't be retrieved. For example, if you want to search for documents that contain
       "Java" but not "Lucene," you may use the query "Java NOT Lucene." You cannot use
       this operator with only one term. For example, the query "NOT Java" returns no
       results.
   •   +: The function of this operator is similar to the AND operator, but it only applies to
       the word immediately following it. For example, if you want to search documents that
       must contain "Java" and may contain "Lucene," you can use the query "+Java
       Lucene."
   •   -: The function of this operator is the same as the NOT operator. The query "Java
       -Lucene" returns all of the documents that contain "Java" but not "Lucene."

Now look at how to implement a query with Boolean operators using Lucene's API. Listing 1
shows the process of doing searches with Boolean operators.
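
The original Listing 1 is not reproduced in this document. The snippet below is only a rough
stand-in sketch written against the Lucene 3.x API (package and class names differ in other
Lucene versions); the index contents and the "content" field name are invented for illustration.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class BooleanSearchDemo {
    public static void main(String[] args) throws Exception {
        // Build a tiny in-memory index with a single "content" field.
        RAMDirectory dir = new RAMDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_36, analyzer));
        for (String text : new String[] {
                "Java and Lucene make a search engine",
                "Java without the search library",
                "Lucene indexing internals" }) {
            Document doc = new Document();
            doc.add(new Field("content", text, Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();

        // Parse a Boolean query: documents that contain "Java" but not "Lucene".
        QueryParser parser = new QueryParser(Version.LUCENE_36, "content", analyzer);
        Query query = parser.parse("Java NOT Lucene");

        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("content"));
        }
        searcher.close();
    }
}

The same query can also be assembled programmatically with BooleanQuery and
BooleanClause.Occur flags, which the field-search sketch in the next section uses.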




Field search

Lucene supports field search. You can specify the fields that a query will be executed on. For
example, if your document contains two fields, Title and Content, you can use the query
"Title: Lucene AND Content: Java" to search for documents that contain the term "Lucene"
in the Title field and "Java" in the Content field. Listing 2 shows how to use Lucene's API to
do a field search.
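
Listing 2 is likewise not included here. A minimal sketch of the same idea, assuming Lucene 3.x,
an already-built IndexSearcher, documents with "title" and "content" fields, and terms stored in
lowercase (as StandardAnalyzer produces); all of these names are assumptions for illustration.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class FieldSearchDemo {
    /** Documents whose "title" field contains "lucene" AND whose "content" field contains "java". */
    static TopDocs searchByField(IndexSearcher searcher) throws Exception {
        BooleanQuery query = new BooleanQuery();
        // Terms are lowercased to match what StandardAnalyzer stores at indexing time.
        query.add(new TermQuery(new Term("title", "lucene")), BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("content", "java")), BooleanClause.Occur.MUST);
        return searcher.search(query, 10);
    }
}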

Wildcard search

Lucene supports two wildcard symbols: the question mark (?) and the asterisk (*). You can
use ? to perform a single-character wildcard search, and you can use * to perform a multiple-
character wildcard search. For example, if you want to search for "tiny" or "tony," you can
use the query "t?ny," and if you want to search for "Teach," "Teacher," and "Teaching," you
can use the query "Teach*." Listing 3 demonstrates the process of doing a wildcard search.
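
Listing 3 is not included here either; the same wildcard query can be built programmatically, as
in this hedged sketch (again assuming Lucene 3.x, a "content" field, and an existing IndexSearcher).

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;

public class WildcardSearchDemo {
    /** "t?ny" matches one unknown character ("tiny", "tony"); "teach*" would match any suffix. */
    static TopDocs wildcardSearch(IndexSearcher searcher) throws Exception {
        WildcardQuery query = new WildcardQuery(new Term("content", "t?ny"));
        return searcher.search(query, 10);
    }
}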




Fuzzy search

Lucene provides a fuzzy search that's based on an edit distance algorithm. You can use the
tilde character (~) at the end of a single search word to do a fuzzy search. For example, the
query "think~" searches for the terms similar in spelling to the term "think." Listing 4
features sample code that conducts a fuzzy search with Lucene's API.
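
Listing 4 is also missing from this document; below is a minimal sketch of a programmatic fuzzy
query under the same assumptions (Lucene 3.x, a "content" field, an existing IndexSearcher).

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;

public class FuzzySearchDemo {
    /** Matches terms within a small edit distance of "think", equivalent to the query "think~". */
    static TopDocs fuzzySearch(IndexSearcher searcher) throws Exception {
        FuzzyQuery query = new FuzzyQuery(new Term("content", "think"));
        return searcher.search(query, 10);
    }
}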




Range search

A range search matches the documents whose field values are in a range. For example, the
query "age:[18 TO 35]" returns all of the documents with the value of the "age" field between
18 and 35. Listing 5 shows the process of doing a range search with Lucene's API.
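
Listing 5 is not reproduced here. The sketch below assumes Lucene 3.x, an existing IndexSearcher,
and that the "age" field was indexed as a numeric field; a plain QueryParser range on a string
field would instead compare values lexicographically.

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.TopDocs;

public class RangeSearchDemo {
    /** Equivalent of "age:[18 TO 35]" on a numeric "age" field; both endpoints are inclusive. */
    static TopDocs rangeSearch(IndexSearcher searcher) throws Exception {
        NumericRangeQuery<Integer> query = NumericRangeQuery.newIntRange("age", 18, 35, true, true);
        return searcher.search(query, 10);
    }
}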

2     Searching a small national domain--Preliminary report
           András A. Benczúr Károly Csalogány Dániel Fogaras Eszter Friedman
                          Tamás Sarlós Máté Uher Eszter Windhager
    Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI)
                          11 Lagymanyosi u., H--1111 Budapest, Hungary
       Eötvös University, Budapest, and Budapest University of Technology and Economics
              {benczur,cskaresz,fd,feszter,stamas,umate,hexapoda}@ilab.sztaki.hu
                               http://www.ilab.sztaki.hu/websearch

ABSTRACT

Small languages represent a non-negligible portion of the Web with interest for a large
population with less literacy in English. Existing search engine solutions however vary in
quality mostly because a few of these languages have a particularly complicated syntax that
requires communication between linguistic tools and "classical" Web search techniques. In
this paper we present development stage experiments of a search engine for the .hu or other
similar national domains. Such an engine differs in several design issues from large-scale
engines; as an example we apply efficient crawling and indexing policies that may enable
breaking news search.

Keywords

Web crawling, search engines, database freshness, refresh policies.
1. INTRODUCTION

While search engines face the challenge of billions of Web pages with content in English or
some other major language, small languages such as Hungarian represent a quantity orders of
magnitude smaller. A few of the small languages have a particularly complicated syntax--
most notably, Hungarian is one of the languages that troubles a search engine designer the
most by requiring complex interaction between linguistic tools and ranking methods.

Searching such a small national domain is of commercial interest for a non-negligible
population; existing solutions vary in quality. Despite its importance, the design, experiments
or benchmark tests of small-range national search engine development only rarely appear in
the literature.

A national domain such as the .hu covers a moderate size of not much more than ten million
HTML pages and concentrates most of the national language documents. The size,
concentration and network locality allow an engine to run on an inexpensive architecture
while it may outperform large-scale engines by orders of magnitude in keeping the index up
to date. In our experiments we use a refresh policy that may fit the purpose of a news
agent. Our architecture will hence necessarily differ from those appearing in published
experiments or benchmark tests at several points.

In this paper we present development stage experiments of a search engine for the .hu or
other similar domain that may form the base of a regional or a distributed engine. Peculiar to
search over small domains are our refresh policy and the extensive use of clustering methods.
We revisit documents in a more sophisticated way than in [13]; the procedure is based on our
experiments that show a behavior different from [7] over the Hungarian web. Our method for
detecting domain boundaries improves on [8] by using hyperlink graph clustering to aid
boundary detection. We give a few examples that show the difficulty of Web information
retrieval in agglutinative and multi-language environments; natural language processing in
more detail is however not addressed in this work.

2. ARCHITECTURE

The high-level architecture of a search engine is described in various surveys [2, 4, 13]. Since
we only crawl among a few tens of millions of pages, our engine also contains a mixture of a
news agent and a focused crawler. Similar to the extended crawl model of [5], we include
long-term and short-term crawl managers. The manager determines the long-term schedule
refresh policy based on PageRank [2] information. Our harvester (modified Larbin 2.6.2 [1])
provides short-term schedule and also serves as a seeder by identifying new URLs.
Eventually all new URLs are visited or banned by site-specific rules. We apply a two-level
indexer (BerkeleyDB 3.3.11 [12]) based on [11] that updates a temporary index in frequent
batches; temporary index items are merged into the permanent index at off-peak times and at
certain intervals the entire index is recompiled.

Due to space limitations, a number of implementation issues are described in the full paper,
such as the modifications of the crawler, its interface to the long-term refresh scheduler, and
the indexing architecture and measurements.

Figure 1: Search engine architecture

3. INFORMATION RETRIEVAL IN AN AGGLUTINATIVE AND MULTI-LANGUAGE
ENVIRONMENT

Hungarian is one of the languages that troubles a search engine designer with lost hits or
severe topic drifts when the natural language issues are not properly handled and integrated
into all levels of the engine itself. We found single words with over 300 different word form
occurrences in Web documents (Figure 2); in theory this number may be as large as ten
thousand, while compound rules are nearly as permissive as in German. Word forms of stop
words may or may not be stop words themselves. Technical documents are often filled with
English words and documents may be bi- or multilingual. Stemming is expensive and thus,
unlike in classical IR, we stem in large batches instead of at each word occurrence.

Figure 2: Number of word stems with a given number of word form occurrences in 5 million
Hungarian language documents

Polysemy often confuses search results and makes top hits irrelevant to the query. "Java", the
most frequently cited example of a word with several meanings, in Hungarian also means
"majority of", "its belongings", "its goods", "its best portion", a type of pork, and may also be
incorrectly identified as an agglutination of a frequent abbreviation in mailing lists.
Synonyms "dog" and "hound" ("kutya" and "eb") occur nonexchangeably in compounds such
as "hound breeders" and "dog shows" while one of them also stands for the abbreviation of
European Championship. Or the name of a widely used data mining software "sas" translates
back to English as "eagle" that in addition frequently occurs in Hungarian name compounds.

In order to achieve acceptable precision in Web information retrieval for multilingual
environments or agglutinative languages with a very large number of word forms we need to
consider a large number of mixed linguistic, ranking and IR issues.

   •   We index stems at appropriate levels; for example, in
       (((áll)am)((((ad)ó)s)ság)) = ((stand)state)((((give)tax)debtor)debt)
       we might not even want to stem at all. Too deep stemming causes topic drift; weak
       stemming suffices, since relevant hits are expected to contain the query word several
       times and the stem is likely to occur among its forms.
   •   Efficient phrase search is nontrivial on its own [3, 13]; we also need the original word
       form in the index for this task.
   •   In order to rank the relevance of an index entry we may need to know the syntax of the
       sentence that contains the word, or use document word frequency clustering results.
   •   Document language(s) and possible missing accents (typical in mailing list archives) must
       also be taken into consideration, or else the index term easily changes meaning (beer, for
       example, becomes queue with no accents--"sör" and "sor").
   •   Translations of the stems and forms between Hungarian and English help ranking
       algorithms that use anchor text information.
   •   All of the above issues tend to increase index granularity and the amount of additional
       information stored, and thus the space requirement must be carefully optimized.

4. RANKING AND DOMAIN CLUSTERING

A unique possibility in searching a moderate size domain is the extensive and relatively
inexpensive application of clustering methods. As a key application we determine coherent
domains or multi-part documents. As noticed by Davison [8], hyperlink analysis yields
accurate quality measures only if applied to a higher level link structure of inter-domain
links. The URL text analysis method [8] however often fails for the .hu domain, resulting in
unfair PageRank values for a large number of sites unless our clustering method is used.

5. REFRESH POLICY

Our refresh policy is based on the observed refresh time of the page and its PageRank [2]. We
extend the argument of Cho et al. [6] by weighting freshness measures by functions of the
PageRank pr: given the refresh time refr, we are looking for sync, the synchronization time
function over pages that maximize PageRank times expected freshness. The optimum
solution of the system can be obtained by solving

                         pr * (refr - (sync + refr) exp (-sync / refr)) = u


where we let the Lagrange multiplier u be the maximum value such that the download capacity
constraint is not exceeded. We compute an approximate solution by increasing the refresh rate in
discrete steps for documents with the minimum current u value. The number of equations is
reduced by discretizing PageRank and frequency values.
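
Restated in standard notation, and assuming the usual freshness model in which a page with mean
inter-change time refr changes according to a Poisson process and is re-downloaded every sync
time units, the optimization behind this condition can be sketched as follows; this is a
reconstruction under those assumptions, not the paper's own derivation.

% Reconstruction: the scheduling problem whose Lagrangian stationarity condition is the
% equation above. pr is the page's PageRank and B the total download capacity.
\begin{align*}
\max_{\{\mathrm{sync}_p\}}\;\;
  & \sum_p \mathrm{pr}_p \cdot
    \frac{\mathrm{refr}_p\bigl(1 - e^{-\mathrm{sync}_p/\mathrm{refr}_p}\bigr)}{\mathrm{sync}_p}
  && \text{(PageRank-weighted expected freshness)} \\
\text{subject to}\;\;
  & \sum_p \frac{1}{\mathrm{sync}_p} \le B
  && \text{(download capacity)}
\end{align*}
% Setting the derivative of the Lagrangian with respect to sync_p to zero gives, for every page p,
%   pr_p ( refr_p - (sync_p + refr_p) e^{-sync_p/refr_p} ) = u,
% which is the condition quoted above, with u the multiplier of the capacity constraint.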

We propose another fine tuning for efficiently crawling breaking news. Documents that need
not be revisited every day may be safely scheduled for off-peak hours, thus during daytime
we concentrate on news sites, sites with high PageRank and quick changes. At off-peak hours
however, frequent visits to news portals are of little use and priorities may be modified
accordingly.

6. EXPERIMENTS

We estimate not much more than ten million "interesting" pages of the .hu domain residing at
an approximate 300,000 Web sites. In the current experiment we crawled five million; in
order to obtain a larger collection of valuable documents we need refined site-specific control
over in-site depth and following links that pass arguments.
Among 4.7 million files of size over 30 bytes, depending on its settings the language guesser
finds Hungarian 70-90% of the time and English 27-34% of the time (pages may be
multilingual, or different incorrect guesses may be made at different settings). Outside the .hu
domain we found an additional 280,000 pages, mostly in Hungarian, by simple stop word
heuristics; more refined language guessing turned out to be too expensive for this task.

We conducted a preliminary measurement of the lifetime of HTML documents. Contrary to
what is suggested by [7], our results in Figure 3 do not show any relation between refresh
rates and PageRank. Hence we use the computed optimal schedule based on the observed
refresh rate and the PageRank; in this schedule, documents of a given rank are reloaded more
frequently as their lifetime decreases, up to a certain point where freshness updates are
quickly given up.




Figure 3: Scatterplot of PageRank and average lifetime (in minutes) of 3 million pages of
the .hu domain. In order to amplify low rank pages we show 1/PageRank on the horizontal
axis.

We found that pre-parsed document text as in [10] provides good indication of a content
update. Portals and news sites should in fact be revisited by more sophisticated rules: for
example only a few index.html's need to be recrawled and links must be followed in order to
keep the entire database fresh. Methods for indexing only content and not the automatically
generated navigational and advertisement blocks are given in [9].

7. ACKNOWLEDGMENTS

We thank MorphoLogic Inc. for providing us with their language guesser (LangWitch) and
their Hungarian stemming module (HelyesLem).

Research was supported from NKFP-2/0017/2002 project Data Riddle and various ETIK,
OTKA and AKP grants.

3 Architecture of a Search Engine - Components and Process

 [http://www.beatgoogleusa.com/]


Any internet search engine consists of two major parts - a Back End Database (server side)
and a GUI (client side) to facilitate the user to type the search term.


                                  Basic Search Engine (architecture)


On the server side, the process involves the creation of a database and its periodic updating,
done by a piece of software called a spider. The spider "crawls" the URL pages periodically and
indexes the crawled pages in the database. The hyperlinked nature of the Internet makes it
possible for the spider to traverse the web. The interface between the client and server side
consists of matching the posted query with the entries in the database and retrieving the
matched URLs to the user's machine.
The spider crawls the web pages through the hyperlinks. In this process it extracts the 'title',
'keywords', and any other related information needed for the database from the HTML document.
Sometimes the entire content of the HTML document (except for stop words - very common words
such as "for", "is", etc.) is extracted and indexed in the database. This is based on the idea that a
page dealing with a particular issue will have relevant words throughout its page. Thus indexing all
the words in a document increases the probability of getting the relevant URLs for a query. One
point is worth noting here: before the query words are searched for in the database, their
morphological inflections are removed. The spider is also referred to by other names: "robot",
"crawler", "indexer", etc.

The database consists of a number of tables arranged to aid in quick retrieval of the data. With the
number of sites increasing, it is common for search engines to maintain more than one database
server. When the user queries for term(s), these term(s) are searched for in the database. The sites
in which these term(s) are present are identified. Then these sites are ranked on the basis of the
relevancy they have to the user query. The ranked sites are then displayed, with links to these sites
and a small description taken from the site itself, so as to give the user an idea about the site.
Five key building blocks of a crawling search engine:

   •   CRAWLER (or ROBOT) - a specialised automated program that follows links found on web
       pages and directs the spider where to go next by finding new sites for it to visit. When you
       add your URL to a search engine, it is the crawler you are requesting to visit your site.
   •   SPIDER (or ROBOT) - an automatic browser-like program that downloads documents found
       on the web by the crawler. It works very much as a browser does when it connects to a
       website and downloads pages. Most spiders aren't interested in images, though, and don't
       ask for them to be sent.
   •   INDEXER - a program that "reads" the pages that are downloaded by spiders. This does most
       of the work of deciding what your site is about. The words in the site are "read". Some are
       thrown away, as they are so common ("and", "it", "the", etc.). It will also examine the HTML
       code which makes up your site, looking for other clues as to which words you consider to be
       important. Words in bold, italic or header tags will be given more weight. This is also where
       the meta information (the keywords and description tags) for your site will be analysed.
   •   DATABASE - an index for storage of the pages downloaded and processed. It is where the
       information gathered by the indexer is stored.
   •   RESULTS ENGINE - generates search results out of the database, according to your query.
       This is the most important part of any search engine. The results engine is the customer
       facing (UI) portion of a search engine, and as such is the focus of most optimisation efforts.
       It is the results engine's function to return the pages most relevant to a user's query. When a
       user types in a keyword or phrase, the results engine must decide which pages are most
       likely to be useful to the user. The method it uses to decide that is called its "algorithm". You
       may hear Search Engine Optimisation (SEO) experts discuss "algos" or "breaking the algo"
       for a particular search engine. After all, if you know what the criteria being used are, you can
       write pages to take advantage of them.
Human Factors of XR: Using Human Factors to Design XR Systems
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

Search engine

The distance of a page from the site's root directory may also be a factor in whether or not it gets crawled.

Other methods
A variety of other methods are employed to get a webpage indexed and ranked higher in the results, and often a combination of these methods is used as part of a search engine optimization campaign.

• Cross linking between pages of the same website, and giving more links to the main pages of the website, to increase the PageRank used by search engines. Linking from other websites, including link farming and comment spam.
• Keyword-rich text and key phrases in the webpage, so as to match as many search queries as possible.[7] Adding relevant keywords to a web page's meta tags, up to and including keyword stuffing.
• URL normalization for webpages with multiple URLs, using the "canonical" meta tag[8] (a small normalization sketch follows at the end of this section).
• A backlink from a Web directory.
• SEO trending based on recent search behaviour, using tools like Google Insights for Search.
• Media content creation such as press releases and online newsletters to generate incoming links.

Content Creation and Linking
Content creation is one of the primary focuses of any SEO's job. Without unique, relevant, and easily scannable content, users tend to spend little to no time paying attention to a website. Almost all SEOs that provide organic search improvement focus heavily on creating this type of content, or "linkbait". Linkbait is a term used to describe content that is designed to be shared and replicated virally in an effort to gain backlinks. Often, webmasters and content administrators create blogs to easily provide this information through a method that is intrinsically viral. However, many forget that traffic generated on a hosted blog account doesn't point back to their respective domains, so they lose "link juice". Link juice is jargon for links that provide a boost to PageRank and TrustRank. Moving the blog to a subdomain of the respective domain is a quick way to combat this siphoning of link juice.

Other commonly implemented methodologies for creating and disseminating content include YouTube videos, Google Places accounts, and Picasa and Flickr photos indexed in Google Images searches. These additional forms of content allow webmasters to produce material that ranks well in the world's second most popular search engine, YouTube, in addition to appearing in organic search results.
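The URL normalization bullet above can be made concrete with a small sketch. This is only an illustration using java.net.URI; the sample URLs and the specific normalization rules chosen here (lowercasing the scheme and host, dropping default ports and fragments, resolving dot segments) are assumptions for the example, not rules prescribed by the text.

```java
import java.net.URI;

// Minimal sketch: collapse several equivalent-looking URLs to one canonical form.
// The normalization rules chosen here are illustrative assumptions, not a standard.
public class UrlNormalizer {

    static String normalize(String raw) throws Exception {
        URI uri = new URI(raw).normalize();            // resolves "." and ".." segments
        String scheme = uri.getScheme().toLowerCase();
        String host = uri.getHost().toLowerCase();
        int port = uri.getPort();
        // Drop default ports (80 for http, 443 for https).
        if ((port == 80 && scheme.equals("http")) || (port == 443 && scheme.equals("https"))) {
            port = -1;
        }
        String path = uri.getPath() == null || uri.getPath().isEmpty() ? "/" : uri.getPath();
        // Rebuild without the fragment; keep the query string if present.
        URI canonical = new URI(scheme, null, host, port, path, uri.getQuery(), null);
        return canonical.toString();
    }

    public static void main(String[] args) throws Exception {
        String[] variants = {
            "HTTP://Example.com:80/a/./b/../index.html#top",
            "http://example.com/a/index.html"
        };
        for (String v : variants) {
            System.out.println(v + "  ->  " + normalize(v));
        }
    }
}
```

In practice the chosen canonical form would also be announced to search engines with the "canonical" meta tag mentioned above.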
Gray hat techniques
Gray hat techniques are those that are neither really white hat nor black hat; some of them may be argued either way, and they carry some risk. A very good example of such a technique is purchasing links. The average price for a text link depends on the perceived authority of the linking page. This authority is sometimes measured by Google's PageRank, although that is not necessarily an accurate way of determining the importance of a page. While Google is against the sale and purchase of links, there are people who subscribe to online magazines, memberships and other resources for the purpose of getting a link back to their website.

Another widely used gray hat technique is a webmaster creating multiple 'micro-sites' which he or she controls for the sole purpose of cross linking to the target site. Since all of the micro-sites have the same owner, this self-linking violates the principles behind the search engines' algorithms, but because ownership of sites is not traceable by search engines it is hard to detect, and the micro-sites can appear as different sites, especially when hosted on separate Class-C IP ranges.

In computing, spamdexing (also known as search spam, search engine spam, web spam, or search engine poisoning) is the deliberate manipulation of search engine indexes. It involves a number of methods, such as repeating unrelated phrases, to manipulate the relevance or prominence of resources indexed in a manner inconsistent with the purpose of the indexing system. Some consider it to be a part of search engine optimization, though there are many search engine optimization methods that improve the quality and appearance of the content of web sites and serve content useful to many users.

Search engines use a variety of algorithms to determine relevancy ranking. Some of these include determining whether the search term appears in the META keywords tag, others whether the search term appears in the body text or URL of a web page. Many search engines check for instances of spamdexing and will remove suspect pages from their indexes. Also, people working for a search-engine organization can quickly block the results listing for entire websites that use spamdexing, perhaps alerted by user complaints of false matches. The rise of spamdexing in the mid-1990s made the leading search engines of the time less useful.

Common spamdexing techniques can be classified into two broad classes: content spam (or term spam) and link spam.

Content spam
These techniques involve altering the logical view that a search engine has of the page's contents. They all target variants of the vector space model for information retrieval on text collections.

Keyword stuffing
Keyword stuffing involves the calculated placement of keywords within a page to raise the keyword count, variety, and density of the page. This makes a page appear more relevant to a web crawler, and therefore more likely to be found.
For example, a promoter of a Ponzi scheme wants to attract web surfers to a site where he advertises his scam. He places hidden text appropriate for a fan page of a popular music group on his page, hoping that the page will be listed as a fan site and receive many visits from music lovers. Older versions of indexing programs simply counted how often a keyword appeared and used that to determine relevance levels. Most modern search engines have the ability to analyze a page for keyword stuffing and determine whether the frequency is consistent with sites created specifically to attract search engine traffic. Also, large webpages are truncated, so that massive dictionary lists cannot be indexed on a single webpage.

Hidden or invisible text
Unrelated hidden text is disguised by making it the same color as the background, using a tiny font size, or hiding it within HTML code such as "noframes" sections, alt attributes, zero-sized DIVs, and "noscript" sections. People screening websites for a search-engine company might temporarily or permanently block an entire website for having invisible text on some of its pages. However, hidden text is not always spamdexing: it can also be used to enhance accessibility.

Meta-tag stuffing
This involves repeating keywords in the meta tags and using meta keywords that are unrelated to the site's content. This tactic has been ineffective since 2005.

Doorway pages
"Gateway" or doorway pages are low-quality web pages created with very little content, stuffed instead with very similar keywords and phrases. They are designed to rank highly within the search results but serve no purpose to visitors looking for information. A doorway page will generally have "click here to enter" on the page.

Scraper sites
Scraper sites are created using various programs designed to "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission. Such websites are generally full of advertising (such as pay-per-click ads), or they redirect the user to other sites. It is even feasible for scraper sites to outrank the original websites for their own information and organization names.

Article spinning
Article spinning involves rewriting existing articles, as opposed to merely scraping content from other sites, to avoid penalties imposed by search engines for duplicate content. This process is undertaken by hired writers or automated using a thesaurus database or a neural network.
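The detection idea mentioned under keyword stuffing above, checking whether keyword frequency looks consistent with normal pages, can be illustrated with a toy density check. The 5% threshold, whitespace tokenization, and sample text below are assumptions made purely for the example; real engines combine many more signals.

```java
import java.util.HashMap;
import java.util.Map;

// Toy keyword-density check: flag terms whose share of all tokens exceeds a threshold.
// The 5% threshold and whitespace tokenization are illustrative assumptions only.
public class KeywordDensityCheck {

    public static void main(String[] args) {
        String pageText = "cheap tickets cheap tickets cheap tickets buy cheap tickets "
                        + "online now cheap tickets best cheap tickets deal";
        double threshold = 0.05; // 5% of all tokens

        String[] tokens = pageText.toLowerCase().split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double density = (double) e.getValue() / tokens.length;
            if (density > threshold) {
                System.out.printf("suspicious term '%s': %.1f%% of tokens%n",
                        e.getKey(), density * 100);
            }
        }
    }
}
```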
Link spam
Link spam is defined as links between pages that are present for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more highly ranked websites link to it. These techniques also aim at influencing other link-based ranking techniques such as the HITS algorithm.

Link-building software
A common form of link spam is the use of link-building software to automate the search engine optimization process.

Link farms
Link farms are tightly knit communities of pages referencing each other, also known humorously as mutual admiration societies.

Hidden links
Hidden links are hyperlinks placed where visitors will not see them, in order to increase link popularity. Highlighted link text can help rank a webpage higher for queries matching that phrase.

Sybil attack
A Sybil attack is the forging of multiple identities for malicious intent, named after the famous multiple personality disorder patient "Sybil" (Shirley Ardell Mason). A spammer may create multiple web sites at different domain names that all link to each other, such as fake blogs (known as spam blogs).

Spam blogs
Spam blogs are blogs created solely for commercial promotion and the passage of link authority to target sites. Often these "splogs" are designed in a misleading manner that gives the impression of a legitimate website, but upon close inspection they will often turn out to be written using spinning software, or to be very poorly written and barely readable. They are similar in nature to link farms.

Page hijacking
Page hijacking is achieved by creating a rogue copy of a popular website which shows content similar to the original to a web crawler but redirects web surfers to unrelated or malicious websites.

Buying expired domains
Some link spammers monitor DNS records for domains that will expire soon, then buy them when they expire and replace the pages with links to their own pages. See Domaining. However, Google resets the link data on expired domains.

Some of these techniques may be applied for creating a Google bomb, that is, cooperating with other users to boost the ranking of a particular page for a particular query.
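All of the link-spam techniques above try to game link-based ranking of the PageRank family, in which a page inherits importance from the pages that link to it. As a point of reference, here is a minimal power-iteration sketch of that idea over a tiny invented link graph; the graph, the damping factor of 0.85, and the iteration count are assumptions for the example, not the actual algorithm used by any particular engine.

```java
import java.util.Arrays;

// Minimal PageRank power iteration over a tiny adjacency list.
// The graph, damping factor, and iteration count are illustrative assumptions.
public class TinyPageRank {

    public static void main(String[] args) {
        int[][] outLinks = {
            {1, 2},   // page 0 links to 1 and 2
            {2},      // page 1 links to 2
            {0},      // page 2 links back to 0
            {2}       // page 3 links to 2
        };
        int n = outLinks.length;
        double damping = 0.85;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);

        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);
            for (int page = 0; page < n; page++) {
                double share = damping * rank[page] / outLinks[page].length;
                for (int target : outLinks[page]) {
                    next[target] += share;   // each page passes rank to the pages it links to
                }
            }
            rank = next;
        }
        for (int page = 0; page < n; page++) {
            System.out.printf("page %d: %.4f%n", page, rank[page]);
        }
    }
}
```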
Cookie stuffing
Cookie stuffing involves placing an affiliate tracking cookie on a website visitor's computer without their knowledge, which then generates revenue for the person doing the cookie stuffing. This not only generates fraudulent affiliate sales but also has the potential to overwrite other affiliates' cookies, essentially stealing their legitimately earned commissions.

Using world-writable pages
Main article: forum spam
Web sites that can be edited by users can be used by spamdexers to insert links to spam sites if the appropriate anti-spam measures are not taken. Automated spam bots can rapidly make the user-editable portion of a site unusable. Programmers have developed a variety of automated spam prevention techniques to block or at least slow down spam bots.

Spam in blogs
Spam in blogs is the placing or solicitation of links randomly on other sites, placing a desired keyword into the hyperlinked text of the inbound link. Guest books, forums, blogs, and any site that accepts visitors' comments are particular targets and are often victims of drive-by spamming, where automated software creates nonsense posts with links that are usually irrelevant and unwanted.

Comment spam
Comment spam is a form of link spam that has arisen on web pages that allow dynamic user editing such as wikis, blogs, and guest books. It can be problematic because agents can be written that automatically select a random user-edited web page, such as a Wikipedia article, and add spamming links.

Wiki spam
Wiki spam is a form of link spam on wiki pages. The spammer uses the open editability of wiki systems to place links from the wiki site to the spam site. The subject of the spam site is often unrelated to the wiki page where the link is added. In early 2005, Wikipedia implemented a default "nofollow" value for the "rel" HTML attribute. Links with this attribute are ignored by Google's PageRank algorithm. Forum and wiki admins can use it to discourage wiki spam.
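As a sketch of how a forum or wiki operator might apply the rel="nofollow" countermeasure described above to user-submitted content, the snippet below rewrites every anchor tag in a comment. It assumes the jsoup HTML parser is available on the classpath; the sample comment HTML is invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch: mark every outbound link in a user-submitted comment as rel="nofollow".
// Assumes the jsoup HTML parser is on the classpath; the sample HTML is made up.
public class NofollowRewriter {

    static String addNofollow(String userHtml) {
        Document doc = Jsoup.parseBodyFragment(userHtml);
        for (Element link : doc.select("a[href]")) {
            link.attr("rel", "nofollow");   // tells link-based ranking to ignore this link
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String comment = "Great post! Visit <a href=\"http://spam.example\">my site</a>.";
        System.out.println(addNofollow(comment));
    }
}
```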
Referrer log spamming
Referrer spam takes place when a spam perpetrator or facilitator accesses a web page (the referee) by following a link from another web page (the referrer), so that the referee is given the address of the referrer by the person's Internet browser. Some websites have a referrer log which shows which pages link to that site. By having a robot randomly access many sites enough times, with a message or specific address given as the referrer, that message or Internet address then appears in the referrer logs of those sites that keep them. Since some Web search engines base the importance of sites on the number of different sites linking to them, referrer-log spam may increase the search engine rankings of the spammer's sites. Also, site administrators who notice the referrer log entries in their logs may follow the link back to the spammer's referrer page.

2[http://www.webconfs.com/seo-tutorial/introduction-to-seo.php]

Whenever you enter a query in a search engine and hit 'enter' you get a list of web results that contain that query term. Users normally tend to visit the websites at the top of this list, as they perceive those to be more relevant to the query. If you have ever wondered why some of these websites rank better than others, you should know that it is because of a powerful web marketing technique called Search Engine Optimization (SEO).

SEO is a technique which helps search engines find and rank your site higher than the millions of other sites in response to a search query. SEO thus helps you get traffic from search engines. This SEO tutorial covers the necessary information you need to know about Search Engine Optimization: what it is, how it works, and the differences in the ranking criteria of the major search engines.

1. How Search Engines Work
The first basic truth you need to know to learn SEO is that search engines are not humans. While this might be obvious to everybody, the differences between how humans and search engines view web pages are not. Unlike humans, search engines are text-driven. Although technology advances rapidly, search engines are far from intelligent creatures that can feel the beauty of a cool design or enjoy the sounds and movement in movies. Instead, search engines crawl the Web, looking at particular site items (mainly text) to get an idea of what a site is about. This brief explanation is not the most precise because, as we will see next, search engines perform several activities in order to deliver search results – crawling, indexing, processing, calculating relevancy, and retrieving.
First, search engines crawl the Web to see what is there. This task is performed by a piece of software called a crawler or a spider (or Googlebot, in the case of Google). Spiders follow links from one page to another and index everything they find on their way. Given the number of pages on the Web (over 20 billion), it is impossible for a spider to visit a site daily just to see whether a new page has appeared or an existing page has been modified; crawlers may not end up visiting your site for a month or two.

What you can do is check what a crawler sees of your site. As already mentioned, crawlers are not humans, and they do not see images, Flash movies, JavaScript, frames, password-protected pages or directories, so if you have tons of these on your site you should run a spider simulator to see whether they are viewable by the spider. If they are not viewable, they will not be spidered, indexed, or processed – in a word, they will be non-existent for search engines.

After a page is crawled, the next step is to index its content. The indexed page is stored in a giant database, from where it can later be retrieved. Essentially, the process of indexing is identifying the words and expressions that best describe the page and assigning the page to particular keywords. A human could not process such amounts of information, but search engines generally deal with this task just fine. Sometimes they might not get the meaning of a page right, but if you help them by optimizing it, it will be easier for them to classify your pages correctly and for you to get higher rankings.

When a search request comes, the search engine processes it – that is, it compares the search string in the request with the indexed pages in the database. Since it is likely that more than one page (in practice, millions of pages) contains the search string, the search engine starts calculating the relevancy of each of the pages in its index to the search string. There are various algorithms to calculate relevancy, and each of them assigns different relative weights to common factors like keyword density, links, or meta tags. That is why different search engines give different results pages for the same search string. What is more, all major search engines, such as Yahoo!, Google, and Bing, periodically change their algorithms, and if you want to stay at the top you also need to adapt your pages to the latest changes. This is one reason (the other being your competitors) to devote permanent effort to SEO.

The last step in a search engine's activity is retrieving the results. Basically, it is nothing more than displaying them in the browser – the endless pages of search results, sorted from the most relevant to the least relevant sites.

Indexing
Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is Web indexing.
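To make the indexing and retrieval steps above concrete, here is a minimal in-memory inverted index: each term maps to the set of documents containing it, and a single-term query is answered by a lookup. The three sample "pages" and the whitespace tokenizer are assumptions for the example; a real index also stores positions, weights, and much more.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Minimal inverted index: term -> set of document ids containing the term.
// The three sample "pages" and the whitespace tokenizer are illustrative assumptions.
public class MiniInvertedIndex {

    public static void main(String[] args) {
        String[] docs = {
            "search engine optimization improves visibility",   // doc 0
            "search engines crawl and index the web",           // doc 1
            "crawlers follow links from page to page"            // doc 2
        };

        Map<String, Set<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId].toLowerCase().split("\\s+")) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }

        // "Retrieval": look up which documents contain a query term.
        String query = "search";
        System.out.println("documents containing '" + query + "': "
                + index.getOrDefault(query, new TreeSet<>()));
    }
}
```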
Popular engines focus on the full-text indexing of online, natural language documents. Media types such as video, audio, and graphics are also searchable. Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval because of the time and processing costs required, while agent-based search engines index in real time.

1 Search engine architecture [http://www.ibm.com/]

Architecture overview
The architecture of a common Web search engine contains a front-end process and a back-end process, as shown in Figure 1. In the front-end process, the user enters the search words into the search engine interface, which is usually a Web page with an input box. The application then parses the search request into a form that the search engine can understand, and the search engine executes the search operation on the index files. After ranking, the search engine interface returns the search results to the user. In the back-end process, a spider or robot fetches the Web pages from the Internet, and the indexing subsystem parses the Web pages and stores them into the index files. If you want to use Lucene to build a Web search application, the final architecture will be similar to that shown in Figure 1.

Figure 1. Web search engine architecture
Implement advanced search with Lucene
Lucene supports several kinds of advanced searches, which I'll discuss in this section. I'll then demonstrate how to implement these searches with Lucene's Application Programming Interfaces (APIs).

Boolean operators
Most search engines provide Boolean operators so users can compose queries. Typical Boolean operators are AND, OR, and NOT. Lucene provides five Boolean operators: AND, OR, NOT, plus (+), and minus (-). I'll describe each of these operators.

• OR: If you want to search for documents that contain the words "A" or "B," use the OR operator. Keep in mind that if you don't put any Boolean operator between two search words, the OR operator will be added between them automatically. For example, "Java OR Lucene" and "Java Lucene" both search for the terms "Java" or "Lucene."
• AND: If you want to search for documents that contain more than one word, use the AND operator. For example, "Java AND Lucene" returns all documents that contain both "Java" and "Lucene."
• NOT: Documents that contain the search word immediately after the NOT operator won't be retrieved. For example, if you want to search for documents that contain "Java" but not "Lucene," you may use the query "Java NOT Lucene." You cannot use this operator with only one term. For example, the query "NOT Java" returns no results.
• +: The function of this operator is similar to the AND operator, but it only applies to the word immediately following it. For example, if you want to search documents that must contain "Java" and may contain "Lucene," you can use the query "+Java Lucene."
• -: The function of this operator is the same as the NOT operator. The query "Java -Lucene" returns all of the documents that contain "Java" but not "Lucene."

Now look at how to implement a query with Boolean operators using Lucene's API. Listing 1 shows the process of doing searches with Boolean operators.

Field search
Lucene supports field search. You can specify the fields that a query will be executed on. For example, if your document contains two fields, Title and Content, you can use the query "Title: Lucene AND Content: Java" to search for documents that contain the term "Lucene" in the Title field and "Java" in the Content field. Listing 2 shows how to use Lucene's API to do a field search.
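Listings 1 and 2 are not reproduced in this copy, so here is a hedged sketch in their spirit: it parses Boolean and fielded query strings with Lucene's classic QueryParser and prints the resulting Query objects. It assumes a recent Lucene release (package names and constructors changed between versions) and an index with Title and Content fields; running the queries would then be a call such as IndexSearcher.search(query, 10) against that index.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

// Sketch in the spirit of Listings 1 and 2: Boolean and field queries via QueryParser.
// Assumes a recent Lucene version; older releases take an extra Version argument and
// use a different QueryParser package name.
public class BooleanAndFieldQueries {

    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // "Content" is the default field used when a term has no explicit field prefix.
        QueryParser parser = new QueryParser("Content", analyzer);

        String[] samples = {
            "Java OR Lucene",          // same as "Java Lucene" with the default OR operator
            "Java AND Lucene",
            "Java NOT Lucene",
            "+Java Lucene",
            "Title:Lucene AND Content:Java"   // field search
        };
        for (String s : samples) {
            Query q = parser.parse(s);
            System.out.println(s + "  ->  " + q);
            // Against a real index: TopDocs hits = indexSearcher.search(q, 10);
        }
    }
}
```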
Wildcard search
Lucene supports two wildcard symbols: the question mark (?) and the asterisk (*). You can use ? to perform a single-character wildcard search, and you can use * to perform a multiple-character wildcard search. For example, if you want to search for "tiny" or "tony," you can use the query "t?ny," and if you want to search for "Teach," "Teacher," and "Teaching," you can use the query "Teach*." Listing 3 demonstrates the process of doing a wildcard search.

Fuzzy search
Lucene provides a fuzzy search that's based on an edit distance algorithm. You can use the tilde character (~) at the end of a single search word to do a fuzzy search. For example, the query "think~" searches for terms similar in spelling to the term "think." Listing 4 features sample code that conducts a fuzzy search with Lucene's API.

Range search
A range search matches the documents whose field values fall within a range. For example, the query "age:[18 TO 35]" returns all of the documents with the value of the "age" field between 18 and 35. Listing 5 shows the process of doing a range search with Lucene's API.
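Listings 3, 4, and 5 are likewise missing from this copy; the sketch below covers the same ground by parsing wildcard, fuzzy, and range query strings with the classic QueryParser syntax. The field names, and the assumption that "age" is indexed as a plain string field, are illustrative; numeric ranges would normally use a dedicated numeric field type.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

// Sketch in the spirit of Listings 3-5: wildcard (?, *), fuzzy (~) and range ([x TO y])
// queries expressed in the classic QueryParser syntax. Field names are assumptions.
public class WildcardFuzzyRangeQueries {

    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("Content", new StandardAnalyzer());

        String[] samples = {
            "t?ny",             // single-character wildcard: tiny, tony
            "Teach*",           // multi-character wildcard: Teach, Teacher, Teaching
            "think~",           // fuzzy search based on edit distance
            "age:[18 TO 35]"    // range search on an "age" field indexed as text
        };
        for (String s : samples) {
            Query q = parser.parse(s);
            System.out.println(s + "  ->  " + q.getClass().getSimpleName() + ": " + q);
        }
    }
}
```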
2 Searching a small national domain--Preliminary report
András A. Benczúr, Károly Csalogány, Dániel Fogaras, Eszter Friedman, Tamás Sarlós, Máté Uher, Eszter Windhager
Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI), 11 Lagymanyosi u., H-1111 Budapest, Hungary
Eötvös University, Budapest, and Budapest University of Technology and Economics
{benczur,cskaresz,fd,feszter,stamas,umate,hexapoda}@ilab.sztaki.hu
http://www.ilab.sztaki.hu/websearch

ABSTRACT
Small languages represent a non-negligible portion of the Web, of interest to a large population with less literacy in English. Existing search engine solutions, however, vary in quality, mostly because a few of these languages have a particularly complicated syntax that requires communication between linguistic tools and "classical" Web search techniques. In this paper we present development-stage experiments with a search engine for the .hu or other similar national domains. Such an engine differs in several design issues from large-scale engines; as an example, we apply efficient crawling and indexing policies that may enable breaking-news search.

Keywords
Web crawling, search engines, database freshness, refresh policies.

1. INTRODUCTION
While search engines face the challenge of billions of Web pages with content in English or some other major language, small languages such as Hungarian represent a quantity orders of magnitude smaller. A few of the small languages have a particularly complicated syntax; most notably, Hungarian is one of the languages that troubles a search engine designer the most by requiring complex interaction between linguistic tools and ranking methods. Searching such a small national domain is of commercial interest to a non-negligible population; existing solutions vary in quality. Despite its importance, the design, experiments or benchmark tests of small-range national search engine development only exceptionally appear in the literature.

A national domain such as .hu covers a moderate size of not much more than ten million HTML pages and concentrates most of the national-language documents. The size, concentration and network locality allow an engine to run on an inexpensive architecture while outperforming large-scale engines by orders of magnitude in keeping the index up to date. In our experiments we use a refresh policy that may fit the purposes of a news agent. Our architecture will hence necessarily differ at several points from those appearing in published experiments or benchmark tests.

In this paper we present development-stage experiments with a search engine for the .hu or a similar domain that may form the base of a regional or a distributed engine. Peculiar to search over small domains are our refresh policy and the extensive use of clustering methods. We revisit documents in a more sophisticated way than in [13]; the procedure is based on our experiments, which show a behavior over the Hungarian web different from [7]. Our method for detecting domain boundaries improves on [8] by using hyperlink graph clustering to aid boundary detection. We give a few examples that show the difficulty of Web information retrieval in agglutinative and multi-language environments; natural language processing in more detail is, however, not addressed in this work.

2. ARCHITECTURE
The high-level architecture of a search engine is described in various surveys [2, 4, 13]. Since we only crawl a few tens of millions of pages, our engine also contains a mixture of a news agent and a focused crawler. Similar to the extended crawl model of [5], we include long-term and short-term crawl managers. The manager determines the long-term refresh schedule based on PageRank [2] information. Our harvester (a modified Larbin 2.6.2 [1]) provides the short-term schedule and also serves as a seeder by identifying new URLs. Eventually all new URLs are either visited or banned by site-specific rules. We apply a two-level indexer (BerkeleyDB 3.3.11 [12]) based on [11] that updates a temporary index in frequent batches; temporary index items are merged into the permanent index at off-peak times, and at certain intervals the entire index is recompiled. Due to space limitations, a number of implementation issues are described in the full paper, such as the modifications to the crawler, its interface to the long-term refresh scheduler, and the indexing architecture and measurements.
Figure 1: Search engine architecture

3. INFORMATION RETRIEVAL IN AN AGGLUTINATIVE AND MULTI-LANGUAGE ENVIRONMENT
Hungarian is one of the languages that troubles a search engine designer with lost hits or severe topic drift when the natural language issues are not properly handled and integrated into all levels of the engine itself. We found single words with over 300 different word form occurrences in Web documents (Figure 2); in theory this number may be as large as ten thousand, while compound rules are nearly as permissive as in German. Word forms of stop words may or may not be stop words themselves. Technical documents are often filled with English words, and documents may be bi- or multilingual. Stemming is expensive and thus, unlike in classical IR, we stem in large batches instead of at each word occurrence.
Figure 2: Number of word stems with a given number of word form occurrences in 5 million Hungarian language documents

Polysemy often confuses search results and makes top hits irrelevant to the query. "Java", the most frequently cited example of a word with several meanings, in Hungarian also means "majority of", "its belongings", "its goods", "its best portion", a type of pork, and may also be incorrectly identified as an agglutination of a frequent abbreviation in mailing lists. The synonyms "dog" and "hound" ("kutya" and "eb") occur non-exchangeably in compounds such as "hound breeders" and "dog shows", while one of them also stands for the abbreviation of European Championship. And the name of a widely used data mining software, "sas", translates back to English as "eagle", which in addition frequently occurs in Hungarian name compounds.

In order to achieve acceptable precision in Web information retrieval for multilingual environments or agglutinative languages with a very large number of word forms, we need to consider a large number of mixed linguistic, ranking and IR issues.

• We index stems at appropriate levels; for example, in (((áll)am)((((ad)ó)s)ság)) = ((stand)state)((((give)tax)debtor)debt) we might not even want to stem at all. Too-deep stemming causes topic drift; weak stemming suffices, since relevant hits are expected to contain the query word several times, and among its forms the stem is likely to occur.
Efficient phrase search is nontrivial on its own [3, 13]; we also need the original word form in the index for this task.
• In order to rank the relevance of an index entry we may need to know the syntax of the sentence that contains the word, or use document word frequency clustering results.
• Document language(s) and possible missing accents (typical in mailing list archives) must also be taken into consideration, or else the index term easily changes meaning (beer, for example, becomes queue with no accents: "sör" and "sor").
• Translations of the stems and forms between Hungarian and English help ranking algorithms that use anchor text information.
• All of the above issues tend to increase index granularity and the amount of additional information stored, and thus space requirements must be carefully optimized.

4. RANKING AND DOMAIN CLUSTERING
A unique possibility in searching a moderate-size domain is the extensive and relatively inexpensive application of clustering methods. As a key application we determine coherent domains or multi-part documents. As noticed by Davison [8], hyperlink analysis yields accurate quality measures only if applied to a higher-level link structure of inter-domain links. The URL text analysis method [8], however, often fails for the .hu domain, resulting in unfair PageRank values for a large number of sites unless our clustering method is used.

5. REFRESH POLICY
Our refresh policy is based on the observed refresh time of the page and its PageRank [2]. We extend the argument of Cho et al. [6] by weighting freshness measures by functions of the PageRank pr: given the refresh time refr, we are looking for sync, the synchronization time function over pages that maximizes PageRank times expected freshness. The optimum solution of the system can be obtained by solving

pr * (refr - (sync + refr) * exp(-sync / refr)) = u

where we let the Lagrange multiplier u be maximum such that the download capacity constraint is not exceeded. We compute an approximate solution by increasing the refresh rate in discrete steps for the documents with the minimum current u value. The number of equations is reduced by discretizing the PageRank and frequency values.

We propose another fine-tuning for efficiently crawling breaking news. Documents that need not be revisited every day may be safely scheduled for off-peak hours; thus during daytime we concentrate on news sites and sites with high PageRank and quick changes. At off-peak hours, however, frequent visits to news portals are of little use and priorities may be modified accordingly.
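To make the quantity in the refresh-policy condition above more tangible, the small program below simply evaluates pr * (refr - (sync + refr) * exp(-sync / refr)) for one page at several synchronization intervals, showing how the value grows toward pr * refr as crawling becomes less frequent. The PageRank and refresh-time numbers are illustrative assumptions; only the formula itself comes from the text.

```java
// Numerically evaluate the left-hand side of the condition quoted above,
// u = pr * (refr - (sync + refr) * exp(-sync / refr)),
// for one toy page at several synchronization intervals. The PageRank value and the
// refresh time are illustrative assumptions; the formula itself is taken from the text.
public class RefreshConditionDemo {

    static double u(double pr, double refr, double sync) {
        return pr * (refr - (sync + refr) * Math.exp(-sync / refr));
    }

    public static void main(String[] args) {
        double pr = 0.01;    // assumed PageRank of the page
        double refr = 2.0;   // assumed observed refresh time: the page changes every 2 days

        for (double sync : new double[] {0.5, 1, 2, 4, 8, 16}) {
            System.out.printf("sync = %5.1f days  ->  u = %.6f%n", sync, u(pr, refr, sync));
        }
        // At the optimum, the paper's condition requires this value to be equal (to the
        // Lagrange multiplier) across pages, subject to the download capacity limit.
    }
}
```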
6. EXPERIMENTS
We estimate that not much more than ten million "interesting" pages of the .hu domain reside at approximately 300,000 Web sites. In the current experiment we crawled five million; in order to obtain a larger collection of valuable documents we need refined site-specific control over in-site depth and over following links that pass arguments.

Among 4.7 million files larger than 30 bytes, depending on its settings the language guesser identifies Hungarian 70-90% of the time and English 27-34% of the time (pages may be multilingual, or different incorrect guesses may be made at different settings). Outside .hu we found an additional 280,000 pages, mostly in Hungarian, by simple stop word heuristics; more refined language guessing turned out to be too expensive for this task.

We conducted a preliminary measurement of the lifetime of HTML documents. Contrary to what is suggested by [7], our results in Figure 3 do not confirm any relation between refresh rates and PageRank. Hence we use the computed optimal schedule based on the observed refresh rate and the PageRank; in this schedule, documents of a given rank are reloaded more frequently as their lifetime decreases, until a certain point where freshness updates are quickly given up.

Figure 3: Scatterplot of PageRank and average lifetime (in minutes) of 3 million pages of the .hu domain. In order to amplify low-rank pages we show 1/PageRank on the horizontal axis.

We found that pre-parsed document text as in [10] provides a good indication of a content update. Portals and news sites should in fact be revisited by more sophisticated rules: for example, only a few index.html's need to be recrawled, and links must be followed in order to
keep the entire database fresh. Methods for indexing only the content, and not the automatically generated navigational and advertisement blocks, are given in [9].

7. ACKNOWLEDGMENTS
Thanks to MorphoLogic Inc. for providing us with their language guesser (LangWitch) and their Hungarian stemming module (HelyesLem). Research was supported by the NKFP-2/0017/2002 project Data Riddle and various ETIK, OTKA and AKP grants.

3 Architecture of a Search Engine - Components and Process [http://www.beatgoogleusa.com/]

Any internet search engine consists of two major parts: a back-end database (server side) and a GUI (client side) to facilitate the user typing in the search term.

Basic Search Engine (architecture)

On the server side, the process involves the creation of a database and its periodic updating, done by a piece of software called a spider. The spider "crawls" the URL pages periodically and indexes the crawled pages in the database. The hyperlinked nature of the Internet makes it possible for the spider to traverse the web. The interface between the client and server sides consists of matching the posted query with the entries in the database and returning the matched URLs to the user's machine.

The spider crawls the web pages through the hyperlinks. In this process it extracts the 'title',
'keywords', and any other related information needed for the database from the HTML document. Sometimes the entire content of the HTML document (except for the stop words, very common words such as "for", "is", etc.) is extracted and indexed in the database. This is based on the idea that a page dealing with a particular issue will have relevant words throughout the page, so indexing all the words in a document increases the probability of returning the relevant URLs for a query. One point is worth noting here: the query words are stripped of morphological inflections before they are searched for in the database. The spider is also referred to by other names: "robot", "crawler", "indexer", etc.

The database consists of a number of tables arranged to aid quick retrieval of the data. With the number of sites increasing, it is common for search engines to maintain more than one database server.

When the user queries for one or more terms, those terms are searched in the database. The sites in which the terms are present are identified, and these sites are then ranked on the basis of their relevancy to the user query. The ranked sites are then displayed, with links to the sites and a small description taken from each site itself, so as to give the user an idea of what the site is about.

Five key building blocks of a crawling search engine:

CRAWLER (or ROBOT) - a specialised automated program that follows links found on web pages, and directs the spider where to go next by finding new sites for it to visit. When you add your URL to a search engine, it is the crawler you are requesting to visit your site.

SPIDER (or ROBOT) - an automatic browser-like program that downloads documents found on the web by the crawler. It works very much as a browser does when it connects to a website and downloads pages. Most spiders aren't interested in images, though, and don't ask for them to be sent.

INDEXER - a program that "reads" the pages that are downloaded by spiders. This does most of the work of deciding what your site is about. The words in the site are "read"; some are thrown away because they are so common (and, it, the, etc.). The indexer also examines the HTML code which makes up your site, looking for other clues as to which words you consider to be important: words in bold, italic or header tags will be given more weight. This is also where the meta information (the keywords and description tags) for your site is analysed.

DATABASE - an index for storage of the pages downloaded and processed. It is where the information gathered by the indexer is stored.

RESULTS ENGINE - generates search results out of the database according to your query. This is the most important part of any search engine. The results engine is the customer-facing (UI) portion of a search engine, and as such is the focus of most optimisation efforts. It is the results engine's function to return the pages most relevant to a user's query. When a user types in a keyword or phrase, the results engine must decide which pages are most likely to be useful to the user. The method it uses to decide that is called its "algorithm". You may hear Search Engine Optimisation (SEO) experts discuss "algos" or "breaking the algo" for a particular search engine. After all, if you know what criteria are being used, you can write pages to take advantage of them.
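To tie the five building blocks together, here is a hedged single-file sketch: it fetches one page, extracts its title, indexes its words (minus a few stop words), and collects outgoing links for the crawler's frontier. It assumes the jsoup HTML parser is available and uses an example URL; a real spider would add politeness (robots.txt, crawl delays), deduplication, ranking, and persistent storage.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hedged sketch of the building blocks described above, for a single page:
// spider (download), indexer (read the words), database (in-memory index),
// crawler (collect links to visit next). Assumes jsoup; the URL is only an example.
public class OnePageSearchEngine {

    public static void main(String[] args) throws Exception {
        String url = "https://example.com/";                 // example start URL
        Set<String> stopWords = Set.of("and", "it", "the", "is", "for", "a", "of", "to");

        // SPIDER: download the document.
        Document page = Jsoup.connect(url).get();

        // INDEXER: read the words, drop stop words, record them in the "database".
        Map<String, Integer> termCounts = new HashMap<>();
        for (String word : page.body().text().toLowerCase().split("\\W+")) {
            if (word.isEmpty() || stopWords.contains(word)) continue;
            termCounts.merge(word, 1, Integer::sum);
        }

        // CRAWLER: collect outgoing links for the frontier of pages to visit next.
        List<String> frontier = new ArrayList<>();
        for (Element link : page.select("a[href]")) {
            frontier.add(link.absUrl("href"));
        }

        System.out.println("title: " + page.title());
        System.out.println("indexed terms: " + new TreeSet<>(termCounts.keySet()));
        System.out.println("links found:   " + frontier);
    }
}
```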