2. INTRODUCTION
• Finding useful information on the World Wide Web is something many of us take for
granted. According to the Internet research firm Netcraft, there are nearly 150,000,000
active Web sites on the Internet today.
• Google's algorithm does the work for you by searching out Web pages that contain
the keywords you used to search, then assigning a rank to each page based on several
factors, including how many times the keywords appear on the page. Higher ranked
pages appear further up in Google's search engine results page (SERP), meaning that
the best links relating to your search query are theoretically the first ones Google lists.
• Automated programs called spiders or crawlers travel the Web, moving from link to link
and building up an index page that includes certain keywords. Google references this
index when a user enters a search query. The search engine lists the pages that contain
the same keywords that were in the user's search terms.
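The index lookup described above can be sketched as a small inverted index. This is a minimal illustration, not Google's implementation: the page URLs, the page text, and the function names build_index and search are all made up for the example.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text (made-up data for illustration).
pages = {
    "a.example/earth": "planet earth documentary",
    "b.example/mars": "planet mars rover",
    "c.example/news": "earth day news",
}

def build_index(pages):
    """Map each keyword to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every keyword in the query, sorted."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

index = build_index(pages)
print(search(index, "planet earth"))
```

Only the first page contains both "planet" and "earth", so it is the only result. Ranking the matching pages is a separate step.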
3. • Also like other search engines, Google has a large index of keywords and where those words can be found.
What sets Google apart is how it ranks search results, which in turn determines the order Google displays results
on its search engine results page (SERP). Google uses a trademarked algorithm called PageRank, which assigns
each Web page a relevancy score.
• Keyword placement plays a part in how Google finds sites. Google looks for keywords throughout each Web
page, but some sections are more important than others. Including the keyword in the Web page's title is a
good idea, for example. Google also searches for keywords in headings.
How does Google decide which pages to select and which to leave out?
It does this by asking 200 questions about each page. A few important ones are:
i. How many times does the keyword appear on the page (i.e. the
frequency of the word on the page)?
ii. Do the words appear in the title, the URL, or a meta tag, or directly adjacent to each other?
iii. Does the page include synonyms of the keywords?
iv. Is the page from a high-quality or a low-quality website?
v. What is the page's PageRank?
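As a rough sketch of the first two questions (frequency and placement), the snippet below scores a page for one keyword. The weights, the field names, and the function keyword_score are assumptions made for illustration; Google's actual signals and weights are not given in this deck.

```python
# Hypothetical weights: these numbers are illustrative, not Google's.
WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}

def keyword_score(page, keyword):
    """Score one page for one keyword from where and how often it appears."""
    kw = keyword.lower()
    score = 0.0
    for field, weight in WEIGHTS.items():
        count = page.get(field, "").lower().split().count(kw)
        score += weight * count
    return score

# Made-up page for illustration.
page = {
    "title": "Planet Earth",
    "heading": "About Earth",
    "body": "earth is the third planet from the sun and earth has one moon",
}
print(keyword_score(page, "earth"))
```

A keyword in the title counts for more than the same keyword in the body, which mirrors the idea that some sections of a page matter more than others.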
4. PAGERANKING ALGORITHM
• Google’s PageRank algorithm has become one of the most famous in
computer science. It was originally designed to rank websites according
to their importance by assuming that a site is important if it is linked to by
other important sites. It follows a real-life philosophy:
“A product or an individual becomes popular when people other
than that individual know about the individual or product.”
In the same way, a page's rank rises when other web pages
link to it.
• The algorithm works by counting the links to a website and the
importance of the sites these come from. It then uses this to work out the
importance of the original site. Through a process of iteration, the
algorithm comes up with a ranking.
5. • PageRank assigns a rank or score to every search result. The higher the page's
score, the further up the search results list it will appear.
• Scores are partially determined by the number of other Web pages that link to
the target page. Each link is counted as a vote for the target. The logic behind
this is that pages with high-quality content will be linked to more often than
mediocre pages.
• Not all votes are equal. Votes from a high-ranking Web page count more than
votes from low-ranking sites. You can't really boost one Web page's rank by
making a bunch of empty Web sites linking back to the target page.
• The more links a Web page sends out, the more diluted its voting power
becomes. In other words, if a high-ranking page links to hundreds of other pages,
each individual vote won't count as much as it would if the page only linked to a
few sites.
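The dilution effect can be sketched with a little arithmetic. It assumes the vote carried by each link is the page's score divided by its number of outgoing links, which is the PR(T)/C(T) term of the PageRank formula. The scores below are made up, and vote_value is a name chosen for this example.

```python
def vote_value(page_rank, outgoing_links):
    """Share of a page's PageRank passed along each outgoing link."""
    return page_rank / outgoing_links

# Made-up scores for illustration.
print(vote_value(8.0, 4))    # high-ranking page, few links
print(vote_value(8.0, 400))  # same page, hundreds of links
print(vote_value(0.5, 4))    # low-ranking page, few links
```

With hundreds of outgoing links, each vote from the high-ranking page ends up worth less than a vote from the low-ranking page that links to only a few sites.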
• Other factors that might affect scoring include how long the site has been
around, the strength of the domain name, how and where the keywords appear
on the site and the age of the links going to and from the site. Google tends to
place more value on sites that have been around for a while.
6. A Web page's PageRank depends on a few factors:
• The frequency and location of keywords within the Web page: If the
keyword only appears once within the body of a page, it will receive
a low score for that keyword.
• How long the Web page has existed: People create new Web pages
every day, and not all of them stick around for long. Google places
more value on pages with an established history.
• The number of other Web pages that link to the page in question:
Google looks at how many Web pages link to a particular site to
determine its relevance.
7. • Out of these three factors, the third
is the most important. It's easier to
understand it with an example.
• Let's look at a search for the terms
"Planet Earth."
• As more Web pages link to
Discovery's Planet Earth page, the
Discovery page's rank increases.
When Discovery's page ranks higher
than other pages, it shows up at the
top of the Google search results
page.
8. PageRank description
We assume page A has pages T1...Tn which point to it.
The parameter d is a damping factor which can be set between 0 and 1. We usually set d
to 0.85.
The PageRank theory holds that an imaginary surfer who is randomly clicking on links will
eventually stop clicking.
The probability, at any step, that the person will continue is a damping factor d.
Various studies have tested different damping factors, but it is generally assumed that the
damping factor will be set around 0.85.
Also C(A) is defined as the number of links going out of page A.
The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
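The Google paper states that the PageRanks form a probability distribution over web pages,
“so the sum of all web pages' PageRanks will be one”. With the formula exactly as written
above, and every page having at least one outgoing link, the values instead average 1.0,
so they sum to the number of pages.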
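The formula can be written as a short function for a single page. This is a minimal sketch: the function name page_rank_update and the example values for T1 and T2 are assumptions made for illustration.

```python
def page_rank_update(inbound, d=0.85):
    """Apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)) for one page A.

    inbound: list of (PR(T), C(T)) pairs, one per page T that links to A.
    """
    return (1 - d) + d * sum(pr / c for pr, c in inbound)

# Made-up example: A is linked to by T1 (PR 1.0, 2 outgoing links)
# and by T2 (PR 0.5, 1 outgoing link).
print(page_rank_update([(1.0, 2), (0.5, 1)]))
```

Each inbound page contributes its own PageRank divided by its number of outgoing links, and the damping factor d scales the total.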
9. How is PageRank Calculated?
• The PR of each page depends on the PR of the pages pointing to it. But
we won’t know what PR those pages have until the pages pointing
to them have their PR calculated and so on… And when you consider that
page links can form circles it seems impossible to do this calculation!
• The Google paper says:
PageRank or PR(A) can be calculated using a simple iterative algorithm,
and corresponds to the principal eigenvector of the normalized link matrix of
the web.
What that means to us is that we can just go ahead and calculate a page’s
PR without knowing the final value of the PR of the other pages. That seems
strange but, basically, each time we run the calculation we’re getting a
closer estimate of the final value. So all we need to do is remember
each value we calculate and repeat the calculations many times until the
numbers stop changing much.
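The repeat-until-stable idea can be sketched as follows. This is an illustration under stated assumptions, not Google's implementation: the function name pagerank, the tolerance tol, the iteration cap max_iter, and the three-page graph are all chosen for the example, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, tol=1e-6, max_iter=1000):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) until values settle.

    links: dict mapping each page to the list of pages it links to.
    Assumes every page appears as a key and has at least one outgoing link.
    """
    pr = {page: 1.0 for page in links}  # initial guess for every page
    for _ in range(max_iter):
        new_pr = {}
        for page in links:
            # Contribution from every page t that links to this page.
            incoming = (pr[t] / len(links[t]) for t in links if page in links[t])
            new_pr[page] = (1 - d) + d * sum(incoming)
        change = max(abs(new_pr[p] - pr[p]) for p in links)
        pr = new_pr
        if change < tol:  # the numbers have stopped changing much
            break
    return pr

# Hypothetical three-page site: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(ranks)
```

This version updates all pages from the previous round's values at once. Reusing each new value as soon as it is computed also works and settles on the same numbers.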
10. Let's take the simplest example network: two pages, A and B, each pointing to the
other.
Each page has one outgoing link (the outgoing count is 1, i.e. C(A) = 1 and
C(B) = 1).
1. GUESS 1 (d = 0.85)
We don’t know what their PR should be to begin with, so let’s take a guess at 1.0 and do some calculations:
PR(A) = (1 – d) + d(PR(B)/1)
PR(B) = (1 – d) + d(PR(A)/1)
i.e.
PR(A) = 0.15 + 0.85 * 1
= 1
PR(B) = 0.15 + 0.85 * 1
= 1
The numbers do not change, so a guess of 1.0 is already a stable answer.
11. 2. GUESS 2
Ok, let’s start the guess at 0 instead and re-calculate:
PR(A) = 0.15 + 0.85 * 0
= 0.15
PR(B) = 0.15 + 0.85 * 0.15
= 0.2775
And again:
PR(A) = 0.15 + 0.85 * 0.2775
= 0.385875
PR(B) = 0.15 + 0.85 * 0.385875
= 0.47799375
And again:
PR(A) = 0.15 + 0.85 * 0.47799375
= 0.5562946875
PR(B) = 0.15 + 0.85 * 0.5562946875
= 0.622850484375
and so on. The numbers just keep going up. But will the numbers stop increasing when they get to 1.0? What if a calculation
over-shoots and goes above 1.0?
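One way to answer this, sketched for the two-page case only: once the values settle, both pages hold the same value, which must satisfy the update rule. The symbols x (the settled value) and x_k (the k-th value computed in the sequence above) are introduced here for illustration.

```latex
\begin{align*}
x &= 0.15 + 0.85\,x \\
0.15\,x &= 0.15 \\
x &= 1 \\[6pt]
x_{k+1} &= 0.15 + 0.85\,x_k \\
1 - x_{k+1} &= 0.85\,(1 - x_k)
\end{align*}
```

The last line shows that the gap to 1.0 shrinks by 15% at every step and never changes sign, so a sequence that starts below 1.0 approaches 1.0 without over-shooting.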
12. 3. GUESS 3
Let’s start the guess at 40 each and do a few cycles:
PR(A) = 40
PR(B) = 40
First calculation
PR(A) = 0.15 + 0.85 * 40
= 34.15
PR(B) = 0.15 + 0.85 * 34.15
= 29.1775
And again
PR(A) = 0.15 + 0.85 * 29.1775
= 24.950875
PR(B) = 0.15 + 0.85 * 24.950875
= 21.35824375
This time the numbers are heading down, again toward 1.0.
• Principle: it doesn’t matter where you start your guess. Once the PageRank calculations
have settled down, the “normalized probability distribution” (the average PageRank for
all pages) will be 1.0.
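The three guesses can be checked with a short loop. This is a minimal sketch of the two-page example, reusing each new value immediately as the hand calculations do. The function name two_page_iteration and the cycle count are assumptions made for illustration.

```python
def two_page_iteration(start, d=0.85, cycles=100):
    """Repeat the two-page calculation, reusing each new value immediately."""
    pr_a = pr_b = start
    for _ in range(cycles):
        pr_a = (1 - d) + d * pr_b  # C(B) = 1
        pr_b = (1 - d) + d * pr_a  # C(A) = 1
    return pr_a, pr_b

for guess in (0.0, 1.0, 40.0):
    print(guess, two_page_iteration(guess))
```

Whatever the starting guess, both values end up very close to 1.0 after enough cycles.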
13. For a page D with no backlinks, the equation looks like this:
PR(D) = (1-d) + d * (0)
= 0.15
no matter what else is going on or how many times you do it.
Observation: every page has at least a PR of 0.15 to share out.
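A small sketch of this observation follows, using a hypothetical three-page graph in which nothing links to D. The graph and the number of rounds are assumptions made for illustration.

```python
d = 0.85
# Hypothetical graph: D -> A, A -> B, B -> A. Nothing links to D.
links = {"A": ["B"], "B": ["A"], "D": ["A"]}
pr = {page: 1.0 for page in links}  # initial guess

for _ in range(50):
    # Recompute every page from the previous round's values.
    pr = {
        page: (1 - d) + d * sum(pr[t] / len(links[t]) for t in links if page in links[t])
        for page in links
    }

print(pr["D"])
```

D stays at about 0.15 in every round, while A and B settle at higher values because they receive links.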
14. • Our home page has 2 and a
half times as much PR as the
child pages! Excellent!
• This is what we’d expect. All
the pages have the same
number of incoming links, all
pages are of equal
importance to each other, all
pages get the same PR of 1.0
(i.e. the “average”
probability).
15. EXAMPLES
• Because Google looks at links to a Web page as a vote, it's not easy to cheat the system. The best way to make sure
your Web page is high up on Google's search results is to provide great content so that people will link back to your
page. The more links your page gets, the higher its PageRank score will be. If you attract the attention of sites with a
high PageRank score, your score will grow faster.
• Mega-sites, like http://news.bbc.co.uk have tens or hundreds of editors writing new content – i.e. new pages - all day
long! Each one of those pages has rich, worthwhile content of its own and a link back to its parent or the home page!
That’s why the Home page Toolbar PR of these sites is 9/10 and the rest of us just get pushed lower and lower by
comparison…
• Principle: Content Is King! There really is no substitute for lots of good content…
16. Steps to enhance your PAGERANK
1. Give visitors the information they're looking for
• Provide high-quality content on your pages, especially your homepage. This is the single most
important thing to do. If your pages contain useful information, their content will attract many
visitors and entice webmasters to link to your site. Think about the words users would type to
find your pages and include those words on your site.
2. Make sure that other sites link to yours
• Links help Google's crawlers find your site and can give your site greater visibility in Google's search results.
When returning results for a search, Google uses sophisticated text-matching techniques to
display pages that are both important and relevant to each search. Google interprets a link
from page A to page B as a vote by page A for page B.
3. Make your site easily accessible
• Build your site with a logical link structure. Every page should be reachable from at least one
static text link.
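One way to check the "every page should be reachable" rule is a breadth-first walk over the site's links, starting from the home page. This is a minimal sketch: the site map and the function name unreachable_pages are made up for illustration.

```python
from collections import deque

def unreachable_pages(links, home):
    """Return pages that cannot be reached by following links from home."""
    seen = {home}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(set(links) - seen)

# Hypothetical site map: no page links to "orphan".
site = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home"],
    "orphan": ["home"],
}
print(unreachable_pages(site, "home"))
```

Any page the function returns has no link path from the home page, so a crawler that follows links would not find it.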