Introduction
• An algorithm for ranking web pages based on their importance
• Developed by Lawrence Page and Sergey Brin (founders of Google)
• Used by Google to rank search results
• Models how likely each web page is to be visited by a random web surfer
• It is an iterative graph processing algorithm
Ranking Web Pages
• Web pages are not equally “important”
• www.amazon.com
• www.my-personal-weblog.com
• It is more likely that amazon.com is visited than the other web page
• So it is more important (it has more weight)
• WHY?
Ranking Web Pages
• Inbound links count
• The more inbound links a page has, the more important (likely to
be visited) it becomes
• Imagine two web pages
• Page “A” (2 inbound links)
• Page “B” (10 inbound links)
• Which page is more important?
• Page “B”
Ranking Web Pages
• Now suppose this condition
• Page “A” (2 inbound links)
• amazon.com
• facebook.com
• Page “B” (10 inbound links)
• my-personal-weblog1.com
• …
• my-personal-weblog10.com
• Now which page carries more weight?
Ranking Web Pages
• Inbound links count
• But not all inbound links are equal
• So “importance” (PageRank) of page “P” depends on
• the “importance” (PageRank) of the pages that link to page “P” (not merely on the
number of pages that link to page “P”)
Simple Recursive Formula
• Each link’s weight is proportional to the importance of its source
page
• If page “P” with importance “x” has “n” outbound links, each link
gets “x/n” weight
• Page “P”’s own importance is the sum of the weights on its inbound
links (see the equation sketch below)
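Written as an equation, a minimal sketch of this recursive definition (before damping, which is introduced later; r(P) denotes the importance of page P and C(Q) the number of outbound links of page Q):

```latex
% Simple recursive PageRank, no damping yet: page P's importance is the sum
% of the weights arriving over its inbound links, where each page Q linking
% to P contributes its own importance r(Q) divided by its link count C(Q).
r(P) = \sum_{Q \to P} \frac{r(Q)}{C(Q)}
```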
The Random Surfer Model
• Consider PageRank as a model of user behavior
• Where a surfer clicks on links at random, with no regard for content
• The random surfer visits a web page with a certain probability which
derives from the page's PageRank
• The probability that the random surfer clicks on one link is solely
given by the number of links on that page
• This is why one page's PageRank is not completely passed on to a
page it links to, but is divided by the number of links on the page
The Random Surfer Model
• So, the probability for the random surfer reaching one page is the
sum of probabilities for the random surfer following links to this
page
• The surfer does not click on an infinite number of links, but gets
bored sometimes and jumps to another page at random
• The probability that the random surfer keeps clicking on links is
given by the “damping factor” d (set between 0 and 1)
• The “damping factor” is usually set to 0.85
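This model can be tried out directly. Below is a minimal random-surfer simulation in Python; the three-page graph is borrowed from the example later in the slides (A → {B, C}, B → {C}, C → {A}), and the iteration count is an arbitrary assumption:

```python
# Simulate a random surfer: with probability d follow a random outbound link,
# otherwise get bored and jump to a random page. Visit frequencies then
# approximate the (normalized) PageRank values.
import random

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # assumed example graph
d = 0.85                       # damping factor: probability of continuing to click

visits = {page: 0 for page in links}
page = "A"
for _ in range(1_000_000):
    visits[page] += 1
    if random.random() < d:    # keep clicking: follow a random link on the page
        page = random.choice(links[page])
    else:                      # get bored: jump to a page chosen at random
        page = random.choice(list(links))

total = sum(visits.values())
print({p: round(v / total, 3) for p, v in visits.items()})
```

On this graph the frequencies should come out near 0.39 for A, 0.21 for B, and 0.40 for C, matching the normalized PageRank values computed iteratively later in the slides.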
The Final Formula
• PR(A) = (1 - d)/N + d · (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
• PR(A) is the PageRank of page A
• PR(Ti) is the PageRank of page Ti, one of the pages T1 … Tn that link to page A
• C(Ti) is the number of outbound links on page Ti
• N is the number of web pages
• d is a damping factor which can be set between 0 and 1
Example
• To keep the calculation simple, we set the damping factor
to 0.5 and ignore the number of nodes (using 1 - d in place of (1 - d)/N)
• The example graph is A → {B, C}, B → {C}, C → {A} (the same graph used in the MapReduce section below)
• PR(A) = (1 - 0.5) + 0.5 · (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
• PR(A) = 0.5 + 0.5 · PR(C) = 1.07692308
• PR(B) = 0.5 + 0.5 · (PR(A) / 2) = 0.76923077
• PR(C) = 0.5 + 0.5 · (PR(A) / 2 + PR(B)) = 1.15384615
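As a sanity check, a small Python sketch that iterates these three equations to a fixed point (the graph and the simplified formula are taken from the setup above; the iteration count is an assumption):

```python
# Verify the worked example: d = 0.5, the 1/N term ignored, so
# PR(p) = 0.5 + 0.5 * (sum of PR(q)/C(q) over pages q linking to p).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
d = 0.5
pr = {page: 1.0 for page in links}               # any starting value converges here

for _ in range(50):                              # plenty of iterations to converge
    incoming = {page: 0.0 for page in links}
    for src, outs in links.items():
        for dst in outs:
            incoming[dst] += pr[src] / len(outs) # each link carries PR(src)/C(src)
    pr = {page: (1 - d) + d * incoming[page] for page in links}

print({p: round(v, 8) for p, v in pr.items()})
# {'A': 1.07692308, 'B': 0.76923077, 'C': 1.15384615}
```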
The Iterative Computation of PageRank
• In practice, the web consists of billions of pages, so it is not feasible
to solve the resulting system of equations directly
• The Google search engine therefore uses an approximate, iterative computation
of PageRank
• Each page is assigned an initial starting value (usually 1/N, where N is the
number of nodes), and the PageRanks of all pages are then calculated in several
computation cycles based on the equations determined by the PageRank algorithm
(see the iteration sketch below)
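A hedged sketch of this iterative scheme, again on the three-page example graph, now with the full (1 - d)/N term, d = 0.85, the 1/N starting value, and an assumed convergence tolerance:

```python
# Power-iteration-style PageRank: start every page at 1/N and repeat the
# update until no rank changes by more than a small tolerance.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # assumed example graph
d = 0.85
N = len(links)
tol = 1e-8                                         # convergence threshold (assumption)

pr = {page: 1 / N for page in links}               # initial value: 1 / # of nodes
for cycle in range(1000):
    incoming = {page: 0.0 for page in links}
    for src, outs in links.items():
        for dst in outs:
            incoming[dst] += pr[src] / len(outs)   # PR(src)/C(src) per link
    new_pr = {page: (1 - d) / N + d * incoming[page] for page in links}
    done = max(abs(new_pr[p] - pr[p]) for p in links) < tol
    pr = new_pr
    if done:
        break                                      # ranks have stabilized

print(cycle, {p: round(v, 5) for p, v in pr.items()})
```

Here the ranks settle at roughly A ≈ 0.388, B ≈ 0.215, C ≈ 0.397 (summing to 1), in line with the random-surfer simulation earlier.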
Implementing PageRank Using MapReduce
• Multiple stages of mappers and reducers are needed
• The output of the reducers is fed into the next-stage mappers
• The initial input data for the previous example will be
organized as
A B C
B C
C A
• In each row
• The first column contains the node itself
• The remaining columns are the nodes that the node has an outbound link to
Implementing PageRank Using MapReduce
• The initial PageRank values are calculated (1 / # of nodes = 1/3) and added to
the file (see the initialization sketch after this list)
A 1/3 B C
B 1/3 C
C 1/3 A
• In each row
• The first column contains the node itself
• The remaining columns are the nodes that the node has an outbound link to
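A small sketch of this initialization step (file names are assumptions; ranks are written as decimals such as 0.3333 rather than the fraction 1/3):

```python
# Read the adjacency file ("A B C" per row) and insert the starting
# PageRank 1/N after each node, producing the iteration-0 input file.
with open("graph.txt") as f:
    rows = [line.split() for line in f if line.strip()]

n = len(rows)  # number of nodes, assuming one row per node as in the slides
with open("pagerank_iter0.txt", "w") as out:
    for node, *outs in rows:
        out.write(" ".join([node, f"{1 / n:.4f}", *outs]) + "\n")
```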
Implementing PageRank Using MapReduce
• Mappers receive values as follows
• (y, PR(y) x1 x2 … xn)
• And emit the following values for each row (a mapper sketch follows this list)
• (y, PR(y) x1 x2 … xn) (the graph structure, passed through unchanged)
• (xi, PR(y)/C(y)) for i = 1 … n (one rank contribution per outbound link,
where C(y) = n is the number of outbound links of y)
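A Hadoop-Streaming-style sketch of such a mapper in Python (the tab-separated LINKS/RANK record tags and the stdin/stdout conventions are illustrative assumptions, and ranks are assumed stored as decimals):

```python
#!/usr/bin/env python3
# Mapper for one PageRank iteration. Each input row is
# "node rank out1 out2 ...", e.g. "A 0.3333 B C".
import sys

for line in sys.stdin:
    parts = line.split()
    if not parts:
        continue
    node, rank, outs = parts[0], float(parts[1]), parts[2:]
    # Pass the graph structure through so the reducer can rebuild the row.
    print(f"{node}\tLINKS\t{' '.join(outs)}")
    # Each outbound link carries PR(node) / C(node) to its target.
    for target in outs:
        print(f"{target}\tRANK\t{rank / len(outs)}")
```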
Implementing PageRank Using MapReduce
• Reducers receive the values from the mappers and use the PageRank
formula to aggregate them and calculate the new PageRank values (a reducer
sketch follows this list)
• A new input file for the next phase is created
• The differences between the new PageRanks and the old PageRanks are
compared against a convergence threshold to decide whether another iteration is needed
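A matching reducer sketch under the same assumptions; Hadoop Streaming delivers mapper output sorted by key, so all records for one node arrive together. The values d = 0.85 and N = 3 match the worked example (as the next-phase file below confirms):

```python
#!/usr/bin/env python3
# Reducer: sum the inbound contributions per node and apply
# PR = (1 - d)/N + d * sum, then emit the next iteration's input row.
import sys
from itertools import groupby

D, N = 0.85, 3  # damping factor and node count assumed for the example

def records(stream):
    for line in stream:
        node, tag, value = line.rstrip("\n").split("\t")
        yield node, tag, value

for node, group in groupby(records(sys.stdin), key=lambda rec: rec[0]):
    total, outs = 0.0, ""
    for _, tag, value in group:
        if tag == "LINKS":
            outs = value           # graph structure passed through by the mapper
        else:
            total += float(value)  # one PR(y)/C(y) contribution per inbound link
    new_rank = (1 - D) / N + D * total
    print(f"{node} {new_rank:.4f} {outs}")
```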
Implementing PageRank Using MapReduce
• Mappers in our example
• A 1/3 B C => (A, 1/3 B C), (B, 1/6), (C, 1/6)
• B 1/3 C => (B, 1/3 C), (C, 1/3)
• C 1/3 A => (C, 1/3 A), (A, 1/3)
Implementing PageRank Using MapReduce
• Reducers in our example (received pairs on the left; on the right, the summed
contributions together with the preserved link list, to which the formula is then applied)
• (A, 1/3 B C), (A, 1/3) => (A, 1/3 B C)
• (B, 1/3 C), (B, 1/6) => (B, 1/6 C)
• (C, 1/3 A), (C, 1/6), (C, 1/3) => (C, 1/6+1/3 A)
Implementing PageRank Using MapReduce
• Applying PR = (1 - d)/N + d · (sum of contributions) with d = 0.85 and N = 3
(reproduced in the sketch below), the new input file for the mappers in the
next phase will be
• A 0.3333 B C
B 0.1917 C
C 0.4750 A
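Putting the pieces together, a minimal local simulation of one full map-shuffle-reduce phase (plain Python dictionaries stand in for Hadoop's shuffle), which reproduces the file above under the d = 0.85, N = 3 assumption:

```python
# One PageRank phase: map emits contributions, the dict-based "shuffle"
# groups them by target, and reduce applies the PageRank formula.
from collections import defaultdict

d = 0.85
rows = {"A": (1 / 3, ["B", "C"]), "B": (1 / 3, ["C"]), "C": (1 / 3, ["A"])}
N = len(rows)

# Map + shuffle: pass structure through, group PR(y)/C(y) shares by target.
structure, contribs = {}, defaultdict(list)
for node, (rank, outs) in rows.items():
    structure[node] = outs
    for target in outs:
        contribs[target].append(rank / len(outs))

# Reduce: new rank per node, emitted as the next phase's input rows.
for node in rows:
    new_rank = (1 - d) / N + d * sum(contribs[node])
    print(node, f"{new_rank:.4f}", " ".join(structure[node]))
# A 0.3333 B C
# B 0.1917 C
# C 0.4750 A
```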