The Google Pagerank algorithm - How does it work?

QAB Term 1

Markov Chains and Google Inc.

GUSTAVO ARGUELLO
KUNDAN BHADURI
VERITY NOBLE

IMBA NOV 2010 N1

IE BUSINESS SCHOOL
MARIA DE MOLINA 11
MADRID 28002 SPAIN

QAB Term 1 Project: Markov Chains and Google Inc.

Table of Contents

Implementing Markov Chains with Google PageRank
Issues to be addressed
Techniques that may be used to overcome the problem of solving such a large system
Exhibit 1: A sample 4-state Markov chain with transition probabilities
Exhibit 2: Sample 4X4 transition Matrix
Exhibit 3: Explaining the basis of Markov’s chain
Exhibit 4: Demonstrating the stable state values using simple matrix multiplication
Exhibit 5: Calculating the steady state eigen values πA and πE
Exhibit 6: The improved Google PageRank algorithm
Exhibit 7: PageRank of the search string ‘Techbend blog’
Exhibit 8: The correlation between a webpage and the rest of the web
Exhibit 9: KundanBhaduri.com and its links to other sites
Exhibit 10: Applying Markov Chain method to calculate the PageRank for ‘TechBend blog’
Exhibit 11: Computing a small Eigen value with Power Method



Implementing Markov Chains with Google PageRank

In its most basic form, a homogeneous Markov chain (Exhibit 1) is a sequence of events or actions in which each
transition from one state to the next is memoryless: it depends only on the current state and not on how that state
was reached. More formally, a Markov chain is a collection of random variables {X_t} with the property that, given the
current state, the future is conditionally independent of the past.[1] The probabilities of moving between states are
collected in a square matrix known as the transition matrix. We can therefore treat a problem as solvable by the
theory of Markov chains if it has the following characteristics:

a) At any point in time, each object is in exactly one defined state. At the end of each period, the object can
move to a new state or remain in its original state.[2]
b) Objects move between states according to transition probabilities (Exhibit 2) that depend only on the
current state. The probabilities of moving from a given state to all possible states sum to one.
c) The transition probabilities (for example, of going from A to B) remain constant over time.

To see how a Markov chain is solved, assume that the simple 2-state chain in Exhibit 3 describes a simple website.
From the homepage (E), a user clicks a link that leads her to page (A) 70% of the time, while the remaining 30% of the
time she clicks a link that keeps her on the same page (E). Similarly, once the user is on page (A), 40% of the time she
clicks a link back to (E), and the remaining 60% of the time she clicks a link that keeps her on page (A). The Markov
chain lets us find the probability that a random user is on either page after any number of iterations of this chain.
The website administrator might use this information to decide which page to focus on in order to maximise ad
revenue. Please note that Google’s implementation of the Markov chain is that of a non-absorbing Markov chain.

To solve this problem, we start by using the tree method to calculate the 2nd-level probability Pij(2), i.e. the
probability of going from node i to node j in exactly two iterations, where i and j each stand for E or A, as shown in
Exhibit 4. Here we observe that the probability of being on page A after two iterations is 63% if the user started at E
and 64% if she started at A. If we continue in this way for up to 7 iterations, we find that the probability values have
reached a steady state (to six decimal places) and no longer change.
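The repeated multiplication described above can be sketched in a few lines of Python. This is only an illustration, assuming the rows and columns of P are ordered (E, A); the helper names `mat_mul` and `mat_power` are our own and are not part of any library.

```python
# Transition matrix of the two-page site; rows and columns are ordered (E, A).
P = [[0.3, 0.7],
     [0.4, 0.6]]

def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_power(M, steps):
    """Return M multiplied by itself until it is raised to `steps` (steps >= 1)."""
    result = M
    for _ in range(steps - 1):
        result = mat_mul(result, M)
    return result

P2 = mat_power(P, 2)   # two-step transition probabilities
P7 = mat_power(P, 7)   # by now both rows agree to six decimal places

for name, M in (("P^2", P2), ("P^7", P7)):
    print(name, [[round(x, 6) for x in row] for row in M])
```

The second column of P2 holds the 63% and 64% figures quoted above, and the two rows of P7 are identical once rounded to six decimals, which is what "steady state" means here.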

To find the steady-state probabilities of the two webpages, we use the steady-state equation π = π*P and solve it as
shown in Exhibit 5. This gives steady-state values of approximately 0.64 for πA and 0.36 for πE; together they form
the eigenvector of P that corresponds to the eigenvalue 1. We can therefore recommend spending the advertising
effort on page A, since in the long run it is almost twice as likely (about 1.75 times) to attract clicks as page E. As we
turn to how Google ranks pages according to their relevance, it will be interesting to note that these eigenvector
values play a significant part.
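As a cross-check on the steady-state values, the two-page case can be solved exactly. The sketch below assumes only the transition probabilities given in the text; the variable names are ours. It uses the fact that, in a two-state chain, the steady-state equation reduces to πE * p_EA = πA * p_AE together with πE + πA = 1.

```python
from fractions import Fraction

# Off-diagonal transition probabilities of the two-page example.
p_EA = Fraction(7, 10)   # probability of moving from E to A
p_AE = Fraction(4, 10)   # probability of moving from A to E

# Balance condition pi_E * p_EA = pi_A * p_AE, with pi_E + pi_A = 1.
pi_A = p_EA / (p_EA + p_AE)
pi_E = p_AE / (p_EA + p_AE)

print(pi_E, pi_A)                 # exact fractions
print(float(pi_E), float(pi_A))   # decimal approximations
```

Working with exact fractions avoids rounding questions: the results are 4/11 for page E and 7/11 for page A, which match the stable values in Exhibit 4.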

Markov chains are widely used in industrial research, organizational behaviour, financial market analysis, human
resource planning, marketing forecasting and other fields. A very interesting use of Markov chains has been in the
music industry. As early as the 1950s, composers used Markov chains to study the pattern of notes in popular songs[3]
and thereby create new musical sequences based on the notes studied.

The example of linked webpages discussed above can now be extended to calculate the probability of arriving at any
webpage for a given search criterion, if the entire World Wide Web is considered as one large, connected,
memoryless chain. Based on the relevance criterion, we can estimate the highest relevance factor, and therefore any
page’s utility rank for a search string. This is the rationale behind Google’s patented PageRank algorithm.

[1] Weisstein, Eric W. "Markov Chain." From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/MarkovChain.html
[2] Tamara Lynn Anthony, Rice University: Markov Chains
[3] Verbeurgt Karsten, Dinolfo Michael, Fayer Mikhail: Extracting Patterns in Music for Composition via Markov Chains



Google’s PageRank algorithm[4] is a stochastic algorithm that determines the significance of a page relative to a search
string. It is not the only factor that Google uses to rank pages, but it is an important one. For Google (or for a web
administrator), the PageRank of a page denotes the probability that a random web surfer reaches that page after
clicking on many links. The PageRanks form a probability distribution over web pages, which is why the PageRanks of
all pages sum to 1. Refer to Exhibit 6 for a mathematical representation of the PageRank algorithm. Essentially,
the Google PageRank method ranks higher (i.e. as more important) those pages that are linked to by other highly
ranked or important pages.

Let us explain the algorithm with a real-life example. One of the co-authors of this report is an active technology
blogger who writes a blog called “The TechBend” at www.KundanBhaduri.com. Exhibit 7 shows that, for the search
string “Techbend blog”, www.KundanBhaduri.com has the highest PageRank and thus appears at the top of
Google’s search results. Interestingly, although there are other professional sites and blogs with domain names such
as www.TechBend.com, they do not appear anywhere near the top of Google’s search results. Let us explore how
this was achieved through the application of Markov chains.

Viewed as a whole, the internet is a connected graph of interlinked webpages (Exhibit 8), so its transition probability
matrix is extremely large. Exhibit 9 shows that for the homepage of The TechBend to rank high on Google’s
PageRank, its steady-state value (its entry in the eigenvector) has to be higher than that of all competing webpages
with the same context. More specifically, the page needs high-probability links from nodes (webpages) in the matrix
that themselves have high steady-state values. In other words, the probability of reaching our target page is high
when the surfer is coming from another high-probability page. We tested this logic with Exhibits 3 and 5, where A
achieved the higher value because it was more probable to arrive at A from E or to remain on A itself. This logic is at
the core of Google’s PageRank.

In our example, www.KundanBhaduri.com achieves a higher PageRank through its links with highly prominent
websites such as TechCrunch, Engadget and TED. Since these sites enjoy a high PageRank and link back to The
TechBend blog, the overall probability of a random surfer arriving at www.KundanBhaduri.com is higher than it is for
www.TechBend.com. This shows up as a higher steady-state value (Exhibit 10) and therefore a higher PageRank for
The TechBend. An important point to emphasise is that what matters is not just the number of links that a webpage
exchanges with others, but the relative importance of those links in the universe of all such links.

Issues to be addressed

However, since the internet is an extremely large set of nodes (over 1 trillion),[5] some issues need to be addressed
to make the Markov chain model workable for Google PageRank. Firstly, calculating the eigenvector of such a large
(and growing) matrix is non-trivial. We address this issue in the second part of the report.

Other than that, the issues related to handling dangling nodes (i.e. dead pages) and calculating an appropriate
damping factor are significant. The damping factor refers to the probability that the random user will not abruptly end
the session (by either exiting the browser or typing a new URL). In order to avoid a situation of creating an absorbing
Markov chain, pages with no outbound links are assumed to link out to all other pages in the collection. Their
PageRank scores are therefore divided evenly amongst all other pages.

Calculating the preliminary transition matrix of the web is also a significant challenge given the massive size of the
World Wide Web. A workaround is to ‘guess’ the transition matrix and then progressively correct the values. Since
Google recalculates the PageRanks every time it crawls the web, the approximation error decreases with each
iteration.

[4] Hwai-Hui Fu, Dennis K. J. Lin and Hsien-Tang Tsai (Dept. of Bus. Administration, Shu-Te University): Applied Stochastic Models in Business and Industry
[5] Alpert, Jesse, and Nissan Hajaj. "We Knew the Web Was Big..." Official Google Blog. 25 July 2008. Web. 06 Feb. 2011.



Techniques that may be used to overcome the problem of solving such a large system

Now that we understand how Google was able to apply a form of Markov chain modelling to create its PageRank system,
we address one of the most significant problems it faced: solving the system π = πP. For a small matrix we can quickly
find exact solutions to this equation. When the web was much smaller, Google could compute the steady-state vector
of 26 million pages in about 2 hours.[6] The result would then be used for a fixed period of time. However, because of
the sheer size of the World Wide Web, which Google says now exceeds the 1 trillion mark,[7] the resulting stochastic
matrix contains over a trillion rows and columns.

Additionally, given the dynamics of Web 2.0, it would no longer be efficient for Google to use stale data from these
computations for a fixed time interval. “Today, Google downloads the web continuously, collecting updated page
information and re-processing the entire web-link graph several times per day”.[8] In sum, the ever-changing and
ever-expanding nature of the World Wide Web and its content, coupled with the search engine’s commitment to provide
the best information available, greatly compounds the problem of solving the aforementioned system.

The resulting matrix of the web, with its more than a trillion rows and columns, is composed mostly of zeroes, given
that most webpages link to only a tiny number of other webpages. In fact, a 2004 study shows that the average number
of out-links from a given webpage is just 52, so on average only about 52 of the roughly one trillion elements in a row
are non-zero.[9] This means that the web matrix is very sparse.

To solve this problem, one of the main tools that can be used (or a variation thereof that Google appears to have
implemented) is called “The Power Method” or “Power Iteration”. Applied to the Google matrix, this method converges
to the PageRank vector; in other words, it ultimately gives us the weighting or importance of each webpage
relative to the entire matrix. The power method is an iterative process for approximating eigenvalues; we use it
to find the dominant eigenvalue and its eigenvector. “Eigenvectors of a square matrix are the non-zero vectors
which, after being multiplied by the matrix, remain proportional to the original vector”.[10] To implement this
method, we must assume that our matrix, which we now refer to as matrix A, has a dominant eigenvalue with
corresponding dominant eigenvectors. (Since the sequence below multiplies a column vector from the left, A plays the
role of the transpose of P in π = πP.) A dominant eigenvector of a matrix is an eigenvector corresponding to the
eigenvalue of largest magnitude of that matrix. To approximate a dominant eigenvector, we choose an initial
approximation of one of the dominant eigenvectors of A, which we call π0. We can then form the following sequence:[11]

$$
\begin{aligned}
\pi_1 &= A\pi_0 \\
\pi_2 &= A\pi_1 = A(A\pi_0) = A^2\pi_0 \\
\pi_3 &= A\pi_2 = A(A^2\pi_0) = A^3\pi_0 \\
&\;\;\vdots \\
\pi_k &= A\pi_{k-1} = A(A^{k-1}\pi_0) = A^k\pi_0
\end{aligned}
$$

For large values of k, this sequence provides a good approximation of the dominant eigenvector of A. The method
requires successive iterations until some convergence criterion is satisfied. With the dominant eigenvector in hand,
we can find the dominant eigenvalue using the Rayleigh quotient, as follows:[12]

[6] Alpert, Jesse, and Nissan Hajaj. "We Knew the Web Was Big..." Official Google Blog. 25 July 2008. Web. 06 Feb. 2011. <http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html>.
[7] Ibid.
[8] Ibid.
[9] Anuj Nanavati, Arindam Chakraborty, David Deangelis, Hasrat Godil, and Thomas D’Silva, An investigation of documents on the World Wide Web, http://www.iit.edu/~dsiltho/Investigation.pdf, December 2004.
[10] "Eigenvalues and Eigenvectors." Wikipedia, the Free Encyclopedia. 27 Sept. 2010. Web. 10 Feb. 2011. http://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors.
[11] Larson, Ron, David C. Falvo, and Bruce H. Edwards. Elementary Linear Algebra. Boston: Houghton Mifflin, 2004. 550-58. Print.
[12] Ibid.



$$
\lambda = \frac{A\pi \cdot \pi}{\pi \cdot \pi}
$$

“In cases for which the power method generates a good approximation of a dominant eigenvector, the Rayleigh quotient
provides a correspondingly good approximation of the dominant eigenvalue”.[13]
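The power method and the Rayleigh quotient can be sketched together on the two-page example from the first part of this report. This is a minimal illustration rather than Google’s implementation: the function names, the tolerance and the rescaling step (dividing each iterate by the sum of its entries, which assumes that sum is not zero) are our own choices.

```python
def mat_vec(A, v):
    """Multiply matrix A (a list of rows) by the column vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def power_method(A, v0, tol=1e-10, max_iter=1000):
    """Iterate v <- A v, rescaled to sum to 1, until successive iterates agree."""
    v = v0
    for _ in range(max_iter):
        w = mat_vec(A, v)
        total = sum(w)
        w = [x / total for x in w]      # rescale so the entries stay bounded
        converged = max(abs(a - b) for a, b in zip(w, v)) < tol
        v = w
        if converged:
            break
    # Rayleigh quotient: estimate of the dominant eigenvalue.
    rayleigh = dot(mat_vec(A, v), v) / dot(v, v)
    return v, rayleigh

# A is the transpose of the two-page transition matrix P, so that the
# row-vector equation pi = pi P becomes the column form pi = A pi.
A = [[0.3, 0.4],
     [0.7, 0.6]]
pi, lam = power_method(A, [0.5, 0.5])
print(pi, lam)
```

Starting from an even split between the two pages, the iterates settle on roughly 0.364 for E and 0.636 for A, and the Rayleigh quotient comes out very close to 1, the dominant eigenvalue of a stochastic matrix.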

One of the unique features of the Google matrix, as we briefly mentioned before, is that the number of non-zero
elements in a given row is quite small, because a given webpage contains only a small number of hyperlinks (Exhibit
11). Since all our computations involve this sparse matrix multiplied by vectors, an iteration of the power method is
considered very cheap.[14]
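To make the cost argument concrete, the sketch below stores only the links that exist instead of a full matrix. The four-page web and the function name `sparse_step` are hypothetical; the point is that one iteration does work proportional to the number of links rather than to the square of the number of pages.

```python
# Hypothetical 4-page web: each page maps to the list of pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def sparse_step(rank, links):
    """One step of pi <- pi * H that visits only the non-zero entries of H."""
    new_rank = {page: 0.0 for page in rank}
    for page, targets in links.items():
        if not targets:
            continue                        # a page with no links has an all-zero row
        share = rank[page] / len(targets)   # each out-link carries an equal share
        for target in targets:
            new_rank[target] += share
    return new_rank

rank = {page: 1.0 / len(links) for page in links}   # start from a uniform vector
rank = sparse_step(rank, links)
print(rank)
```

After one step, page "c" holds 0.625 of the probability because three pages link to it, while page "d" drops to zero because nothing links to it. That second effect is one reason the dangling-node and damping adjustments are needed in practice.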

Another necessary technique that Google implemented to make this system solvable is the fix for the dangling node
problem. What happens when a user arrives at a webpage that does not link out to any other webpage? Is our random
surfer absorbed by this webpage, never to leave? This is the dangling node problem: unless we correct for it, our
Markov chain treats these nodes as absorbing states. Suppose the Google matrix is called H. To correct for dangling
nodes, we can create a new matrix S = H + dw, where d is a column vector that identifies dangling nodes, with a 1 if the
node is dangling and a 0 otherwise, and w is a row vector (w1, w2, …, wn) that determines where our random surfer goes
next so that he is not absorbed. One way of assigning values to this row vector is to say that the surfer is equally likely
to land on any of the n webpages that exist, so that w = (1/n, 1/n, …, 1/n). Whilst there are other ways to assign w,
this is the most common, and it is sufficient for our purposes.
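The correction S = H + dw can be sketched directly from this description. The three-page matrix and the function name `fix_dangling` are hypothetical, and the code uses the uniform choice of w; it is an illustration, not a statement of how Google builds its matrix.

```python
def fix_dangling(H):
    """Return S = H + d*w, where d marks dangling rows and w is uniform."""
    n = len(H)
    w = [1.0 / n] * n                                       # uniform row vector w
    dangling = [1 if sum(row) == 0 else 0 for row in H]     # column vector d
    return [[H[i][j] + dangling[i] * w[j] for j in range(n)]
            for i in range(n)]

# Hypothetical 3-page web in which the third page (last row) has no out-links.
H = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
S = fix_dangling(H)

for row in S:
    print(row)
```

Rows that already sum to 1 are left unchanged, and the all-zero row becomes (1/3, 1/3, 1/3), so every row of S is a valid probability distribution and no state is absorbing.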

Another important technique that Google may use to help solve the system is the inclusion of a damping factor. The
damping factor accounts for the possibility that a web surfer may at any time choose not to follow the links available
on the current webpage and instead type in a URL to go to a page outside the current chain. Brin and Page reference
the damping factor in their original paper on Google (written while at Stanford): “The parameter d is a damping factor
which can be set between 0 and 1. We usually set d to 0.85”.[15]

While the damping factor is intended to model the behaviour of a random web surfer, it also serves the additional
purpose of speeding up convergence of the power method. This is because the ratio of the two eigenvalues of largest
magnitude determines how quickly the method converges.[16] It has been proven that the magnitude of the
second-largest eigenvalue of the Google matrix is less than or equal to the damping factor used.[17] Since the largest
eigenvalue is 1, the power method converges faster the further the damping factor is below 1. According to Rebecca
Wills, only 29 iterations are required for the difference between iterates to become less than 10^-2 when using a
damping factor of 0.85; the number of iterations goes up to 44 when the damping factor is raised to 0.90.[18] Hence, a
smaller damping factor makes this complex system easier to solve by reducing the number of iterations needed to
compute the PageRank vector.
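The iteration counts quoted above can be reproduced with a rule of thumb. If we assume that the error shrinks by roughly a factor equal to the damping factor on each iteration (a simplification based on the eigenvalue bound, not an exact statement), then the number of iterations needed is the smallest k for which the damping factor raised to the power k falls below the tolerance. The function name is our own.

```python
import math

def iterations_needed(damping, tolerance):
    """Smallest whole number k with damping**k <= tolerance."""
    return math.ceil(math.log(tolerance) / math.log(damping))

print(iterations_needed(0.85, 1e-2))   # 29
print(iterations_needed(0.90, 1e-2))   # 44
```

Under this assumption, raising the damping factor from 0.85 to 0.90 adds about half as many iterations again for the same tolerance, which is the trade-off between modelling accuracy and computing cost described in the text.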

Solving this enormous system is certainly no easy task for Google, especially at the speed it requires. Nevertheless,
Google has been able to overcome these significant obstacles through the application of existing mathematical
algorithms.

[13] Larson, Ron, David C. Falvo, and Bruce H. Edwards. Elementary Linear Algebra. Boston: Houghton Mifflin, 2004. 550-58. Print.
[14] Wills, Rebecca. “Google’s PageRank: The Math Behind the Search Engine.” The Mathematical Intelligencer 28.4 (Fall 2006): 6-11.
[15] Brin, S., and Page, L. "The Anatomy of a Large-scale Hypertextual Web Search Engine." Computer Networks and ISDN Systems 30.1-7 (1998): 107-17. Print.
[16] Gene H. Golub and Charles F. Van Loan, Matrix Computations, 3rd ed., The Johns Hopkins University Press, 1996.
[17] Taher H. Haveliwala and Sepandar D. Kamvar, The second eigenvalue of the Google matrix, Tech. report, Stanford University, 2003.
[18] Wills, Rebecca. “Google’s PageRank: The Math Behind the Search Engine.” The Mathematical Intelligencer 28.4 (Fall 2006): 6-11.



Exhibit 1: A sample 4-state Markov chain with transition probabilities

[Figure: state diagram with four states (1 to 4) and transition probabilities P11, P12, P23, P24, P34 and P41.]

Exhibit 2: Sample 4X4 transition Matrix

Exhibit 3: Explaining the basis of Markov’s chain[19]

[19] Image taken from http://en.wikipedia.org/wiki/Markov_chain



Exhibit 4: Demonstrating the stable state values using simple matrix multiplication

$$
P = \begin{pmatrix} 0.3 & 0.7 \\ 0.4 & 0.6 \end{pmatrix}, \qquad P_{ij}^{(2)} = (P^2)_{ij}
$$

$$
P^2 = \begin{pmatrix} 0.3 & 0.7 \\ 0.4 & 0.6 \end{pmatrix} \begin{pmatrix} 0.3 & 0.7 \\ 0.4 & 0.6 \end{pmatrix} = \begin{pmatrix} 0.37 & 0.63 \\ 0.36 & 0.64 \end{pmatrix}
$$

Power   First row               Second row
P3      0.363     0.637         0.364     0.636
P4      0.3637    0.6363        0.3636    0.6364
P5      0.36363   0.63637       0.36364   0.63636
P6      0.363637  0.636363      0.363636  0.636364
P7      0.363636  0.636364      0.363636  0.636364
P8      0.363636  0.636364      0.363636  0.636364
P9      0.363636  0.636364      0.363636  0.636364
P10     0.363636  0.636364      0.363636  0.636364
P11     0.363636  0.636364      0.363636  0.636364
P12     0.363636  0.636364      0.363636  0.636364

P7 to P12: stable state values



Exhibit 5: Calculating the steady state eigen values πA and πE

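π = π*P

Therefore,

$$
\begin{pmatrix} \pi_E & \pi_A \end{pmatrix} = \begin{pmatrix} \pi_E & \pi_A \end{pmatrix} \begin{pmatrix} 0.3 & 0.7 \\ 0.4 & 0.6 \end{pmatrix}
$$

Writing out the two equations:

1. πE = 0.3*πE + 0.4*πA
2. πA = 0.7*πE + 0.6*πA

Also, we know that:

3. πE + πA = 1

Since equations 1 and 2 are equivalent (each reduces to 0.7*πE = 0.4*πA), solving equations 2 and 3 together:

πA = 0.7*(1 − πA) + 0.6*πA

Or, πA = 7/11 ≈ 0.64

And, πE = 4/11 ≈ 0.36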

Exhibit 6: The improved Google PageRank algorithm

$$
PR(A) = (1 - d) \cdot \frac{1}{N} + d \cdot \left( \frac{PR(T_1)}{C(T_1)} + \frac{PR(T_2)}{C(T_2)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
$$

Where:

• PR(A) is the PageRank of page A
• PR(Ti) is the PageRank of pages Ti that link to page A
• C(Ti) is the number of outbound links on page Ti
• n is the total number of all pages that link to page A
• N is the total number of all pages on the web
• d is the damping factor, a value between 0 and 1.

It is noteworthy that there is an adjusting damping factor involved in the calculation. The above equation represents
the final version of the PageRank algorithm with the damping factor being incorporated within the first argument on
the RHS of the equation.
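The formula can be applied repeatedly to a small graph to see the ranks settle. The sketch below assumes the form PR(A) = (1 − d)/N + d * (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) with d = 0.85; the three-page web, the function name `pagerank` and the fixed number of iterations are hypothetical choices for illustration, and the graph is chosen to have no dangling pages.

```python
def pagerank(links, d=0.85, iterations=100):
    """Repeatedly apply PR(A) = (1 - d)/N + d * sum(PR(T)/C(T)) over inbound pages T."""
    pages = list(links)
    N = len(pages)
    pr = {page: 1.0 / N for page in pages}       # start from a uniform distribution
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            # Sum PR(T)/C(T) over every page T that links to page a.
            inbound = sum(pr[t] / len(links[t]) for t in pages if a in links[t])
            new_pr[a] = (1 - d) / N + d * inbound
        pr = new_pr
    return pr

# Hypothetical 3-page web; every page has at least one outbound link.
links = {
    "home": ["news", "blog"],
    "news": ["home"],
    "blog": ["home", "news"],
}
ranks = pagerank(links)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 4))
```

In this small example "home" ends up with the highest rank because both other pages link to it, and the ranks sum to 1, consistent with PageRank being a probability distribution over pages.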



Exhibit 7: PageRank of the search string ‘Techbend blog’

Exhibit 8: The correlation between a webpage and the rest of the web[20]

The importance of these links determines the overall importance of your webpage to the PageRank algorithm.

[20] Laure Ninove, Cristobald de Kerchove, Paul Van Dooren: Université Catholique de Louvain. http://www.esat.kuleuven.be/scd/golub/presentations/Gene_PVD.pdf



Exhibit 9: KundanBhaduri.com and its links to other sites

[Figure: the homepage of KundanBhaduri.com, which hosts the blog The TechBend, linked with TechCrunch, Engadget and TED (each labelled “Very high PageRank”) and with the rest of the internet.]



Exhibit 10: Applying Markov Chain method to calculate the PageRank for
‘TechBend blog’

Following is the probability matrix that shows the likelihood of a user clicking on a page to arrive at the homepage of
another website when she is searching for the string “TechBend blog”. All site names here refer to their respective
homepages, for the purpose of Markov chain analysis.

From (row) / To (column)   KundanBhaduri.com   TechBend.com   Engadget.com   TED.com   TechCrunch.com   …   Rest of the Internet
KundanBhaduri.com          0.6                 0.3            0.01           0.03      0.01             …   …
TechBend.com               0.42                0.1            0.12           0.01      0.11             …   …
Engadget.com               0.65                0.02           0.1            0.21      0.01             …   …
TED.com                    0.54                0.22           0.1            0         0.09             …   …
TechCrunch.com             0.64                0.17           0.13           0.01      0                …   …
…                          0.59                0.31           0.02           0.04      0.01             …   …
Rest of the Internet       …                   …              …              …         …                …   …

Transition Probabilities of
KundanBhaduri.com and TechBend.com

For the stable-state vector, π = π*P (1)

We assume:

Webpage              Steady-state value
KundanBhaduri.com    πA
TechBend.com         πB
Engadget.com         πC
TED.com              πD
TechCrunch.com       πE

Therefore, using (1) and reading down the first two columns of the table, we get:

πA = πA*0.6 + πB*0.42 + πC*0.65 + πD*0.54 + πE*0.64 + …*0.59 + … (2)

πB = πA*0.3 + πB*0.1 + πC*0.02 + πD*0.22 + πE*0.17 + …*0.31 + … (3)

Every listed coefficient in equation (2) is larger than the corresponding coefficient in equation (3), so πA >> πB,
provided that no other webpages on the internet are more important (i.e. have a higher probability rank) than the
pages described in the above table. We therefore conclude that KundanBhaduri.com will have a higher PageRank than
TechBend.com for the search term ‘TechBend blog’.



Exhibit 11: Computing a small Eigen value with Power Method

We know that: π = π*P
For a hypothetical P of order 20×20, notice that most of the entries are zero. This considerably reduces
the total cost of computing π*P, since every product of a π value with a zero entry of P is itself zero and can be skipped.

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 4 0 0 0 4 0 0 4 0 0 9 0 0 7 0 0 0 1
0 9 0 0 6 0 0 12 0 8 0 0 8 0 0 5 0 0 2 0
8 0 7 0 0 8 0 0 4 0 2 0 2 0 5 0 0 6 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 5 3
0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 4 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 6 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0
0 0 5 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0
0 0 0 0 0 0 8 0 8 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0
0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 5 0 7 8 0 6 0 6 0 8 1 0
0 0 0 0 0 0 0 8 0 0 0 0 9 0 0 0 2 0 1 0
0 0 0 0 0 8 0 0 0 0 0 0 0 0 7 0 0 0 0 0
0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 5 5 0 0 0 0 0 8 0 0 7

Therefore πA = ∑k πk * Pkj for j = 1, where k runs over all 20 states.

Since most of the terms above are zero, we only need to account for rows 1 and 4 of the first column in the table
above. Therefore, πA = 1 * πA + 8 * πD

This lets us work with a large Markov transition probability matrix at a small fraction of the cost of a dense computation.

