2. ACKNOWLEDGEMENT
We would like to give our special thanks to our mathematics teachers, who
taught us the knowledge necessary to make this project a success. We are
thankful to the NIT Hamirpur academic block for offering such interesting
courses in our semester. This course helped to broaden our horizons and
enabled us to understand the importance of a practical approach.
This project is essentially a blend of the academic knowledge from our course
and practical knowledge.
We have greatly benefited from this course, and its knowledge will surely help
us in the future.
3. What is PageRank?
PageRank is an algorithm used by Google Search to rank
web pages in its search results. It was named after
Larry Page, one of the founders of Google. PageRank
works by counting the number of links to a page to
determine a rough estimate of how important the
website is. Currently, PageRank is not the only algorithm
used by Google, but it was the first algorithm used
by the company and it is the best known.
PageRank is a link analysis algorithm: it assigns a
numerical weighting to each element of a hyperlinked
set of documents, with the purpose of ‘measuring’ its
relative importance within the set.
6. Algorithm
The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person
randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections of
documents of any size. It is assumed that the distribution is evenly divided among all documents in the
collection at the beginning of the computational process. The PageRank computation requires several passes,
called "iterations", through the collection to adjust the approximate PageRank values to more closely reflect
the theoretical true value.
A probability is expressed as a numeric value between 0 and 1. A 0.5 probability is commonly expressed as a
"50% chance" of something happening. Hence, a document with a PageRank of 0.5 means there is a 50%
chance that a person clicking on a random link will be directed to that document.
7. Simplified formula of PageRank
Assume a small universe of four web pages: A, B, C, and D. Links from a page to itself are ignored, and
multiple outbound links from one page to another page are treated as a single link. PageRank is initialized
to the same value for all pages, i.e. 1/4 or 0.25.
Simple diagram of web pages A, B, C and D; the arrows indicate links from one page to another.
Here there is 1 outbound link from A, 1 from B, 3 from C, and 1 from D. So the PageRank of A will be

PR(A) = PR(B)/1 + PR(C)/3 + PR(D)/1        ……..equ. 1
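As a quick numeric sketch, equ. 1 can be evaluated with the initial value of 0.25 for every page and the outbound-link counts stated above:

```python
# Evaluate equ. 1 with the initial PageRank of 0.25 for every page.
# Outbound-link counts (1 from B, 3 from C, 1 from D) are taken from the text.
PR_B = PR_C = PR_D = 0.25
PR_A = PR_B / 1 + PR_C / 3 + PR_D / 1
print(PR_A)   # 0.5833... (= 7/12)
```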
8. Calculating the PageRank via Iterations
• Suppose that we have the same web graph as shown
in slide 7.
• At iteration 1 the PageRank of all websites is
equal, i.e. 1/n.
• At iteration 2 the PageRank of a page (say A) depends
upon the inbound links from other websites to that
page, e.g.

PR(A) = (1/4)/3

• It goes on the same way for the other websites and the
next iterations, as shown in the table.
• Each iteration stores the PageRank of the
webpages.
• Here C has the highest PageRank.
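The iteration scheme can be sketched in code. The link structure below is an assumed example graph (the slide's own diagram is not reproduced here), so the resulting numbers are illustrative only:

```python
# Iterative PageRank on a hypothetical 4-page graph.
links = {           # page -> pages it links to (assumed for illustration)
    "A": ["C"],
    "B": ["A"],
    "C": ["A", "B", "D"],
    "D": ["A"],
}
n = len(links)
pr = {p: 1 / n for p in links}            # iteration 1: every page gets 1/n

for _ in range(50):                       # later iterations refine the values
    new = {p: 0.0 for p in links}
    for p, outs in links.items():
        for q in outs:
            new[q] += pr[p] / len(outs)   # p shares its rank equally among its links
    pr = new

print(pr)   # converges to A=0.375, B=0.125, C=0.375, D=0.125
```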
9. Limitations of the iterative method
It is not suitable for a large number of websites (thousands or more).
This method takes time.
Making the iteration table is troublesome.
10. PageRank with matrix representation
We can use matrix operations instead of the iterative approach because we
can do multiple operations at the same time. There are 3 methods:
Power method
Steady-state method
Random surfer method
11. 1. Power method
Consider the webgraph given as shown. According to the
graph, the transition matrix formed is H, which is shown.
Now, according to the power method:
V2 = H × V
V3 = H × V2 = H^2 × V
…
Vn = H^(n-1) × V
V: the column vector that contains the initial PageRank of the pages.
H: the transition matrix based on the webgraph.
We can measure the error ε as the difference v(n+1) − v(n)
between adjacent PageRank vectors. If this error is small
enough, the algorithm terminates, i.e.
|v(n+1) − v(n)| < ε
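A minimal sketch of the power method, assuming an illustrative 4-page transition matrix H (column j holds 1/outdegree for every page that page j links to; this is not the slide's own webgraph):

```python
import numpy as np

H = np.array([              # assumed transition matrix of a 4-page graph
    [0.0, 1.0, 1/3, 1.0],
    [0.0, 0.0, 1/3, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1/3, 0.0],
])
n = H.shape[0]
v = np.full((n, 1), 1 / n)  # initial PageRank column vector

eps = 1e-8                  # error tolerance
while True:
    v_next = H @ v          # V(n+1) = H x V(n)
    if np.linalg.norm(v_next - v) < eps:   # |v(n+1) - v(n)| < eps
        break
    v = v_next

print(v_next.ravel())       # ~[0.375, 0.125, 0.375, 0.125]
```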
12. 2. Steady-state method
The steady-state method is all about transforming the problem into an
eigenvector-eigenvalue problem. The steady state is when the eigenvalue is equal to 1,
i.e. Hx = x
H: transition matrix; x: final PageRank vector of the given pages.
Here we have to solve an eigenvalue-eigenvector problem where the eigenvalue
is 1, because when we apply the H matrix, nothing happens (we end up with
the same x vector), which means this is the steady state. What's essential is that
there is no need for initialization (no need for the 1/n values) and no need to
make iterations.
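A sketch of the steady-state approach on an assumed 4-page transition matrix: numpy's eigendecomposition finds the x with Hx = x directly, without iterations:

```python
import numpy as np

H = np.array([              # assumed transition matrix of a 4-page graph
    [0.0, 1.0, 1/3, 1.0],
    [0.0, 0.0, 1/3, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1/3, 0.0],
])

vals, vecs = np.linalg.eig(H)
k = np.argmin(np.abs(vals - 1))   # pick the eigenvalue closest to 1
x = np.real(vecs[:, k])
x = x / x.sum()                   # normalise so the ranks sum to 1
print(x)                          # ~[0.375, 0.125, 0.375, 0.125]
```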
13. 3. Random surfer model
Assumptions:
The importance of a web page is measured by its popularity, i.e. how many
incoming links it has.
PageRank can be defined as the probability that a random surfer on the
web, starting on a random page and following hyperlinks, visits the given page.
The transition matrix H is useful here as well. In this method we also have to
multiply v by H until it reaches the stationary state,
i.e. Hv = v
Again we will end up with the same result as in the steady-state method.
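The random-surfer view can be sketched as a simple simulation on an assumed 4-page graph; the fraction of steps the surfer spends on each page estimates its PageRank:

```python
import random

links = {           # page -> pages it links to (assumed for illustration)
    "A": ["C"],
    "B": ["A"],
    "C": ["A", "B", "D"],
    "D": ["A"],
}
random.seed(0)      # fixed seed so the run is reproducible

counts = {p: 0 for p in links}
page = "A"                              # start on an arbitrary page
steps = 200_000
for _ in range(steps):
    page = random.choice(links[page])   # follow a random outgoing link
    counts[page] += 1

ranks = {p: c / steps for p, c in counts.items()}
print(ranks)        # roughly A=0.375, B=0.125, C=0.375, D=0.125
```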
14. Problems
The models we have discussed so far are not perfect; there are some
cases where they fail completely and do not give correct results.
These cases are:
Dangling nodes (nodes with no outgoing link)
Disconnected nodes
15. Dangling nodes (nodes with no outgoing link)
A dangling node does not have any outgoing links. Now if we try to find the rank of the webpages:
Fig: Node 3 is a dangling node
In this case the rank of every page is 0. Something is not right, as page 3 has 2 incoming links, so it
must have some importance!
So in the case of a dangling node our method fails.
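The failure can be demonstrated numerically. The 3-page graph below is an assumption in the spirit of the figure: pages 1 and 2 both link to page 3, and page 3 links nowhere, so its column in H is all zeros:

```python
import numpy as np

H = np.array([              # assumed graph: 1 -> {2, 3}, 2 -> {3}, 3 -> nothing
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],        # the all-zero third column is the dangling node
])

v = np.full((3, 1), 1 / 3)
for _ in range(5):
    v = H @ v               # rank "leaks out" through the dangling node
print(v.ravel())            # [0. 0. 0.] -- every rank has collapsed to zero
```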
16. Disconnected nodes
A random surfer that starts in the first connected component has no way
of getting to web page 5, since nodes 1 and 2 have no links to node 5
that he can follow. Linear algebra fails to help as well.
Fig: Disconnected nodes
In this case too, the methods discussed so far fail.
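This can also be seen in the linear algebra. Assuming two separate components (here 1↔2 and 3↔4 for brevity), the eigenvalue 1 of H is repeated, so Hx = x has no unique solution:

```python
import numpy as np

H = np.array([              # assumed graph with two components: 1<->2 and 3<->4
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

vals = np.linalg.eigvals(H)
ones = int(np.isclose(vals, 1).sum())
print(ones)                 # 2 -- eigenvalue 1 appears once per component
```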
17. Damping Factor – Final PageRank model
In order to solve the problems discussed above, the concept of a damping factor is
introduced in this model. The model also follows the assumptions of the random surfer model.
The damping factor is the probability that the random surfer leaves the given page and navigates to
a completely new one.
So the final PageRank formula is
G = (1-d)H + dB
G: the PageRank matrix or Google matrix.
d: damping factor
H: transition matrix
B: matrix with all entries 1/n, i.e. (for n = 3)
B = (1/n) × [ 1 1 1 ]
            [ 1 1 1 ]
            [ 1 1 1 ]
18. G = (1-d)H + dB
If d is high, the random surfer navigates to new pages (teleportation) quite often.
If d is low, the random surfer has a tendency to follow links instead of teleporting.
With d = 0.15 there is a 15% chance that the surfer will leave the page and an 85% chance
that the surfer will follow the links on the given webpage.
Here the matrix G (also written M below) has the same features as H.
Sometimes, with a small probability, the surfer leaves the
actual page and navigates to another one; this is called
"teleportation". If we have n websites, then the probability
of the surfer teleporting to any particular page is 1/n;
that is why B has the 1/n term.
Most of the time, with probability (1-d), the surfer will
follow links on the given page and visit one of the
neighbours of the actual page.
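A sketch of the damping-factor fix, using the document's convention G = (1-d)H + dB with d = 0.15 as the teleport probability. The 3-page H below is an assumed dangling-node graph; because its dangling column makes G sub-stochastic, the rank vector is renormalised after each step:

```python
import numpy as np

d = 0.15                          # damping (teleport) probability
H = np.array([                    # assumed graph: 1 -> {2, 3}, 2 -> {3}, 3 -> nothing
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])
n = H.shape[0]
B = np.full((n, n), 1 / n)        # teleportation matrix, all entries 1/n
G = (1 - d) * H + d * B           # Google matrix

v = np.full((n, 1), 1 / n)
for _ in range(100):
    v = G @ v
    v = v / v.sum()               # renormalise (the dangling column of H leaks rank)
print(v.ravel())                  # every page, including page 3, now has a positive rank
```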
19. Problem
The M matrix will be enormous, with lots of rows and columns, so we cannot handle it
directly.
That is why we use the power method approximation: instead of initializing the v vector
with entries 1/n, we can use the value 1 for every entry. For a random matrix this is not
going to be any faster, but the Google matrix is sparse; a given node has a small number
of outgoing links, so it works fine.
In view of everything discussed above, we conclude:
Fact: The PageRank vector for a web graph with transition matrix H and damping
factor d is the unique probabilistic eigenvector of the matrix M, corresponding to
the eigenvalue 1.
20. Perron-Frobenius theorem
If M is a positive matrix (all values greater than 0) and column stochastic (the sum of each
column is 1), which is true in this case, then according to the Perron-Frobenius theorem:
1 is an eigenvalue of multiplicity one.
1 is the largest eigenvalue: all the other eigenvalues have absolute value smaller than 1.
The eigenvectors corresponding to the eigenvalue 1 have either only positive entries or only
negative entries. In particular, for the eigenvalue 1 there exists a unique eigenvector with the sum
of its entries equal to 1.
Intuitively, the matrix M "connects" the graph and gets rid of the dangling nodes: a node with no
outgoing edges now has a positive probability of moving to any other node.
Power method convergence theorem
If we have a matrix M which is positive and column stochastic (the sum of each column is 1),
then there is an eigenvector w corresponding to the eigenvalue 1, and the sequence
v, Mv, M^2 v, …, M^k v converges to w. Here w is going to store the PageRank of
all the websites on the WWW.
Where v is the initial vector with all entries equal to 1/n.
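These claims can be checked numerically on a small example. The matrix M below is a hypothetical positive, column-stochastic matrix (its values are assumptions for illustration, not a real Google matrix):

```python
import numpy as np

M = np.array([                    # assumed positive, column-stochastic matrix
    [0.1, 0.3, 0.5],
    [0.4, 0.3, 0.2],
    [0.5, 0.4, 0.3],
])
assert np.allclose(M.sum(axis=0), 1)      # every column sums to 1

vals, vecs = np.linalg.eig(M)
top = float(np.max(np.abs(vals)))
print(round(top, 6))                      # 1.0 -- the largest eigenvalue is 1

# the power sequence v, Mv, M^2 v, ... converges to the eigenvector w
k = np.argmin(np.abs(vals - 1))           # eigenvalue 1
w = np.real(vecs[:, k])
w = w / w.sum()                           # unique eigenvector with entries summing to 1
v = np.full(3, 1 / 3)
for _ in range(200):
    v = M @ v
print(np.allclose(v, w))                  # True
```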
21. Pseudo Code
BEGIN
LOOP i = 0 TO N-1
  LOOP j = 0 TO N-1
    READ S
    IF (S != 0) THEN
      A[i][j] = 1/S
    ELSE
      A[i][j] = 0
    END IF
  END LOOP
END LOOP
Initialize Val, Vec, R
PRINT Vec
R (4 x 1) = {1, 1, 1, 1}
PRINT R
T = 0

HERE N IS THE NO. OF NODES
A IS THE (INITIALLY EMPTY) TRANSITION MATRIX
S IS THE NO. OF OUTGOING CONNECTIONS OF THE CORRESPONDING NODE
VAL HOLDS THE EIGENVALUES, FOUND MATHEMATICALLY OR MANUALLY
VEC IS THE EIGENVECTOR FOUND MANUALLY
R IS A MATRIX OF ORDER 4 x 1 WITH ALL ENTRIES 1
T COUNTS THE ITERATIONS (TEST CASES)
22. Pseudo Code (continued)
LOOP WHILE (T < 7)
  Rank = A × R
  R = Rank
  T = T + 1
END LOOP
PRINT R
END
23. PROGRAM
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 22 22:45:15 2020
@author: team17
"""
import numpy as np

n = int(input("no of nodes "))
a = np.eye(n, n)              # transition matrix (every entry is overwritten below)
for i in range(0, n):
    for j in range(0, n):
        s = int(input())      # outdegree of node j if j links to node i, else 0
        if s != 0:
            a[i][j] = 1 / s
        else:
            a[i][j] = 0

# power iteration, as in the pseudo code: multiply the rank vector by a 7 times
r = np.ones((n, 1))           # rank vector R with all entries 1
for t in range(7):
    r = a @ r
print(r)