1. Jung Hoon Kim
N5, Room 2239
E-mail: junghoon.kim@kaist.ac.kr
2014.01.14
KAIST Knowledge Service Engineering
Data Mining Lab.
2. Introduction
First introduced by Sergey Brin & Larry Page in 1998
Earlier ranking algorithms were no longer suitable for the web of 1996
The number of web pages grew rapidly
In 1996 the query "classification technique" already matched about 10 million relevant pages!
Content-similarity methods are easily spammed, leaving rankings vulnerable to spam pages
3. Basic
The PageRank algorithm is built on two principles (captured by the formula below):
A hyperlink from a page pointing to another page is an implicit conveyance of authority to the target page. Thus, the more in-links a page i receives, the more prestige page i has.
Pages that point to page i also have their own prestige scores. A page with a higher prestige score pointing to i is more important than a page with a lower prestige score pointing to i.
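One standard way to write both principles as a single score equation (O_j denotes the number of out-links of page j, E the set of hyperlinks):

```latex
% Prestige of page i: every page j splits its own score
% evenly among its O_j out-links.
P(i) = \sum_{(j,i)\in E} \frac{P(j)}{O_j}
```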
4. Principle
The hyperlink trick: a node with more incident (incoming) edges is more important.
5. Authority
What people with more authority say is more important. For example, John is a computer scientist and Alice is a cook, so John's word carries more weight on a computing topic.
6. Big picture
The big picture: a famous person corresponds to a node with many incident edges.
7. Cyclic problem
On the web there are many cycles like this one
The graph contains the cycle A -> B -> E
This means naive score propagation increases the scores infinitely
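A minimal sketch of the blow-up, assuming a naive update in which every node simply adds the score flowing in from its predecessor (the three-node cycle is a hypothetical stand-in for the slide's example graph):

```python
# Hypothetical cycle A -> B -> E -> A.
# Naive update: each page keeps its old score and adds the score
# arriving from its in-neighbor, so nothing ever leaves the cycle.
scores = {"A": 1.0, "B": 1.0, "E": 1.0}
preds = {"A": "E", "B": "A", "E": "B"}   # in-neighbor of each node

for step in range(1, 11):
    scores = {node: scores[node] + scores[preds[node]] for node in scores}
    print(step, scores)
# Scores double every step (1024 after 10 steps): the cycle traps
# and amplifies score instead of letting it converge.
```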
8. Random surfer trick
To avoid these problems (among other reasons), they adopted the random surfer model
At every step, the surfer may jump from the current node to any other node
This solves the cycle problem, and nodes with many incident links still obtain high rank
The jump probability is often called the damping factor (d)
In Google's initial model, d = 0.15
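A minimal random-surfer simulation, assuming a small hypothetical graph (node names and links are made up for illustration); visit frequencies approximate the PageRank scores:

```python
import random

# Hypothetical link structure: page -> pages it links to.
links = {
    "A": ["B"],
    "B": ["E"],
    "E": ["A", "D"],
    "D": ["A"],
}
d = 0.15                       # random-jump probability (deck's convention)
pages = list(links)
visits = {p: 0 for p in pages}

current = random.choice(pages)
for _ in range(1000):          # the 1000-step test on the next slide
    if random.random() < d or not links[current]:
        current = random.choice(pages)           # random jump
    else:
        current = random.choice(links[current])  # follow a hyperlink
    visits[current] += 1

for p in pages:                # percentages are easier to compare
    print(p, f"{100 * visits[p] / 1000:.1f}%")
```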
9. Test
Result of a 1,000-step test
Nearly correct: D and A get high ranks
A has only one incident link, but it comes from a prestigious page (the second principle)
Expressing the ranks as percentages is a good way to identify them at a glance
13. Formula
Mathematically, we have a system of n linear equations
P = (P1, P2, P3, ..., Pn)^T
A is the adjacency (transition) matrix, so we can write the formula below
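A reconstruction of the slide's formula in its standard form (A_ij is the probability of moving from page i to page j):

```latex
% System of n linear equations in matrix form: P is the column
% vector of PageRank scores, A the transition matrix built from
% the hyperlink graph (O_i = number of out-links of page i).
P = A^{\top} P,
\qquad
A_{ij} =
\begin{cases}
  \dfrac{1}{O_i} & \text{if page } i \text{ links to page } j \\[4pt]
  0 & \text{otherwise}
\end{cases}
```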
15. Linear Algebra
From this formula, P is an eigenvector of A^T with the corresponding eigenvalue of 1
1 is the largest eigenvalue, and the PageRank vector P is the principal eigenvector
To calculate P, we can use the power iteration algorithm
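The power iteration update, sketched in the usual form (starting from the uniform vector):

```latex
% Repeatedly multiply by A^T until the scores stop changing.
P_0 = \left( \tfrac{1}{n}, \ldots, \tfrac{1}{n} \right)^{\top},
\qquad
P_{k+1} = A^{\top} P_k
```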
16. Conditions
But the conditions are that A is a stochastic matrix and that it is irreducible and aperiodic
We can view the graph as a Markov model: each web page is a state (node) and each hyperlink is a transition
A is not a stochastic matrix here because it has a zero row (row 5); a zero row means the page has no out-links
We fix the problem by adding a complete set of outgoing links from each such page i to all the pages on the Web
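A minimal sketch of that fix with NumPy (the matrix is a small hypothetical example; every all-zero row is replaced by a uniform distribution over all pages):

```python
import numpy as np

# Hypothetical transition matrix with a dangling page:
# row 2 is all zeros, so A is not stochastic.
A = np.array([
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],   # page with no out-links
])

n = A.shape[0]
dangling = A.sum(axis=1) == 0
# Give each dangling page a complete set of out-links:
# it now points to every page with probability 1/n.
A[dangling] = 1.0 / n

print(A)   # every row now sums to 1, so A is stochastic
```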
18. Irreducible
A is irreducible if there is a path from u to v for every pair of nodes u and v; if some pair of nodes has no connecting path, A is not irreducible
A state i is periodic with period k > 1 if k is the smallest number such that all paths leading from state i back to state i have a length that is a multiple of k. If a state is not periodic, it is aperiodic. A Markov chain is aperiodic if all states are aperiodic
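A minimal irreducibility check, reusing the adjacency-list style from the earlier sketches: the chain is irreducible exactly when every node can reach every other node, i.e., the graph is strongly connected:

```python
from collections import deque

def reachable(links, start):
    """Return the set of nodes reachable from start (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in links.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def is_irreducible(links):
    # Irreducible <=> every node reaches every other node.
    nodes = set(links)
    return all(reachable(links, u) == nodes for u in nodes)

links = {"A": ["B"], "B": ["E"], "E": ["A"]}   # hypothetical cycle
print(is_irreducible(links))                   # True: strongly connected
```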
19. PageRank
It is easy to deal with the above two problems with a single strategy
We add a link from each page to every page and give each such link a small transition probability controlled by a parameter d
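One common way to write the resulting model per page, using this deck's convention that d is the small random-jump probability (so d = 0.15):

```latex
% With probability d the surfer jumps to a page chosen uniformly
% from all n pages; otherwise it follows a hyperlink as before.
P(i) = \frac{d}{n} + (1 - d) \sum_{(j,i)\in E} \frac{P(j)}{O_j}
```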
20. PageRank
The PageRank values of the Web pages can be computed with the power iteration method, which produces the principal eigenvector with an eigenvalue of 1
The iteration ends when the PageRank values do not change much, i.e., when they converge
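A compact power-iteration sketch in NumPy (the matrix, tolerance, and iteration cap are illustrative; the L1 change between iterations serves as the convergence test):

```python
import numpy as np

def pagerank(A, d=0.15, tol=1e-8, max_iter=100):
    """Power iteration on a row-stochastic transition matrix A.
    d is the random-jump probability (the deck's convention)."""
    n = A.shape[0]
    # Google matrix: mix link-following with uniform random jumps.
    G = (1 - d) * A + d * np.ones((n, n)) / n
    P = np.full(n, 1.0 / n)              # start from the uniform vector
    for _ in range(max_iter):
        P_next = G.T @ P                 # one multiplication by G^T
        if np.abs(P_next - P).sum() < tol:
            return P_next                # scores converged
        P = P_next
    return P

A = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [1/3, 1/3, 1/3]])          # hypothetical stochastic matrix
print(pagerank(A))                       # scores sum to 1
```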
21. Real PageRank
Dealing with web spam is the most important issue
Giving every page the same random-surfer constant and computing scores for all pages requires many iterations
Currently, Google uses more than 200 factors to calculate its web ranking