Ranking nodes in growing networks: when PageRank fails

Pietro De Nicolao
pietro.denicolao@mail.polimi.it
Politecnico di Milano
April 14, 2016

Outline
Introduction
A growing network model: the Relevance Model
Real data analysis
Conclusions

Paper
www.nature.com/scientificreports
Ranking nodes in growing networks: When PageRank fails
Manuel Sebastian Mariani, Matúš Medo & Yi-Cheng Zhang
PageRank is arguably the most popular ranking algorithm which is being applied in real systems
ranging from information to biological and infrastructure networks. Despite its outstanding
popularity and broad use in different areas of science, the relation between the algorithm’s efficacy
and properties of the network on which it acts has not yet been fully understood. We study here
PageRank’s performance on a network model supported by real data, and show that realistic
temporal effects make PageRank fail in individuating the most valuable nodes for a broad range of
model parameters. Results on real data are in qualitative agreement with our model-based findings.
This failure of PageRank reveals that the static approach to information filtering is inappropriate for
a broad class of growing systems, and suggest that time-dependent algorithms that are based on the
temporal linking patterns of these systems are needed to better rank the nodes.
Received: 06 June 2015
Accepted: 10 August 2015
Published: 10 November 2015
Published in Nature Scientific Reports on 10 November 2015.

PageRank: recap
Most popular ranking algorithm for unipartite directed
networks.
Invented for Google’s search algorithm
Also used for the ranking of:
scholarly papers
images in search
urban roads according to traffic flow
proteins in their interaction network
etc.
A node is important if it is pointed to by other important
nodes.
$p_{ij} = (1 - \gamma)\,\frac{w_{ij}}{s_i^{\text{out}}} + \gamma\,\frac{1}{N}$
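One way to turn the transition probability above into node scores is power iteration. The sketch below is a minimal illustration, not the implementation used in the paper. It assumes that w_ij is the weight of the link from i to j, that s_i^out is the total outgoing weight of node i, that γ is the teleportation probability, and that nodes without outgoing links jump to a uniformly random node. The function name `pagerank` and the default `gamma=0.15` are illustrative choices that do not come from the slides.

```python
import numpy as np

def pagerank(weights, gamma=0.15, tol=1e-12, max_iter=1000):
    """Power iteration on p_ij = (1 - gamma) * w_ij / s_i_out + gamma / N.

    weights[i, j] is the weight of the link i -> j (assumed convention).
    """
    w = np.asarray(weights, dtype=float)
    n = w.shape[0]
    s_out = w.sum(axis=1)                      # out-strength of each node
    safe = np.where(s_out > 0, s_out, 1.0)     # avoid division by zero
    walk = w / safe[:, None]
    walk[s_out == 0] = 1.0 / n                 # assumption: dangling nodes jump uniformly
    p = (1.0 - gamma) * walk + gamma / n       # row-stochastic transition matrix
    score = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = score @ p
        converged = np.abs(new - score).sum() < tol
        score = new
        if converged:
            break
    return score

# Toy network: 0 -> 2, 1 -> 2, 2 -> 0
adjacency = np.array([[0, 0, 1],
                      [0, 0, 1],
                      [1, 0, 0]])
scores = pagerank(adjacency)
print(scores)
```

In this toy network node 2 collects the score of both other nodes, while node 1, which nobody points to, only receives the teleportation share γ/N.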

One PageRank fits all?
What is the relation between PageRank's efficacy and the
properties of the network?
PageRank: static approach
PageRank discards temporal information
works as if all nodes appeared at the same time
well-known bias towards old nodes
Theoretical models and real networks can exhibit
strong temporal patterns.
Are there circumstances under which
the algorithm is doomed to fail?

The Relevance Model (RM)
What is the Relevance Model?
growing directed network model
with preferential attachment and relevance
generalizes the classical Barabási-Albert model
introduced by [Medo, 2011]
Why use the RM?
model that best explains the linking patterns in real networks
used to model WWW, citation and technological networks

Relevance Model features
1. Preferential attachment
similar to the Barabási-Albert model
Matthew effect: the rich get richer
significant difference: existing nodes also create new links
2. Fitness
quality parameter assigned to each node
node’s inherent competence in attracting new incoming links
concept previously explored in [Bianconi-Barabási, 2001]
3. Relevance and activity
Relevance: capacity of attracting new links over time
Activity: rate at which the node generates new outgoing links
4. Temporal decay
Relevance and activity both decay with time
Monotonic decay function of choice (e.g. exponential, power law)
Real-world phenomenon: nodes lose relevance over time

Relevance decay: a real-world example
[Plot: relevance (logarithmic axis, 0.01 to 1000) versus years after publication (5 to 50), for three groups of papers: indegree > 100, 10 <= indegree < 100, indegree < 10. Symbols show the average relevance of each group, error bars the error of the mean, and lines the fits; the non-monotonic part of the relevance profile is ignored by the fitting.]
Figure: Temporal decay of the average (empirical) relevance r(t) of
papers in the American Physical Society citation network (1893-2009).
This behaviour was previously highlighted in [Medo, 2011].

How to build a network with the Relevance Model
At each discrete time step t, the generation algorithm
proceeds as follows:
1. A new node is created and connected to an existing node i,
chosen with probability $\Pi_i^{\text{in}}(t)$.
2. If t > 10, then m = 10 existing nodes are sequentially chosen
with probability $\Pi_i^{\text{out}}(t)$ and become active:
each selected node creates one outgoing link
it selects a node j as a target with probability $\Pi_j^{\text{in}}(t)$
No multiple links.
No self loops.

Relevance Model: Link generation mechanism
The probability for the node i to be the target of a new link created
at time t is:
$\Pi_i^{\text{in}}(t) \sim (k_i^{\text{in}}(t) + 1)\, \eta_i\, f_R(t - \tau_i)$
$k_i^{\text{in}}(t)$: current indegree of node i
$\eta_i$: fitness of i
$\tau_i$: time at which i enters the network
$f_R$: monotonically decaying function of time
$R_i(t) := \eta_i f_R(t - \tau_i)$: relevance of node i at time t.
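As a small worked example of this link-attraction rule, the sketch below normalizes the weights (k_in + 1) · η · f_R(t − τ) over a set of candidate nodes. It assumes an exponential decay f_R(t) = exp(−t/θ_R), which is only one of the admissible choices; the function name `target_probabilities` and the toy numbers are hypothetical.

```python
import numpy as np

def target_probabilities(k_in, eta, tau, t, theta_r):
    """Normalized probability of each node receiving the next link at time t."""
    k_in = np.asarray(k_in, dtype=float)
    eta = np.asarray(eta, dtype=float)
    tau = np.asarray(tau, dtype=float)
    relevance = eta * np.exp(-(t - tau) / theta_r)   # R_i(t), exponential decay assumed
    weights = (k_in + 1.0) * relevance               # preferential attachment times relevance
    return weights / weights.sum()

# An old, well-cited node (indegree 50, entered at t = 0) competes with
# a recent node (indegree 5, entered at t = 90); both have the same fitness.
probs = target_probabilities(k_in=[50, 5], eta=[1.0, 1.0], tau=[0, 90],
                             t=100, theta_r=20.0)
print(probs)
```

With this fast decay the recent node attracts most new links despite its much lower indegree, which is the mechanism that lets recent nodes reach high indegree in the RM.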

Relevance Model: Active nodes selection
In the RM, nodes remain active and keep generating outgoing links
over time.
Probability for node i to be chosen as an active node at time t:
$\Pi_i^{\text{out}}(t) \sim A_i\, f_A(t - \tau_i)$
$A_i$: activity parameter
$\tau_i$: time at which i enters the network
$f_A$: monotonically decaying function of time
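The generation procedure and the two selection probabilities can be combined into a small simulation. The following is a sketch under several assumptions that the slides do not fix: both decay functions are exponential, node i enters at time t = i, the newly created node is itself eligible to become active, active nodes are drawn independently (with replacement), fitness is drawn from exp(−η) and activity from 2A^−3 on [1, ∞). The function name `grow_rm_network` and its default parameters are hypothetical.

```python
import numpy as np

def grow_rm_network(n_nodes=200, m=10, theta_r=20.0, theta_a=200.0, seed=0):
    """Grow a directed network; returns (edges, tau, eta), edges as (source, target)."""
    rng = np.random.default_rng(seed)
    eta = rng.exponential(1.0, n_nodes)               # fitness values
    activity = (1.0 - rng.random(n_nodes)) ** -0.5    # activity values, A >= 1
    tau = np.arange(n_nodes, dtype=float)             # node i enters at time i
    k_in = np.zeros(n_nodes)
    out_links = [set() for _ in range(n_nodes)]

    def add_link(source, t, n_existing):
        """Let `source` link to one of the nodes 0..n_existing-1, drawn with Pi_in."""
        age = t - tau[:n_existing]
        w = (k_in[:n_existing] + 1.0) * eta[:n_existing] * np.exp(-age / theta_r)
        if source < n_existing:
            w[source] = 0.0                           # no self loops
        if out_links[source]:
            w[list(out_links[source])] = 0.0          # no multiple links
        if w.sum() <= 0.0:
            return                                    # no admissible target left
        target = int(rng.choice(n_existing, p=w / w.sum()))
        out_links[source].add(target)
        k_in[target] += 1

    for t in range(1, n_nodes):
        add_link(t, t, n_existing=t)                  # step 1: new node links to an older one
        if t > 10:                                    # step 2: m nodes become active
            age = t - tau[:t + 1]
            w_out = activity[:t + 1] * np.exp(-age / theta_a)
            active = rng.choice(t + 1, size=m, p=w_out / w_out.sum())
            for source in active:
                add_link(int(source), t, n_existing=t + 1)

    edges = [(s, d) for s in range(n_nodes) for d in sorted(out_links[s])]
    return edges, tau, eta

edges, tau, eta = grow_rm_network(n_nodes=60, seed=1)
print(len(edges), "links among", len(tau), "nodes")
```

Storing the outgoing links of each node in a set makes the two constraints (no multiple links, no self loops) easy to enforce by zeroing the corresponding weights before sampling.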

Effects of relevance decay in the RM
Slow or absent relevance decay
recent nodes receive few links because of preferential
attachment
PageRank’s bias towards old nodes in scale-free networks
Fast relevance decay
preferential attachment compensated by decay of relevance of
old nodes
recent nodes can reach high indegree
recent nodes mostly point to other recent nodes, because of
relevance decay of older nodes
old nodes point to nodes of every age because of activity

What makes a ranking algorithm “good”?
A good ranking algorithm is expected to produce
an unbiased ranking where both recent and old nodes
have the same chance to appear at the top.
In growing networks with temporal effects, PageRank can fail
to achieve this.
Let’s compare PageRank with the elementary indegree ranking.

PageRank time bias: numerical simulation of RM
[Plot: average τ of the top 1% nodes (0 to 10000) versus θ_R (logarithmic axis, 10 to 10000); curves: indegree, PageRank, no bias. Annotations: relevance decay runs from fast to slow, age of top nodes from recent to old.]
Figure 2. PageRank time bias. We show here the average entrance time τ of the top 1% nodes of the node
ranking by indegree and PageRank, respectively, as a function of the relevance decay parameter $\theta_R$. Networks
of N = 10000 nodes are grown with the RM with slow decay of activity ($\theta_A = N$). Two limits of PageRank
bias are visible: (1) When the decay of relevance is fast ($\theta_R \ll \theta_A$), a large number of top nodes are recent as
a consequence of the network structure demonstrated in Fig. 1; (2) When the decay of relevance is slow
($\theta_R \sim N$), top nodes are old because the old nodes can be pointed by nodes of every age. While the latter
bias is common to PageRank and indegree, the former bias is specific to PageRank because of its network
nature.
Relevance decays as $f_R(t) = \exp(-t/\theta_R)$.
Activity decays exponentially, but very slowly ($\theta_A = N$).
N = 10000
Figure: Average entrance time τ of the top 1% of nodes in the PageRank and
indegree rankings, in the RM.
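The quantity plotted in this figure can be computed directly from a score vector and the entrance times of the nodes. The sketch below is one straightforward way to do it; the function name `mean_entrance_time_of_top` is hypothetical, and ties are broken by node order, which is an arbitrary choice.

```python
import numpy as np

def mean_entrance_time_of_top(scores, tau, fraction=0.01):
    """Average entrance time tau of the top `fraction` of nodes ranked by score."""
    scores = np.asarray(scores, dtype=float)
    tau = np.asarray(tau, dtype=float)
    n_top = max(1, int(round(fraction * len(scores))))
    top = np.argsort(-scores, kind="stable")[:n_top]   # indices of the highest scores
    return tau[top].mean()

n = 1000
tau = np.arange(n)
biased_to_old = mean_entrance_time_of_top(n - tau, tau)   # score decreases with tau
biased_to_recent = mean_entrance_time_of_top(tau, tau)    # score increases with tau
print(biased_to_old, biased_to_recent, "unbiased reference:", n / 2)
```

A ranking without time bias gives a value close to N/2, while values far below or above N/2 signal a bias towards old or recent nodes respectively.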

PageRank vs. indegree: correlation with fitness in RM
[...] in this limit as the average entrance time τ of the top-1% nodes is close to N/2 = 5000, which corresponds to the absence of time bias.
Figure 3. A comparison of performance of PageRank and indegree in the RM data (N = 10,000,
$\rho(\eta) = \exp(-\eta)$). The heatmap shows the ratio $r(p, \eta)/r(k^{\text{in}}, \eta)$. The black dotted line represents the
contour along which PageRank is not temporally biased (see Fig. S6, left). The upward bending of this
contour is a finite-size effect.
$r(p, \eta)$: correlation between PageRank and fitness
$r(k^{\text{in}}, \eta)$: correlation between indegree and fitness
$\rho(\eta) = \exp(-\eta)$
$\rho(A) = 2A^{-3}$, $A \in [1, \infty)$
Figure: Comparison of performance of PageRank and indegree (RM data).
PageRank yields no improvement with respect to indegree.
Diagonal: no temporal bias towards recent or old nodes.
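The two distributions and the heatmap ratio can be reproduced with a few lines. This sketch assumes that r denotes the Pearson correlation coefficient, and it samples activity by inverting the cumulative distribution of ρ(A) = 2A^−3, which is F(A) = 1 − A^−2 on [1, ∞). The names `correlation_ratio`, `eta` and `activity` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Fitness: rho(eta) = exp(-eta), i.e. an exponential distribution with mean 1.
eta = rng.exponential(1.0, n)

# Activity: rho(A) = 2 A^-3 on [1, inf). Solving u = 1 - A^-2 for A gives
# A = (1 - u)^(-1/2) with u uniform in [0, 1).
activity = (1.0 - rng.random(n)) ** -0.5

def correlation_ratio(pagerank_scores, indegree, fitness):
    """Ratio r(p, eta) / r(k_in, eta), with r taken as the Pearson correlation."""
    r_p = np.corrcoef(pagerank_scores, fitness)[0, 1]
    r_k = np.corrcoef(indegree, fitness)[0, 1]
    return r_p / r_k

print(eta.mean(), np.median(activity))
```

A ratio below one means that PageRank correlates with fitness less well than indegree does.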

Real networks studied in the paper
Real networks studied (directed, unweighted):
Digg.com: social bookmarking site
Nodes: Digg users
Edges: $a_{ij} = 1$ ⇔ “i is a follower of j”.
N = 190 553; L = 1 552 905
American Physical Society (APS) articles and citation network
Nodes: papers (from 1893 to 2009)
Edges: $a_{ij} = 1$ ⇔ “i cites j”
N = 450 056; L = 4 690 967
Estimator of node fitness: total relevance of node i
$T_i = \sum_t r_i(t)$
To validate the hypothesis of relevance and activity decay:
measurement of empirical relevance (see appendix).

PageRank performance on real data
[Bar chart: correlation with total relevance T (0 to 0.6) of indegree and PageRank for Digg.com, Digg-calibrated RM, APS, and APS-calibrated RM.]
Figure 5. A comparison of PageRank and indegree correlation with total relevance in real data and in
calibrated simulations with the RM. PageRank is outperformed by indegree in both datasets (and in the
corresponding calibrated simulations). In the Digg.com social network, the fitted relevance and activity
power-law decay exponents are not far from the parameter region where PageRank is maximally correlated
with indegree in numerical simulations with the RM with power-law decay (see Fig. S10), and PageRank’s
and indegree’s correlation with total relevance are close to each other. By contrast, in the APS dataset activity
decays immediately, whereas relevance decays progressively (see Fig. S2); as a consequence, PageRank is
strongly biased towards old nodes (see Fig. S3) and is outperformed by indegree by a factor 2.58
[$r(p, T) = 0.19$ whereas $r(k^{\text{in}}, T) = 0.49$]. We refer to the Supplementary Note S3 for details about the [...]
Digg.com:
fitted activity and relevance decay exponents are close to the
region where PageRank is maximally correlated with indegree
in RM simulations with power-law decay.
APS:
activity decays immediately,
relevance decays progressively.
Figure: Comparison of PageRank and indegree correlation with total
relevance Ti in real data. APS: PageRank strongly biased towards old
nodes, because papers can only be cited by more recent papers.

Important findings
PageRank can underperform w.r.t. indegree ranking
Mismatch between relevance and activity decay timescales
leads to time bias in PageRank:
towards recent nodes if decay of relevance is faster
towards old nodes if decay of activity is faster
Findings are robust with respect to:
form of decay function
distribution of fitness among the nodes
metric used to evaluate the algorithm
Link timestamps are crucial for this analysis
Method cannot be applied to undirected networks

In conclusion. . .
PageRank, despite its popularity and robustness, can fail
and thus it should not be used without carefully
considering the temporal properties of the system to
which it is to be applied.

Bibliography
Mariani M. S., Medo M., Zhang Y.-C.
Ranking nodes in growing networks: When PageRank fails.
Scientific Reports 5, 16181 (2015).
doi: 10.1038/srep16181
Bianconi G., Barabási A.-L.
Competition and multiscaling in evolving networks.
Europhysics Letters 54, 436-442 (2001).
doi: 10.1209/epl/i2001-00260-6
Medo M., Cimini G., Gualdi S.
Temporal effects in the growth of networks.
Physical Review Letters 107, 238701 (2011).

Outline
Empirical relevance
The Extended Fitness Model

Empirical relevance: definition
The empirical relevance $r_i(t)$ of node i at time t is defined as:
$r_i(t) = \frac{n_i(t)}{n_i^{\text{PA}}(t)}$
$n_i(t) = \frac{\Delta k_i^{\text{in}}(t, \Delta t)}{L(t, \Delta t)}$: ratio between:
$\Delta k_i^{\text{in}}(t, \Delta t)$: number of incoming links received by node i in the time
window [t, t + ∆t]
$L(t, \Delta t)$: total number of links created within the same time window
$n_i^{\text{PA}}(t) = \frac{k_i^{\text{in}}(t)}{\sum_j k_j^{\text{in}}(t)}$: expected value of $n_i(t)$ according to
preferential attachment alone
$r_i(t) > 1$ ($< 1$): node i at time t outperforms (underperforms) in
the competition for incoming links with respect to its preferential
attachment weight.
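The definition above translates into a short computation on a list of timestamped links. The sketch below is illustrative only: it assumes links are given as (time, source, target) tuples, it skips nodes whose indegree at time t is zero (their preferential attachment weight is zero, so the ratio is undefined), and it approximates the total relevance T_i by summing r_i(t) over a user-supplied list of non-overlapping window start times. The function names and the toy data are hypothetical.

```python
from collections import Counter

def empirical_relevance(edges, t, dt):
    """Return {node: r_i(t)} for links given as (time, source, target) tuples."""
    k_in = Counter()     # indegree accumulated strictly before time t
    gained = Counter()   # incoming links received in the window [t, t + dt]
    for time, _source, target in edges:
        if time < t:
            k_in[target] += 1
        elif time <= t + dt:
            gained[target] += 1
    total_new = sum(gained.values())   # L(t, dt)
    total_k = sum(k_in.values())       # sum over j of k_j_in(t)
    if total_new == 0:
        return {}
    relevance = {}
    for node, k in k_in.items():       # only nodes with k_in > 0 appear here
        n_i = gained[node] / total_new
        n_pa = k / total_k
        relevance[node] = n_i / n_pa
    return relevance

def total_relevance(edges, window_starts, dt):
    """Approximate T_i by summing r_i(t) over the given window start times."""
    totals = Counter()
    for t in window_starts:
        for node, r in empirical_relevance(edges, t, dt).items():
            totals[node] += r
    return dict(totals)

links = [(1, "b", "a"), (2, "c", "a"), (3, "c", "b"), (4, "d", "b"),
         (5, "e", "b"), (6, "f", "b"), (7, "f", "a")]
r = empirical_relevance(links, t=4, dt=3)
T = total_relevance(links, window_starts=[2, 5], dt=2)
print(r, T)
```

In the toy data node "b" has relevance above one in the window starting at t = 4, because it receives three of the four new links while holding only one third of the existing indegree.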

The Extended Fitness Model
PageRank’s under-performance in time-dependent
networks is a general feature.
We can validate this using a model more compatible with the
idea that a node is important if it is pointed to by other important
nodes.
Extended Fitness Model (EFM)
High-fitness nodes are more sensitive to fitness than
low-fitness nodes when choosing their outgoing links.
High-fitness nodes are then more likely than low-fitness nodes
to be pointed to by other high-fitness nodes.
EFM is more favorable to PageRank than RM.

EFM: sensitivity to fitness
Probability $\Pi_{i;j}^{\text{in}}(t)$ that a link created by node j at time t ends in
node i:
$\Pi_{i;j}^{\text{in}}(t) \sim (k_i^{\text{in}}(t) + 1)^{1 - \eta_j}\, \eta_i^{\eta_j}\, f_R(t - \tau_i)$
Fitness $\eta \in [0, 1]$ to prevent negative exponents
$\Pi^{\text{in}}$ depends on the fitness of the target and of the source
nodes (difference with RM).
$k_i^{\text{in}}(t)$: indegree of node i at time t.
PageRank vs. indegree: correlation with fitness in EFM
[...] with indegree along the $\theta_R$-$\theta_A$ diagonal where PageRank is not temporally biased and $r(p, \eta)/r(k^{\text{in}}, \eta)$
becomes close to, albeit always strictly lower than, one. When moving away from this diagonal, PageRank [...]
Figure 4. A comparison of PageRank and indegree correlation with fitness in the EFM data (N = 10,000,
H = 250). The heatmap shows the ratio $r(p, q)/r(k^{\text{in}}, q)$. The white dotted line represents the contour where
PageRank is not temporally biased (see Fig. S6, right).
$\rho(A) = 2A^{-3}$, $A \in [1, \infty)$
H nodes: high fitness; $\eta \in [10^{-5}, 1]$
(N − H) nodes: low fitness; $\eta \in [0, 10^{-5}]$
N = 10000, H = 250
Figure: Comparison of performance of PageRank and indegree (EFM
data).
