SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Downloaden Sie, um offline zu lesen
Ranking nodes in growing networks:
When PageRank fails
Pietro De Nicolao
pietro.denicolao@mail.polimi.it
Politecnico di Milano
April 14, 2016
Outline
Introduction
A growing network model: the Relevance Model
Real data analysis
Conclusions
Outline
Introduction
A growing network model: the Relevance Model
Real data analysis
Conclusions
Paper
www.nature.com/scientificreports
Ranking nodes in growing
networks: When PageRank fails
Manuel Sebastian Mariani1
, Matúš Medo1
& Yi-Cheng Zhang1,2
PageRank is arguably the most popular ranking algorithm which is being applied in real systems
ranging from information to biological and infrastructure networks. Despite its outstanding
popularity and broad use in different areas of science, the relation between the algorithm’s efficacy
and properties of the network on which it acts has not yet been fully understood. We study here
PageRank’s performance on a network model supported by real data, and show that realistic
temporal effects make PageRank fail in individuating the most valuable nodes for a broad range of
model parameters. Results on real data are in qualitative agreement with our model-based findings.
This failure of PageRank reveals that the static approach to information filtering is inappropriate for
a broad class of growing systems, and suggest that time-dependent algorithms that are based on the
temporal linking patterns of these systems are needed to better rank the nodes.
received: 06 June 2015
accepted: 10 August 2015
Published: 10 November 2015
OPEN
Published on Nature Scientific Reports on 10 November 2015.
PageRank: recap
Most popular ranking algorithm for unipartite directed
networks.
Invented for Google’s search algorithm
Also used for the ranking of:
scholarly papers
images in search
urban roads according to traffic flow
proteins in their interaction network
etc.
A node is important if it is pointed by other important
nodes.
pij = (1 − γ)
wij
sout
i
+ γ
1
N
One PageRank fits all?
What is the relation between PageRank’s efficacy and the
properties of the network?
PageRank: static approach
PageRank discards temporal information
works as if nodes appear all at the same time
well-known bias towards old nodes
Theoretical models and real networks can exhibit
strong temporal patterns.
Are there circumstances under which
the algorithm is doomed to fail?
Outline
Introduction
A growing network model: the Relevance Model
Real data analysis
Conclusions
The Relevance Model (RM)
What is the Relevance Model?
growing directed network model
with preferential attachment and relevance
generalizes the classical Barabási-Albert model
introduced by [Medo, 2011]
Why using RM?
model that best explains the linking patterns in real networks
used to model WWW, citation and technological networks
Relevance Model features
1. Preferential attachment
similar to the Barabási-Albert model
Matthew effect: the rich get richer
significant difference: existing nodes also create new links
2. Fitness
quality parameter assigned to each node
node’s inherent competence in attracting new incoming links
concept formerly explored in [Bianconi-Barabási, 2001]
3. Relevance and activity
Relevance: capacity of attracting new links over time
Activity: rate at which the node generates new outgoing links
4. Temporal decay
Relevance and activity both decay with time
Monotonous function of choice (exponential, power law)
Real-world phenomenon: nodes lose relevance over time
Relevance decay: a real-world example
0.01
0.1
1
10
100
1000
5 10 15 20 25 30 35 40 45 50
relevance
years after publication
indegree>100
10<=indegree<100
indegree<10
emporal decay of the average relevance r(t) in the APS datas
color online). Symbols represent the average relevance of nodes belongi
or bars represent the error of the mean, lines represent the fits described
on-monotonous part of the relavance profile is ignored by the fitting
Figure: Temporal decay of the average (empirical) relevance r(t) of
papers in the American Physical Society citation network (1893-2009).
This behaviour has been formerly highlighted in [Medo, 2011].
How to build a network with the Relevance Model
At each discrete time interval t, the generation algorithm
proceeds as follows:
1. a new node is created and connected to an existing node i,
chosen with probability Πin
i (t).
2. If t > 10, then m = 10 existing nodes are sequentially chosen
with probability Πout
i (t) and become active:
each selected node creates one outgoing link
it selects a node j as a target with probability Πin
j (t)
No multiple links.
No self loops.
Relevance Model: Link generation mechanism
The probability for the node i to be the target of a new link created
at time t is:
Πin
i (t) ∼ (kin
i (t) + 1) ηi fR(t − τi)
kin
i (t): current indegree of node i
ηi: fitness of i
τi: time at which i enters the network
fR: monotonously decaying function of time
Ri(t) := ηifR(t − τi): relevance of node i at time t.
Relevance Model: Active nodes selection
In the RM, nodes continue being active and generate outgoing links
continually.
Probability for node i to be chosen as an active node at time t:
Πout
i (t) ∼ Ai fA(t − τi)
Ai: activity parameter
τi: time at which i enters the network
fA: monotonously decaying function of time
Effects of relevance decay in the RM
Slow or absent relevance decay
recent nodes receive few links because of preferential
attachment
PageRank’s bias towards old nodes in scale-free networks
Fast relevance decay
preferential attachment compensated by decay of relevance of
old nodes
recent nodes can reach high indegree
recent nodes mostly point to other recent nodes, because of
relevance decay of older nodes
old nodes point to nodes of every age because of activity
What makes a ranking algorithm “good”?
A good ranking algorithm is expected to produce
an unbiased ranking where both recent and old nodes
have the same chance to appear at the top.
In growing networks with temporal effects, PageRank can fail
to achieve this.
Let’s compare PageRank with the elementary indegree ranking.
PageRank time bias: numerical simulation of RM
om/scientificreports/
0
2000
4000
6000
8000
10000
10 100 1000 10000
averageτoftop1%nodes
θR
indegree
pageRank
no bias
Figure 2. PageRank time bias. We show here the average entrance time τ of the top 1% nodes of the node
ranking by indegree and PageRank, respectively, as a function of the relevance decay parameter θR. Networks
of =N 10000 nodes are grown with the RM with slow decay of activity (θ = NA ). Two limits of PageRank
bias are visible: (1) When the decay of relevance is fast θ θ( )R A , a large number of top nodes are recent as
a consequence of the network structure demonstrated in Fig. 1; (2) When the decay of relevance is slow
(θ ∼ NR ), top nodes are old because the old nodes can be pointed by nodes of every age. While the latter
bias is common to PageRank and indegree, the former bias is specific to PageRank because of its network
nature.
age
relevance
decayfast slow
oldrecent
Relevance decays as
fR(t) = exp(− t
θR
).
Activity decays
exponentially, but very
slowly (θA = N).
N = 10000
Figure: Average time of entrance of 1% of nodes of PageRank and
indegree rankings, in the RM model.
PageRank vs. indegree: correlation with fitness in RM
in this limit as the average entrance time τ of the top-1% nodes is close to / =N 2 5000 which corre-
sponds to the absence of time bias.
(θ ∼ NR ), top nodes are old because the old nodes can be pointed by nodes of every age. While the latter
bias is common to PageRank and indegree, the former bias is specific to PageRank because of its network
nature.
Figure 3. A comparison of performance of PageRank and indegree in the RM data (N=10,000.
ρ(η)=exp(−η). The heatmap shows the ratio η η( , )/ ( , )r p r kin
. The black dotted line represents the
contour along which PageRank is not temporally biased (see Fig. S6, left). The upward bending of this
contour is a finite-size effect.
r(p, η): correlation
PageRank-fitness
r(kin
, η): correlation
indegree-fitness
ρ(η) = exp(−η)
ρ(A) = 2A−3
, A ∈ [1, ∞]
Figure: Comparison of performance of PageRank and indegree (RM data).
PageRank yields no improvement with respect to indegree.
Diagonal: no temporal bias towards recent or old nodes.
Outline
Introduction
A growing network model: the Relevance Model
Real data analysis
Conclusions
Real networks studied in the paper
Real networks studied (directed, unweighted):
Digg.com: social bookmarking site
Nodes: Digg users
Edges: aij = 1 ⇔ “i is a follower of j”.
N = 190 553; L = 1 552 905
American Physical Society (APS) articles and citation network
Nodes: papers (from 1893 to 2009)
Edges: aij = 1 ⇔ “i cites j”
N = 450 056; L = 4 690 967
Estimator of node fitness: total relevance of node i
Ti =
t
ri(t)
To validate hypothesis of relevance and activity decay:
measurement of empirical relevance (see appendix).
PageRank performance on real data
cientificreports/
0
0.1
0.2
0.3
0.4
0.5
0.6
Digg.com Digg-calibrated RM APS APS-calibrated RM
correlationwithtotalrelevanceT
indegree
PageRank
Figure 5. A comparison of PageRank and indegree correlation with total relevance in real data and in
calibrated simulations with the RM. PageRank is outperformed by indegree in both datasets (and in the
corresponding calibrated simulations). In the Digg.com social network, the fitted relevance and activity
power-law decay exponents are not far from the parameter region where PageRank is maximally correlated
with indegree in numerical simulations with the RM with power-law decay (see Fig. S10), and PageRank’s
and indegree’s correlation with total relevance are close to each other. By contrast, in the APS dataset activity
decays immediately, whereas relevance decays progressively (see Fig. S2); as a consequence, PageRank is
strongly biased towards old nodes (see Fig. S3) and is outperformed by indegree by a factor 2.58
[ ( , ) = .r p T 0 19 whereas ( , ) = .r k T 0 49in
]. We refer to the Supplementary Note S3 for details about the
Digg.com: activity and
relevance decay s.t.
PageRank is maximally
correlated with indegree in
RM simulations with
power-law decay.
APS:
activity decays immediately,
relevance decays progressively.
Figure: Comparison of PageRank and indegree correlation with total
relevance Ti in real data. APS: PageRank strongly biased towards old
nodes, because papers can only be cited by more recent papers.
Outline
Introduction
A growing network model: the Relevance Model
Real data analysis
Conclusions
Important findings
PageRank can underperform w.r.t. indegree ranking
Mismatch between relevance and activity decay timescales
leads to time bias in PageRank:
towards recent nodes if decay of relevance is faster
towards old nodes if decay of activity is faster
Findings are robust with respect to:
form of decay function
distribution of fitness among the nodes
metric used to evaluate the algorithm
Link timestamps are crucial for this analysis
Method can not be applied to undirected networks
In conclusion. . .
PageRank, despite its popularity and robustness, can fail
and thus it should not be used without carefully
considering the temporal properties of the system to
which it is to be applied.
Bibliography
Mariani M. S., Medo M., Zhang Y.
Ranking nodes in growing networks: When PageRank fails.
Scientific Reports 5, 16181;
doi: 10.1038/srep16181 (2015).
G. Bianconi, A. L. Barabási
Competition and Multiscaling in evolving networks
Europhysics Letters, Vol. 54 (2001), pp. 436-442
doi:10.1209/epl/i2001-00260-6
Medo M., Cimini G., Gualdi S.
Temporal Effects in the Growth of Networks
Phys. Rev. Lett. 107, 238701 (2011-12-01)
Outline
Empirical relevance
The Extended Fitness Model
Empirical relevance: definition
The empirical relevance ri(t) of node i at time t is defined as:
ri(t) =
ni(t)
nPA
i (t)
ni(t) =
∆kin
i (t,∆t)
L(t,∆t) : ratio between:
∆kin
i (t, ∆t): # of incoming links received by node i in the time
window [t, t + ∆t]
L(t, ∆t): total # of links created within the same time window
nPA
i (t) =
kin
i (t)
j kin
j (t)
: expected value of ni(t) according to
preferential attachment alone
ri(t) > 1 (< 1): node i at time t outperforms (underperforms) in
the competition for incoming links with respect to its preferencial
attachment weight.
Outline
Empirical relevance
The Extended Fitness Model
The Extended Fitness Model
PageRank’s under-performance in time-dependent
networks is a general feature.
We can validate this using a model more compatible with the
idea that a node is important if it’s pointed by other important
nodes.
Extended Fitness Model (EFM)
High-fitness nodes are more sensitive to fitness than
low-fitness nodes, when choosing their outgoing links.
High-fitness nodes are then more likely to be pointed by other
high-fitness nodes than low-fitness nodes.
EFM is more favorable to PageRank than RM.
EFM: sensitivity to fitness
Probability Πin
i;j(t) that a link created by node j at time t ends in
node i:
Πin
i;j(t) ∼ (kin
i (t) + 1)1−ηj
η
ηj
i fR(t − τi)
Fitness η ∈ [0, 1] to prevent negative exponents
Πin depends on the fitness of the target and of the source
nodes (difference with RM).
kin
i (t): indegree of node i at time t.
PageRank vs. indegree: correlation with fitness in EFM
scientificreports/
with indegree along theθ θ−R A diagonal where PageRank is not temporally biased and η η( , )/ ( , )r p r kin
becomes close to, albeit always strictly lower than, one. When moving away from this diagonal, PageRank
Figure 4. A comparison of PageRank and indegree correlation with fitness in the EFM data (N=10,000,
H=250). The heatmap shows the ratio ( , )/ ( , )r p q r k qin
. The white dotted line represents the contour where
PageRank is not temporally biased (see Fig. S6, right).
ρ(A) = 2A−3
, A ∈ [1, ∞]
H nodes: high fitness;
η ∈ [10−5
, 1]
(N − H) nodes: low
fitness; η ∈ [0, 10−5
]
N = 10000, H = 250
Figure: Comparison of performance of PageRank and indegree (EFM
data).

Weitere ähnliche Inhalte

Was ist angesagt?

Mtp ppt soumya_sarkar
Mtp ppt soumya_sarkarMtp ppt soumya_sarkar
Mtp ppt soumya_sarkar
samarai_apoc
 
A new approach to erd s collaboration network using page rank
A new approach to erd s collaboration network using page rankA new approach to erd s collaboration network using page rank
A new approach to erd s collaboration network using page rank
Alexander Decker
 
Topic Set Size Design with the Evaluation Measures for Short Text Conversation
Topic Set Size Design with the Evaluation Measures for Short Text ConversationTopic Set Size Design with the Evaluation Measures for Short Text Conversation
Topic Set Size Design with the Evaluation Measures for Short Text Conversation
Tetsuya Sakai
 
MMP-TREE FOR SEQUENTIAL PATTERN MINING WITH MULTIPLE MINIMUM SUPPORTS IN PROG...
MMP-TREE FOR SEQUENTIAL PATTERN MINING WITH MULTIPLE MINIMUM SUPPORTS IN PROG...MMP-TREE FOR SEQUENTIAL PATTERN MINING WITH MULTIPLE MINIMUM SUPPORTS IN PROG...
MMP-TREE FOR SEQUENTIAL PATTERN MINING WITH MULTIPLE MINIMUM SUPPORTS IN PROG...
IJCSEA Journal
 
BigData_MultiDimensional_CaseStudy
BigData_MultiDimensional_CaseStudyBigData_MultiDimensional_CaseStudy
BigData_MultiDimensional_CaseStudy
vincentlaulagnet
 

Was ist angesagt? (17)

Data mining projects topics for java and dot net
Data mining projects topics for java and dot netData mining projects topics for java and dot net
Data mining projects topics for java and dot net
 
Cg4201552556
Cg4201552556Cg4201552556
Cg4201552556
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
 
Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...
Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...
Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...
 
Mtp ppt soumya_sarkar
Mtp ppt soumya_sarkarMtp ppt soumya_sarkar
Mtp ppt soumya_sarkar
 
Spatial approximate string search
Spatial approximate string searchSpatial approximate string search
Spatial approximate string search
 
Algorithm of Dynamic Programming for Paper-Reviewer Assignment Problem
Algorithm of Dynamic Programming for Paper-Reviewer Assignment ProblemAlgorithm of Dynamic Programming for Paper-Reviewer Assignment Problem
Algorithm of Dynamic Programming for Paper-Reviewer Assignment Problem
 
A new approach to erd s collaboration network using page rank
A new approach to erd s collaboration network using page rankA new approach to erd s collaboration network using page rank
A new approach to erd s collaboration network using page rank
 
PageRank and Markov Chain
PageRank and Markov ChainPageRank and Markov Chain
PageRank and Markov Chain
 
Topic Set Size Design with the Evaluation Measures for Short Text Conversation
Topic Set Size Design with the Evaluation Measures for Short Text ConversationTopic Set Size Design with the Evaluation Measures for Short Text Conversation
Topic Set Size Design with the Evaluation Measures for Short Text Conversation
 
A Comparison of Supervised Learning Classifiers for Link Discovery
A Comparison of Supervised Learning Classifiers for Link DiscoveryA Comparison of Supervised Learning Classifiers for Link Discovery
A Comparison of Supervised Learning Classifiers for Link Discovery
 
IEEE Datamining 2016 Title and Abstract
IEEE  Datamining 2016 Title and AbstractIEEE  Datamining 2016 Title and Abstract
IEEE Datamining 2016 Title and Abstract
 
MMP-TREE FOR SEQUENTIAL PATTERN MINING WITH MULTIPLE MINIMUM SUPPORTS IN PROG...
MMP-TREE FOR SEQUENTIAL PATTERN MINING WITH MULTIPLE MINIMUM SUPPORTS IN PROG...MMP-TREE FOR SEQUENTIAL PATTERN MINING WITH MULTIPLE MINIMUM SUPPORTS IN PROG...
MMP-TREE FOR SEQUENTIAL PATTERN MINING WITH MULTIPLE MINIMUM SUPPORTS IN PROG...
 
IRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text Rank
 
BigData_MultiDimensional_CaseStudy
BigData_MultiDimensional_CaseStudyBigData_MultiDimensional_CaseStudy
BigData_MultiDimensional_CaseStudy
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)
 

Andere mochten auch

PageRank Increase under Different Collusion Topologies (AIRWEB 2005)
PageRank Increase under Different Collusion Topologies (AIRWEB 2005)PageRank Increase under Different Collusion Topologies (AIRWEB 2005)
PageRank Increase under Different Collusion Topologies (AIRWEB 2005)
Carlos Castillo (ChaTo)
 
PageRank Algorithm In data mining
PageRank Algorithm In data miningPageRank Algorithm In data mining
PageRank Algorithm In data mining
Mai Mustafa
 

Andere mochten auch (8)

PageRank Increase under Different Collusion Topologies (AIRWEB 2005)
PageRank Increase under Different Collusion Topologies (AIRWEB 2005)PageRank Increase under Different Collusion Topologies (AIRWEB 2005)
PageRank Increase under Different Collusion Topologies (AIRWEB 2005)
 
TriggerScope: Towards Detecting Logic Bombs in Android Applications
TriggerScope: Towards Detecting Logic Bombs in Android ApplicationsTriggerScope: Towards Detecting Logic Bombs in Android Applications
TriggerScope: Towards Detecting Logic Bombs in Android Applications
 
Norme e vita scolastica del Liceo Einstein
Norme e vita scolastica del Liceo EinsteinNorme e vita scolastica del Liceo Einstein
Norme e vita scolastica del Liceo Einstein
 
LO-PHI: Low-Observable Physical Host Instrumentation for Malware Analysis
LO-PHI: Low-Observable Physical Host Instrumentation for Malware AnalysisLO-PHI: Low-Observable Physical Host Instrumentation for Malware Analysis
LO-PHI: Low-Observable Physical Host Instrumentation for Malware Analysis
 
The Filter Bubble: a threat for plural information?
The Filter Bubble: a threat for plural information?The Filter Bubble: a threat for plural information?
The Filter Bubble: a threat for plural information?
 
Orazio - Sermones I, IX
Orazio - Sermones I, IXOrazio - Sermones I, IX
Orazio - Sermones I, IX
 
PageRank Algorithm In data mining
PageRank Algorithm In data miningPageRank Algorithm In data mining
PageRank Algorithm In data mining
 
Pagerank Algorithm Explained
Pagerank Algorithm ExplainedPagerank Algorithm Explained
Pagerank Algorithm Explained
 

Ähnlich wie Ranking nodes in growing networks: when PageRank fails

cs224w-79-final
cs224w-79-finalcs224w-79-final
cs224w-79-final
Darren Koh
 
Predicting_new_friendships_in_social_networks
Predicting_new_friendships_in_social_networksPredicting_new_friendships_in_social_networks
Predicting_new_friendships_in_social_networks
Anvardh Nanduri
 
User_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxUser_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docx
dickonsondorris
 
A new approach to erd s collaboration network using page rank
A new approach to erd s collaboration network using page rankA new approach to erd s collaboration network using page rank
A new approach to erd s collaboration network using page rank
Alexander Decker
 

Ähnlich wie Ranking nodes in growing networks: when PageRank fails (20)

Scale-Free Networks to Search in Unstructured Peer-To-Peer Networks
Scale-Free Networks to Search in Unstructured Peer-To-Peer NetworksScale-Free Networks to Search in Unstructured Peer-To-Peer Networks
Scale-Free Networks to Search in Unstructured Peer-To-Peer Networks
 
EVOLUTIONARY CENTRALITY AND MAXIMAL CLIQUES IN MOBILE SOCIAL NETWORKS
EVOLUTIONARY CENTRALITY AND MAXIMAL CLIQUES IN MOBILE SOCIAL NETWORKSEVOLUTIONARY CENTRALITY AND MAXIMAL CLIQUES IN MOBILE SOCIAL NETWORKS
EVOLUTIONARY CENTRALITY AND MAXIMAL CLIQUES IN MOBILE SOCIAL NETWORKS
 
FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK
FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANKFINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK
FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK
 
EVOLUTIONARY CENTRALITY AND MAXIMAL CLIQUES IN MOBILE SOCIAL NETWORKS
EVOLUTIONARY CENTRALITY AND MAXIMAL CLIQUES IN MOBILE SOCIAL NETWORKSEVOLUTIONARY CENTRALITY AND MAXIMAL CLIQUES IN MOBILE SOCIAL NETWORKS
EVOLUTIONARY CENTRALITY AND MAXIMAL CLIQUES IN MOBILE SOCIAL NETWORKS
 
Sub-Graph Finding Information over Nebula Networks
Sub-Graph Finding Information over Nebula NetworksSub-Graph Finding Information over Nebula Networks
Sub-Graph Finding Information over Nebula Networks
 
Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear...
Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear...Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear...
Social Network Analysis: What It Is, Why We Should Care, and What We Can Lear...
 
001 20151005 ranking_nodesingrowingnetwork
001 20151005 ranking_nodesingrowingnetwork001 20151005 ranking_nodesingrowingnetwork
001 20151005 ranking_nodesingrowingnetwork
 
Centrality Prediction in Mobile Social Networks
Centrality Prediction in Mobile Social NetworksCentrality Prediction in Mobile Social Networks
Centrality Prediction in Mobile Social Networks
 
Research Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and ScienceResearch Inventy : International Journal of Engineering and Science
Research Inventy : International Journal of Engineering and Science
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity Resolution
 
Ijetcas14 347
Ijetcas14 347Ijetcas14 347
Ijetcas14 347
 
cs224w-79-final
cs224w-79-finalcs224w-79-final
cs224w-79-final
 
Predicting_new_friendships_in_social_networks
Predicting_new_friendships_in_social_networksPredicting_new_friendships_in_social_networks
Predicting_new_friendships_in_social_networks
 
08 Exponential Random Graph Models (2016)
08 Exponential Random Graph Models (2016)08 Exponential Random Graph Models (2016)
08 Exponential Random Graph Models (2016)
 
08 Exponential Random Graph Models (ERGM)
08 Exponential Random Graph Models (ERGM)08 Exponential Random Graph Models (ERGM)
08 Exponential Random Graph Models (ERGM)
 
EFFICIENTLY PROCESSING OF TOP-K TYPICALITY QUERY FOR STRUCTURED DATA
EFFICIENTLY PROCESSING OF TOP-K TYPICALITY QUERY FOR STRUCTURED DATAEFFICIENTLY PROCESSING OF TOP-K TYPICALITY QUERY FOR STRUCTURED DATA
EFFICIENTLY PROCESSING OF TOP-K TYPICALITY QUERY FOR STRUCTURED DATA
 
PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.PhD Consortium ADBIS presetation.
PhD Consortium ADBIS presetation.
 
IRJET- Link Prediction in Social Networks
IRJET- Link Prediction in Social NetworksIRJET- Link Prediction in Social Networks
IRJET- Link Prediction in Social Networks
 
User_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docxUser_42751212015Module1and2pagestocompetework.pdf.docx
User_42751212015Module1and2pagestocompetework.pdf.docx
 
A new approach to erd s collaboration network using page rank
A new approach to erd s collaboration network using page rankA new approach to erd s collaboration network using page rank
A new approach to erd s collaboration network using page rank
 

Kürzlich hochgeladen

biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
Sérgio Sacani
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Areesha Ahmad
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
ssuser79fe74
 

Kürzlich hochgeladen (20)

CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
IDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicineIDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicine
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATIONSTS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 

Ranking nodes in growing networks: when PageRank fails

  • 1. Ranking nodes in growing networks: When PageRank fails Pietro De Nicolao pietro.denicolao@mail.polimi.it Politecnico di Milano April 14, 2016
  • 2. Outline Introduction A growing network model: the Relevance Model Real data analysis Conclusions
  • 3. Outline Introduction A growing network model: the Relevance Model Real data analysis Conclusions
  • 4. Paper www.nature.com/scientificreports Ranking nodes in growing networks: When PageRank fails Manuel Sebastian Mariani1 , Matúš Medo1 & Yi-Cheng Zhang1,2 PageRank is arguably the most popular ranking algorithm which is being applied in real systems ranging from information to biological and infrastructure networks. Despite its outstanding popularity and broad use in different areas of science, the relation between the algorithm’s efficacy and properties of the network on which it acts has not yet been fully understood. We study here PageRank’s performance on a network model supported by real data, and show that realistic temporal effects make PageRank fail in individuating the most valuable nodes for a broad range of model parameters. Results on real data are in qualitative agreement with our model-based findings. This failure of PageRank reveals that the static approach to information filtering is inappropriate for a broad class of growing systems, and suggest that time-dependent algorithms that are based on the temporal linking patterns of these systems are needed to better rank the nodes. received: 06 June 2015 accepted: 10 August 2015 Published: 10 November 2015 OPEN Published on Nature Scientific Reports on 10 November 2015.
  • 5. PageRank: recap Most popular ranking algorithm for unipartite directed networks. Invented for Google’s search algorithm Also used for the ranking of: scholarly papers images in search urban roads according to traffic flow proteins in their interaction network etc. A node is important if it is pointed by other important nodes. pij = (1 − γ) wij sout i + γ 1 N
  • 6. One PageRank fits all? What is the relation between PageRank’s efficacy and the properties of the network? PageRank: static approach PageRank discards temporal information works as if nodes appear all at the same time well-known bias towards old nodes Theoretical models and real networks can exhibit strong temporal patterns. Are there circumstances under which the algorithm is doomed to fail?
  • 7. Outline Introduction A growing network model: the Relevance Model Real data analysis Conclusions
  • 8. The Relevance Model (RM) What is the Relevance Model? growing directed network model with preferential attachment and relevance generalizes the classical Barabási-Albert model introduced by [Medo, 2011] Why using RM? model that best explains the linking patterns in real networks used to model WWW, citation and technological networks
  • 9. Relevance Model features 1. Preferential attachment similar to the Barabási-Albert model Matthew effect: the rich get richer significant difference: existing nodes also create new links 2. Fitness quality parameter assigned to each node node’s inherent competence in attracting new incoming links concept formerly explored in [Bianconi-Barabási, 2001] 3. Relevance and activity Relevance: capacity of attracting new links over time Activity: rate at which the node generates new outgoing links 4. Temporal decay Relevance and activity both decay with time Monotonous function of choice (exponential, power law) Real-world phenomenon: nodes lose relevance over time
  • 10. Relevance decay: a real-world example 0.01 0.1 1 10 100 1000 5 10 15 20 25 30 35 40 45 50 relevance years after publication indegree>100 10<=indegree<100 indegree<10 emporal decay of the average relevance r(t) in the APS datas color online). Symbols represent the average relevance of nodes belongi or bars represent the error of the mean, lines represent the fits described on-monotonous part of the relavance profile is ignored by the fitting Figure: Temporal decay of the average (empirical) relevance r(t) of papers in the American Physical Society citation network (1893-2009). This behaviour has been formerly highlighted in [Medo, 2011].
  • 11. How to build a network with the Relevance Model At each discrete time interval t, the generation algorithm proceeds as follows: 1. a new node is created and connected to an existing node i, chosen with probability Πin i (t). 2. If t > 10, then m = 10 existing nodes are sequentially chosen with probability Πout i (t) and become active: each selected node creates one outgoing link it selects a node j as a target with probability Πin j (t) No multiple links. No self loops.
  • 12. Relevance Model: Link generation mechanism The probability for the node i to be the target of a new link created at time t is: Πin i (t) ∼ (kin i (t) + 1) ηi fR(t − τi) kin i (t): current indegree of node i ηi: fitness of i τi: time at which i enters the network fR: monotonously decaying function of time Ri(t) := ηifR(t − τi): relevance of node i at time t.
  • 13. Relevance Model: Active nodes selection In the RM, nodes continue being active and generate outgoing links continually. Probability for node i to be chosen as an active node at time t: Πout i (t) ∼ Ai fA(t − τi) Ai: activity parameter τi: time at which i enters the network fA: monotonously decaying function of time
  • 14. Effects of relevance decay in the RM Slow or absent relevance decay recent nodes receive few links because of preferential attachment PageRank’s bias towards old nodes in scale-free networks Fast relevance decay preferential attachment compensated by decay of relevance of old nodes recent nodes can reach high indegree recent nodes mostly point to other recent nodes, because of relevance decay of older nodes old nodes point to nodes of every age because of activity
  • 15. What makes a ranking algorithm “good”? A good ranking algorithm is expected to produce an unbiased ranking where both recent and old nodes have the same chance to appear at the top. In growing networks with temporal effects, PageRank can fail to achieve this. Let’s compare PageRank with the elementary indegree ranking.
  • 16. PageRank time bias: numerical simulation of RM om/scientificreports/ 0 2000 4000 6000 8000 10000 10 100 1000 10000 averageτoftop1%nodes θR indegree pageRank no bias Figure 2. PageRank time bias. We show here the average entrance time τ of the top 1% nodes of the node ranking by indegree and PageRank, respectively, as a function of the relevance decay parameter θR. Networks of =N 10000 nodes are grown with the RM with slow decay of activity (θ = NA ). Two limits of PageRank bias are visible: (1) When the decay of relevance is fast θ θ( )R A , a large number of top nodes are recent as a consequence of the network structure demonstrated in Fig. 1; (2) When the decay of relevance is slow (θ ∼ NR ), top nodes are old because the old nodes can be pointed by nodes of every age. While the latter bias is common to PageRank and indegree, the former bias is specific to PageRank because of its network nature. age relevance decayfast slow oldrecent Relevance decays as fR(t) = exp(− t θR ). Activity decays exponentially, but very slowly (θA = N). N = 10000 Figure: Average time of entrance of 1% of nodes of PageRank and indegree rankings, in the RM model.
  • 17. PageRank vs. indegree: correlation with fitness in RM in this limit as the average entrance time τ of the top-1% nodes is close to / =N 2 5000 which corre- sponds to the absence of time bias. (θ ∼ NR ), top nodes are old because the old nodes can be pointed by nodes of every age. While the latter bias is common to PageRank and indegree, the former bias is specific to PageRank because of its network nature. Figure 3. A comparison of performance of PageRank and indegree in the RM data (N=10,000. ρ(η)=exp(−η). The heatmap shows the ratio η η( , )/ ( , )r p r kin . The black dotted line represents the contour along which PageRank is not temporally biased (see Fig. S6, left). The upward bending of this contour is a finite-size effect. r(p, η): correlation PageRank-fitness r(kin , η): correlation indegree-fitness ρ(η) = exp(−η) ρ(A) = 2A−3 , A ∈ [1, ∞] Figure: Comparison of performance of PageRank and indegree (RM data). PageRank yields no improvement with respect to indegree. Diagonal: no temporal bias towards recent or old nodes.
  • 18. Outline Introduction A growing network model: the Relevance Model Real data analysis Conclusions
  • 19. Real networks studied in the paper Real networks studied (directed, unweighted): Digg.com: social bookmarking site Nodes: Digg users Edges: aij = 1 ⇔ “i is a follower of j”. N = 190 553; L = 1 552 905 American Physical Society (APS) articles and citation network Nodes: papers (from 1893 to 2009) Edges: aij = 1 ⇔ “i cites j” N = 450 056; L = 4 690 967 Estimator of node fitness: total relevance of node i Ti = t ri(t) To validate hypothesis of relevance and activity decay: measurement of empirical relevance (see appendix).
  • 20. PageRank performance on real data cientificreports/ 0 0.1 0.2 0.3 0.4 0.5 0.6 Digg.com Digg-calibrated RM APS APS-calibrated RM correlationwithtotalrelevanceT indegree PageRank Figure 5. A comparison of PageRank and indegree correlation with total relevance in real data and in calibrated simulations with the RM. PageRank is outperformed by indegree in both datasets (and in the corresponding calibrated simulations). In the Digg.com social network, the fitted relevance and activity power-law decay exponents are not far from the parameter region where PageRank is maximally correlated with indegree in numerical simulations with the RM with power-law decay (see Fig. S10), and PageRank’s and indegree’s correlation with total relevance are close to each other. By contrast, in the APS dataset activity decays immediately, whereas relevance decays progressively (see Fig. S2); as a consequence, PageRank is strongly biased towards old nodes (see Fig. S3) and is outperformed by indegree by a factor 2.58 [ ( , ) = .r p T 0 19 whereas ( , ) = .r k T 0 49in ]. We refer to the Supplementary Note S3 for details about the Digg.com: activity and relevance decay s.t. PageRank is maximally correlated with indegree in RM simulations with power-law decay. APS: activity decays immediately, relevance decays progressively. Figure: Comparison of PageRank and indegree correlation with total relevance Ti in real data. APS: PageRank strongly biased towards old nodes, because papers can only be cited by more recent papers.
  • 21. Outline Introduction A growing network model: the Relevance Model Real data analysis Conclusions
  • 22. Important findings PageRank can underperform w.r.t. indegree ranking Mismatch between relevance and activity decay timescales leads to time bias in PageRank: towards recent nodes if decay of relevance is faster towards old nodes if decay of activity is faster Findings are robust with respect to: form of decay function distribution of fitness among the nodes metric used to evaluate the algorithm Link timestamps are crucial for this analysis Method can not be applied to undirected networks
  • 23. In conclusion. . . PageRank, despite its popularity and robustness, can fail and thus it should not be used without carefully considering the temporal properties of the system to which it is to be applied.
  • 24. Bibliography Mariani M. S., Medo M., Zhang Y. Ranking nodes in growing networks: When PageRank fails. Scientific Reports 5, 16181; doi: 10.1038/srep16181 (2015). G. Bianconi, A. L. Barabási Competition and Multiscaling in evolving networks Europhysics Letters, Vol. 54 (2001), pp. 436-442 doi:10.1209/epl/i2001-00260-6 Medo M., Cimini G., Gualdi S. Temporal Effects in the Growth of Networks Phys. Rev. Lett. 107, 238701 (2011-12-01)
  • 26. Empirical relevance: definition The empirical relevance ri(t) of node i at time t is defined as: ri(t) = ni(t) nPA i (t) ni(t) = ∆kin i (t,∆t) L(t,∆t) : ratio between: ∆kin i (t, ∆t): # of incoming links received by node i in the time window [t, t + ∆t] L(t, ∆t): total # of links created within the same time window nPA i (t) = kin i (t) j kin j (t) : expected value of ni(t) according to preferential attachment alone ri(t) > 1 (< 1): node i at time t outperforms (underperforms) in the competition for incoming links with respect to its preferencial attachment weight.
  • 28. The Extended Fitness Model PageRank’s under-performance in time-dependent networks is a general feature. We can validate this using a model more compatible with the idea that a node is important if it’s pointed by other important nodes. Extended Fitness Model (EFM) High-fitness nodes are more sensitive to fitness than low-fitness nodes, when choosing their outgoing links. High-fitness nodes are then more likely to be pointed by other high-fitness nodes than low-fitness nodes. EFM is more favorable to PageRank than RM.
  • 29. EFM: sensitivity to fitness Probability Πin i;j(t) that a link created by node j at time t ends in node i: Πin i;j(t) ∼ (kin i (t) + 1)1−ηj η ηj i fR(t − τi) Fitness η ∈ [0, 1] to prevent negative exponents Πin depends on the fitness of the target and of the source nodes (difference with RM). kin i (t): indegree of node i at time t.
  • 30. PageRank vs. indegree: correlation with fitness in EFM scientificreports/ with indegree along theθ θ−R A diagonal where PageRank is not temporally biased and η η( , )/ ( , )r p r kin becomes close to, albeit always strictly lower than, one. When moving away from this diagonal, PageRank Figure 4. A comparison of PageRank and indegree correlation with fitness in the EFM data (N=10,000, H=250). The heatmap shows the ratio ( , )/ ( , )r p q r k qin . The white dotted line represents the contour where PageRank is not temporally biased (see Fig. S6, right). ρ(A) = 2A−3 , A ∈ [1, ∞] H nodes: high fitness; η ∈ [10−5 , 1] (N − H) nodes: low fitness; η ∈ [0, 10−5 ] N = 10000, H = 250 Figure: Comparison of performance of PageRank and indegree (EFM data).