Thanks: A major part of this work was done during a
visit to Twitter’s Personalization and
Recommendations team (Fall 2012).

DrunkardMob: Billions of
Random Walks on Just a PC
Aapo Kyrola
Carnegie Mellon University
Twitter: @kyrpov
Big Data – small machine
DrunkardMob - RecSys '13
This work in a Nutshell
1. Background: Random walk-based methods are popular in Recommender Systems.
2. Research problem: How to simulate random walks if your graph does not fit in memory?
3. Solution: Instead of doing one walk at a time, do billions of them at a time. Stream the graph from disk and maintain walk states in RAM.
Contents
• Introduction to random walks
• Disk-based graph systems: GraphChi
• DrunkardMob algorithm
• Experiments

All code available on GitHub:
http://github.com/graphchi/graphchi-java
Introduction: Random Walks
• Graph: G(V, E)
– V = vertices / nodes, E = edges / links.

• A walk is a random sequence of t visits to vertices:
w := source(0) → v(1) → v(2) → v(3) → … → v(t)

• Walks follow edges by default, but can also reset or teleport with a certain probability.
– Transition probability: P(v(k+1) | v(k))
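To make the transition probability concrete, here is a minimal sketch for a walk that follows out-edges uniformly and resets to its source with a fixed probability. The function name `transition_probs`, the adjacency-dict representation, the default reset probability of 0.15, and the rule that dead ends always reset are all assumptions made for illustration; none of them come from the slides.

```python
def transition_probs(adj, v, source, reset_prob=0.15):
    """Return {next_vertex: P(next_vertex | v)} for a walk that resets to `source`."""
    neighbors = adj.get(v, [])
    if not neighbors:
        # Assumption: a vertex with no out-edges always resets to the source.
        return {source: 1.0}
    probs = {source: reset_prob}
    step = (1.0 - reset_prob) / len(neighbors)
    for u in neighbors:
        # Accumulate, since the source may also be an ordinary neighbor.
        probs[u] = probs.get(u, 0.0) + step
    return probs


# Hypothetical 4-vertex graph: vertex 0 has three out-links.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: []}
p = transition_probs(adj, v=0, source=0)
print(p)
```

With three out-links, each link gets probability (1 - reset_prob) / 3, which is the same split the PageRank figures in this deck show.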
Introduction (cont.)
• Usually we are interested in the distribution of the visits.
– Either the global distribution or the distribution for each source separately.
– Many applications (PageRank, FolkRank, SALSA, ...)

• Can be used to generate candidates:
– Choose the top K visited vertices as candidates to recommend.
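Candidate generation from visit counts can be sketched in a few lines. The helper name `top_k_candidates` and the `exclude` parameter (for example, to drop the source itself) are assumptions for illustration, and the visit log is made-up data.

```python
from collections import Counter


def top_k_candidates(visits, k, exclude=()):
    """Return the k most-visited vertices, skipping any in `exclude`."""
    excluded = set(exclude)
    ranked = [v for v, _ in Counter(visits).most_common() if v not in excluded]
    return ranked[:k]


# Hypothetical visit log of walks started from user 0.
visits = [3, 5, 3, 7, 0, 3, 5, 9, 0, 5, 3]
candidates = top_k_candidates(visits, k=2, exclude={0})
print(candidates)
```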
Example: Global PageRank
• Model: a random surfer who starts from a random webpage and clicks each link on the page with uniform probability:
– With probability d, teleports to a random vertex → infinite walk.

[Figure: from the current vertex, the surfer teleports to “any vertex” with P = d, or follows one of three out-links with P = (1-d) / 3 each.]

• PageRank(web page) ~ authority of the web page.

Can be computed using “power iteration” very efficiently (in seconds or minutes even for graphs with billions of vertices) → Not interesting.
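For reference, power iteration for global PageRank can be sketched as below. This is a plain in-memory illustration, not the code used in the talk. It follows the slide's convention that d is the teleport probability; the function name, the iteration count, and the uniform handling of vertices without out-links are assumptions. Every vertex must appear as a key of `adj`.

```python
def pagerank(adj, d=0.15, iterations=50):
    """Power iteration; d is the teleport probability, as on the slide."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        nxt = {v: d / n for v in adj}           # mass arriving by teleport
        for v, out in adj.items():
            # Assumption: a vertex without out-links spreads its mass uniformly.
            targets = out if out else list(adj)
            share = (1.0 - d) * rank[v] / len(targets)
            for u in targets:
                nxt[u] += share
        rank = nxt
    return rank


# On a directed 3-cycle every vertex is equally important.
cycle = {0: [1], 1: [2], 2: [0]}
ranks = pagerank(cycle)
print(ranks)
```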
Personalized PageRank
• PageRank | home (source) nodes:
– Compute the PageRank vector for each node separately → resets only to the home node(s).
– Restrict home nodes to some category / topic / pages visited by a user.

• Used e.g. for social network recommendations.

[Figure: the walk resets to the home vertex with P = d, or follows one of three out-links with P = (1-d) / 3 each.]
Personalized PageRank (cont.)
• Naïve computation of Personalized PageRank (PPR):
– Compute the PageRank vector for each source separately using power iteration: O(n^2)

• Approximate by sampling:
– Simulate actual walks on the graph.
Random walk in an in-memory graph
• Compute one walk at a time (multiple in parallel, of course):
    parfor walk in walks:
        for i = 1 to (number of hops):
            vertex = walk.atVertex()
            walk.takeStep(vertex.randomNeighbor())
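The in-memory loop above can be turned into a small sampling estimator for PPR. This is a sketch under assumptions: the name `estimate_ppr`, the defaults (1000 walks of 10 hops, reset probability 0.15), and the rule that dead ends reset to the home vertex are illustrative choices. The walks run sequentially here, where the slide uses a parallel loop.

```python
import random
from collections import Counter


def estimate_ppr(adj, source, num_walks=1000, hops=10, d=0.15, seed=0):
    """Estimate the PPR vector of `source` from visit frequencies."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):              # the slide's "parfor walk in walks"
        vertex = source
        for _ in range(hops):               # the slide's inner loop over hops
            neighbors = adj.get(vertex, [])
            if not neighbors or rng.random() < d:
                vertex = source             # reset to the home vertex
            else:
                vertex = rng.choice(neighbors)
            visits[vertex] += 1
    total = sum(visits.values())
    return {v: count / total for v, count in visits.items()}


# Hypothetical graph: vertex 3 links to 0 but nothing links to 3.
adj = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
ppr = estimate_ppr(adj, source=0)
print(sorted(ppr.items()))
```

Every step of this loop touches the adjacency list of an arbitrary vertex, which is exactly the access pattern that becomes expensive once the graph lives on disk.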
Problem: What if the graph does not fit in memory?

[Figure: Twitter network visualization, by Akshay Java, 2009]

Disk-based “single-machine” graph systems (this talk):
- “Paging” from disk is costly.

Distributed graph systems:
- Each hop across a partition boundary is costly.
DISK-BASED GRAPH SYSTEMS
Disk-based Graph Systems
• Frameworks that can handle graphs with billions of edges on a single machine, using disk, have recently been proposed:
– GraphChi (Kyrola, Blelloch, Guestrin: OSDI’12)
– TurboGraph (KDD’13)
– [X-Stream (SOSP’13) – model not suitable]

• We assume a vertex-centric model:
– Computation is done one vertex at a time.
GraphChi execution model
[Figure: vertices 1 … n are split into P intervals, interval(1), interval(2), …, interval(P); each interval(p) is paired with shard(p).]

For T iterations:
    For p = 1 to P:
        For vertex in interval(p):
            updateFunction(vertex)
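The execution model can be mimicked in memory as follows. This is a simplified sketch, not GraphChi's implementation: the even split of vertex ids into intervals, keeping shards as Python lists, and the helper names `make_intervals`, `build_shards` and `run` are all assumptions. In a real disk-based system each shard would be a file that is loaded only while its interval is being processed.

```python
def make_intervals(num_vertices, P):
    """Split vertex ids 0..num_vertices-1 into at most P contiguous intervals."""
    size = -(-num_vertices // P)  # ceiling division
    return [range(lo, min(lo + size, num_vertices))
            for lo in range(0, num_vertices, size)]


def build_shards(edges, intervals):
    """shard(p) holds the edges whose destination lies in interval(p)."""
    shards = [[] for _ in intervals]
    for src, dst in edges:
        for p, interval in enumerate(intervals):
            if dst in interval:
                shards[p].append((src, dst))
                break
    return shards


def run(num_vertices, edges, P, T, update_function):
    intervals = make_intervals(num_vertices, P)
    shards = build_shards(edges, intervals)
    for _ in range(T):                            # For T iterations
        for p, interval in enumerate(intervals):  # For p = 1 to P
            in_edges = {}
            for src, dst in shards[p]:            # "load" shard(p)
                in_edges.setdefault(dst, []).append(src)
            for vertex in interval:               # For vertex in interval(p)
                update_function(vertex, in_edges.get(vertex, []))


# Example update function: record the in-degree of each vertex.
indegree = {}

def record_indegree(vertex, sources):
    indegree[vertex] = len(sources)

edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 0)]
run(num_vertices=4, edges=edges, P=2, T=1, update_function=record_indegree)
print(indegree)
```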
A random walk is often called a “Drunkard’s Walk”.

DRUNKARDMOB ALGORITHM
DrunkardMob: Basic Idea
• By example:
– Task: Compute personalized PageRank (PPR) for 1 million users in a social network, in parallel
• I.e., 1M different home/source nodes

– For each user, launch 1000 random walks (with resets), in parallel
• Each walk takes 10 hops
~ Equivalent to one 10,000-hop walk (with resets) per user

– For each user, keep track of the visits made by its 1000 short walks → PPR for each user.
– Store the state of each walk in RAM, process the graph from disk.
= 1B random walks in parallel → ~5 GB of RAM.
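The numbers in this example can be checked with a few lines of arithmetic. Reading "~5 GB" as 5 × 10^9 bytes is an assumption.

```python
users = 1_000_000
walks_per_user = 1_000
hops_per_walk = 10

total_walks = users * walks_per_user             # walks alive at the same time
hops_per_user = walks_per_user * hops_per_walk   # hops sampled per user
ram_bytes = 5 * 10**9                            # "~5 GB", decimal GB assumed
bytes_per_walk = ram_bytes / total_walks

print(total_walks, hops_per_user, bytes_per_walk)
```

This works out to about 5 bytes of RAM per walk, which is why the per-walk state has to be very compact.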
Random walks in GraphChi
• DrunkardMob algorithm
– Reverse thinking: instead of following one walk at a time, visit one vertex at a time and advance every walk currently at that vertex.
    ForEach interval p:
        walkSnapshot = getWalksForInterval(p)
        ForEach vertex in interval(p):
            mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
            ForEach walk in mywalks:
                walkManager.addHop(walk, vertex.randomNeighbor())

Note: Need to store only the current position of each walk!
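The loop above can be sketched in memory as follows. This is an illustration under assumptions, not the GraphChi implementation: the function name `drunkardmob`, the dict-of-lists walk table, the rule that each pass over all intervals advances every walk by exactly one hop, and the reset-on-dead-end rule are all choices made for the sketch. Each walk is stored only as its source id, filed under its current vertex.

```python
import random
from collections import Counter, defaultdict


def drunkardmob(adj, intervals, sources, walks_per_source, hops, d=0.15, seed=0):
    rng = random.Random(seed)
    # walks_at[v] holds one source id per walk currently located at v.
    walks_at = defaultdict(list)
    for s in sources:
        walks_at[s].extend([s] * walks_per_source)
    visits = {s: Counter() for s in sources}

    for _ in range(hops):
        next_walks = defaultdict(list)
        for interval in intervals:                       # ForEach interval p
            # Snapshot of the walks sitting in this interval.
            snapshot = {v: walks_at.pop(v) for v in interval if v in walks_at}
            for vertex in interval:                      # ForEach vertex
                out = adj.get(vertex, [])                # adjacency needed only now
                for source in snapshot.get(vertex, []):  # ForEach walk
                    if not out or rng.random() < d:
                        dst = source                     # reset to the walk's source
                    else:
                        dst = rng.choice(out)
                    visits[source][dst] += 1
                    next_walks[dst].append(source)       # addHop
        walks_at = next_walks
    return visits


adj = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
intervals = [range(0, 2), range(2, 4)]
visits = drunkardmob(adj, intervals, sources=[0, 3], walks_per_source=100, hops=5)
print({s: dict(c) for s, c in visits.items()})
```

The adjacency list of a vertex is read only while its interval is being processed, so the graph can be streamed interval by interval while the walk table stays in RAM.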
WalkManager
• Store walks in buckets
– An array for each vertex would cost too much.
Encoding walks

Only 4 bytes / walk.

Keeps track of each path → knowledge base applications.
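One way to fit a walk into 4 bytes, combined with bucketed storage, is to pack a source index and the walk's offset inside its bucket into one 32-bit integer. The bit layout below (24-bit source index, 1 flag bit, 7-bit offset, buckets of 128 vertices) is a hypothetical layout chosen for illustration; the slide does not spell out the actual encoding.

```python
BUCKET_SIZE = 128        # assumed: 7-bit offset inside a bucket
MAX_SOURCES = 1 << 24    # assumed: 24-bit source index


def encode(source_idx, hop_flag, offset):
    """Pack one walk into a 32-bit integer (hypothetical layout)."""
    if not (0 <= source_idx < MAX_SOURCES and 0 <= offset < BUCKET_SIZE):
        raise ValueError("field out of range")
    return (source_idx << 8) | (int(hop_flag) << 7) | offset


def decode(walk):
    """Return (source_idx, hop_flag, offset)."""
    return walk >> 8, bool((walk >> 7) & 1), walk & 0x7F


def locate(vertex):
    """Return (bucket index, offset inside the bucket) for a vertex id."""
    return divmod(vertex, BUCKET_SIZE)


bucket, offset = locate(1000)
walk = encode(source_idx=123_456, hop_flag=True, offset=offset)
print(bucket, offset, walk, decode(walk))
```

The bucket index is implied by where the walk is stored, so only the offset needs to live inside the encoded value. With this layout the number of distinct sources is capped at 2^24.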
Keeping track of walks
[Figure: components shown are GraphChi with its current execution interval, the vertex walks table (WalkManager), and the walk distribution tracker (DrunkardCompanion), which holds the top-N visits for each source (Source A, Source B).]
Keeping track of Walks
• If we don’t have enough RAM to store the distributions:
– Cut long tails: a similar problem to estimating top-K frequent items in data streams with limited memory.

• Can also write hops to disk (bucket-by-bucket) and analyze later.
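The slide only names the problem class. One standard algorithm for frequent items with bounded memory is Misra-Gries, sketched here as an illustration; it is not necessarily what DrunkardMob uses, and the visit stream is made-up data.

```python
def misra_gries(stream, k):
    """Keep at most k-1 counters.

    Any item occurring more than len(stream)/k times is guaranteed to survive,
    and each surviving count underestimates the true count by at most len(stream)/k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical visits for one source: two heavy vertices and a long tail.
visits = [7] * 50 + [3] * 30 + list(range(100, 120))
summary = misra_gries(visits, k=4)
print(summary)
```

The long tail of rarely visited vertices is dropped, while the heavily visited vertices, which are the ones that matter for top-N recommendations, are retained.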
Validity
• We assume that simulating 2000 x 5-hop walks with resets ~ one 10000-hop walk with resets.
– Not exactly the same distribution: some longer streaks are not covered.
• But those would not be relevant anyway for recommendations!

– See Fogaras et al. (2005) for analysis.
Related Work
• Fogaras, Racz, Csalogany, Sarlos:
“Towards scaling fully personalized
pagerank: Algorithms, lower bounds,
experiments” (2005)
– Similar idea with full external memory
implementation.
• We keep walks in memory.

• Plenty of research in approximating PPR.

See paper for more
experiments!

EXPERIMENTS

Case Study: Twitter WTF
• Implemented Twitter’s Who-to-Follow algorithm on GraphChi (see paper)
– Based on the WWW’13 paper by Gupta et al.
– Use DrunkardMob to generate a set of candidates to recommend for each user.
PPR: Full Twitter Graph
[Figure: results on a large server with SSD and 144 GB of memory.]

On a Mac laptop, could estimate 500K-1M PPRs (= 0.5-1B walks) in roughly the same time.
Runtime / Graph size

Running time is roughly linear in graph size.
Comparison to in-memory walks

Competitive with in-memory walks. However, if you can fit your graph in memory, there is no need for DrunkardMob.
Summary
• DrunkardMob allows simulating random walks efficiently on extremely large graphs
– Uses the bulk of RAM for keeping track of walks; the graph is streamed from disk.
– Graph size is not limited by RAM.
– Implement Twitter Who-To-Follow on your laptop!

• Future work: Adapt to distributed graph systems.
– Even Hadoop if you really really want.
Thank You!
• Code: http://github.com/graphchi/graphchi-java
Aapo Kyrölä
Ph.D. candidate @ CMU
http://www.cs.cmu.edu/~akyrola
Twitter: @kyrpov

Special thanks to Pankaj Gupta, Dong Wang, Aneesh Sharma and Jayarama Shenoy @ Twitter.


Editor's notes

  1. ----- Meeting Notes (10/15/13 17:19) -----
  2. So how would we do this if we could fit the graph in memory?