VENUS: Vertex-Centric Streamlined
Graph Computation on a Single PC
Jiefeng Cheng¹, Qin Liu², Zhenguo Li¹,
Wei Fan¹, John C.S. Lui², Cheng He¹
¹ Huawei Noah’s Ark Lab
² The Chinese University of Hong Kong
ICDE’15
Graphs are everywhere
▪ We have large graphs
• Web graph
• Social graph
• User-movie ratings graph
• …
▪ Graph Computation
• PageRank
• Community detection
• ALS for collaborative filtering
• …
Mining from Big Graphs: two feasible ways
▪ Distributed systems
• Pregel [SIGMOD’10], GraphLab [OSDI’12], GraphX [OSDI’14], Giraph, ...
• Expensive clusters, complex setup, and the need to write distributed programs
▪ Single-machine systems
• Disk: GraphChi [OSDI’12], X-Stream [SOSP’13]
• SSD: TurboGraph [KDD’13], FlashGraph [FAST’15]
• Computation time close to distributed systems
  • PageRank on Twitter graph (41M nodes, 1.4B edges)
    • Spark: 8.1 min with 50 machines (each with 2 CPUs, 7.5 GB RAM) [Stanton KDD’12]
    • VENUS: 8 min on a single machine with a quad-core CPU and 16 GB RAM
• Affordable, easy to program/debug
Existing Systems
▪ Vertex-centric programming model: popularized by Pregel / GraphLab / GraphChi
• Each vertex updates itself based on its neighborhood
▪ GraphChi
• Updated data on each vertex must be propagated to its neighbors through disk
• Extensive disk I/O
▪ X-Stream
• Different API: edge-centric programming
• Less expressive; common algorithms must be re-implemented
• Also uses disk to propagate updates
Our Contributions
▪ Design and implement a disk-based system, VENUS
• A new vertex-centric streamlined processing model
• Separates mutable vertex data from immutable edge data
• Reads and writes less data than other systems
▪ Evaluation on large graphs
• Outperforms GraphChi and X-Stream
• Verifies that our design reduces data access
Vertex-Centric Programming
▪ Consider GraphChi
for each iteration
    for each vertex v
        update(v)

void update(v)
    fetch data from each in-edge
    update data on v
    spread data to each out-edge

[Figure: vertex v; label: Duplicated data]
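The loop above can be sketched as a toy, in-memory PageRank in which every edge carries its own copy of the source's contribution, which is where the duplicated data comes from. This is only an illustration: the names `Edge`, `update`, and `run`, and the damping factor 0.85, are assumptions and not the GraphChi API.

```python
from dataclasses import dataclass

DAMPING = 0.85  # assumed PageRank damping factor, not taken from the slides


@dataclass
class Edge:
    src: int
    dst: int
    value: float = 0.0  # per-edge copy of the source vertex's contribution


def update(v, rank, in_edges, out_edges):
    # fetch data from each in-edge
    incoming = sum(e.value for e in in_edges[v])
    # update data on v
    rank[v] = (1 - DAMPING) + DAMPING * incoming
    # spread data to each out-edge: the same value is written once per edge
    share = rank[v] / len(out_edges[v]) if out_edges[v] else 0.0
    for e in out_edges[v]:
        e.value = share


def run(vertices, edge_pairs, iterations=10):
    edges = [Edge(s, d) for s, d in edge_pairs]
    in_edges = {v: [e for e in edges if e.dst == v] for v in vertices}
    out_edges = {v: [e for e in edges if e.src == v] for v in vertices}
    rank = {v: 1.0 for v in vertices}
    # seed every edge with its source's initial contribution
    for v in vertices:
        for e in out_edges[v]:
            e.value = rank[v] / len(out_edges[v])
    for _ in range(iterations):   # for each iteration
        for v in vertices:        # for each vertex v
            update(v, rank, in_edges, out_edges)
    return rank, edges
```

A vertex with k out-edges writes the same number k times, so the mutable data grows with the edge count rather than the vertex count.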
Vertex-Centric Programming
▪ VENUS:
• Only store mutable values on vertices
▪ Pros
• Less data access
• Enables “streamlined” processing
▪ Cons
• Limited expressiveness
void update(v)
    fetch data from each in-edge
    update data on v
    spread data to each out-edge

[Figure: vertex v; label: in-neighbor]
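One reading of the "in-neighbor" label is that, with mutable values stored only on vertices, an update reads the values of v's in-neighbors and writes nothing to edges. The sketch below illustrates that reading with a toy PageRank; the function names, the synchronous sweep, and the damping factor are assumptions, not the VENUS API.

```python
DAMPING = 0.85  # assumed PageRank damping factor, not taken from the slides


def update(v, values, in_neighbors, out_degree):
    # fetch data from each in-neighbor's vertex value (no per-edge values)
    incoming = sum(values[u] / out_degree[u] for u in in_neighbors[v])
    # update data on v; nothing is spread to out-edges
    return (1 - DAMPING) + DAMPING * incoming


def run(in_neighbors, iterations=10):
    # in_neighbors maps every vertex to the list of its in-neighbors
    vertices = sorted(in_neighbors)
    # out-degrees are part of the immutable graph structure
    out_degree = {v: 0 for v in vertices}
    for v in vertices:
        for u in in_neighbors[v]:
            out_degree[u] += 1
    values = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        # synchronous sweep: new values depend only on the previous iteration
        values = {v: update(v, values, in_neighbors, out_degree)
                  for v in vertices}
    return values
```

The only mutable state here is one value per vertex, which is what makes the smaller v-shards possible.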
VENUS Architecture
▪ Disk storage (offline)
• Sharding
• Separation of edge data and vertex data
▪ Computing model (online)
• Load edge data sequentially
• Execute the update function on each vertex
• How to load vertex data and propagate updates
Sharding
▪ Graph cannot fit in RAM?
• Split the graph into shards
▪ Each shard corresponds to an interval of vertices:
• G-shard: immutable structure of graph
  • In-edges of nodes in the interval
• V-shard: mutable vertex values
  • Vertex values of all vertices in the shard
▪ Structure table: all g-shards
▪ Value table: all vertex data

Vertex ID  1  2  3  4  5  6  7  8  9  10  11  12
Data
Interval  I1=[1,4]          I2=[5,8]          I3=[9,12]
G-shard   7,9,10 → 1        6,7,8,11 → 5      2,3,4,10,11 → 9
          6,10 → 2          1,10 → 6          11 → 10
          1,2,6 → 3         3,10,11 → 7       4,6 → 11
          1,2,6,7,10 → 4    3,6,11 → 8        2,3,9,10,11 → 12
V-shard   I1∪{6,7,9,10}     I2∪{1,3,10,11}    I3∪{2,3,4,6}
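The example above can be reproduced with a short sketch that derives each g-shard and v-shard from the in-edge lists and the three intervals. The function name `build_shards` and the in-memory dictionaries are illustrative assumptions; the real system stores these structures on disk.

```python
# In-edge lists copied from the example: destination -> list of sources
IN_EDGES = {
    1: [7, 9, 10], 2: [6, 10], 3: [1, 2, 6], 4: [1, 2, 6, 7, 10],
    5: [6, 7, 8, 11], 6: [1, 10], 7: [3, 10, 11], 8: [3, 6, 11],
    9: [2, 3, 4, 10, 11], 10: [11], 11: [4, 6], 12: [2, 3, 9, 10, 11],
}
INTERVALS = [(1, 4), (5, 8), (9, 12)]


def build_shards(in_edges, intervals):
    shards = []
    for lo, hi in intervals:
        interval = set(range(lo, hi + 1))
        # g-shard: in-edges of the vertices in the interval (immutable)
        g_shard = {v: in_edges[v] for v in sorted(interval)}
        # v-shard: the interval plus every source vertex those edges refer to
        sources = {u for srcs in g_shard.values() for u in srcs}
        v_shard = sorted(interval | sources)
        shards.append((g_shard, v_shard))
    return shards


shards = build_shards(IN_EDGES, INTERVALS)
for (lo, hi), (g_shard, v_shard) in zip(INTERVALS, shards):
    extra = [u for u in v_shard if not lo <= u <= hi]
    print(f"[{lo},{hi}] vertices outside the interval: {extra}")
# [1,4] vertices outside the interval: [6, 7, 9, 10]
# [5,8] vertices outside the interval: [1, 3, 10, 11]
# [9,12] vertices outside the interval: [2, 3, 4, 6]
```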
Vertex-Centric Streamlined Processing
▪ V-shards are much smaller than g-shards
• Load each v-shard entirely into memory
▪ Scan each g-shard sequentially
• Execute the update function in parallel
Execution
Load v-shard 1
    7,9,10 → 1
    6,10 → 2
    1,2,6 → 3
    1,2,6,7,10 → 4
Update v-shard 1
Load v-shard 2
    6,7,8,11 → 5
    1,10 → 6
    3,10,11 → 7
    3,6,11 → 8
Update v-shard 2
Load v-shard 3
    2,3,4,10,11 → 9
    11 → 10
    4,6 → 11
    2,3,9,10,11 → 12
Update v-shard 3

[Figure: Loading and Execution stages]
Parallelize execution and loading
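A minimal sketch of overlapping loading with execution: while one shard is being processed, a background thread already loads the next v-shard. Everything here is an assumption for illustration (the helper names, dictionaries standing in for disk, one loader thread, and a synchronous iteration that reads only the previous iteration's values).

```python
from concurrent.futures import ThreadPoolExecutor


def load_v_shard(value_table, vertex_ids):
    # stand-in for disk I/O: copy the needed vertex values into memory
    return {u: value_table[u] for u in vertex_ids}


def execute(g_shard, v_values, update):
    # scan the g-shard sequentially and apply update to each vertex
    return {v: update(v, srcs, v_values) for v, srcs in g_shard.items()}


def run_iteration(shards, value_table, update):
    """shards: list of (g_shard, v_shard_ids); value_table is only read."""
    new_values = dict(value_table)
    if not shards:
        return new_values
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load_v_shard, value_table, shards[0][1])
        for i, (g_shard, _) in enumerate(shards):
            v_values = pending.result()  # wait for the current v-shard
            if i + 1 < len(shards):      # start loading the next one now
                pending = loader.submit(
                    load_v_shard, value_table, shards[i + 1][1])
            new_values.update(execute(g_shard, v_values, update))
    return new_values


# Toy update: propagate the minimum label along in-edges
def min_label(v, srcs, vals):
    return min([vals[v]] + [vals[u] for u in srcs])


toy_shards = [({1: [2], 2: [1]}, [1, 2]), ({3: [4], 4: [3]}, [3, 4])]
print(run_iteration(toy_shards, {1: 1, 2: 2, 3: 3, 4: 4}, min_label))
# {1: 1, 2: 1, 3: 3, 4: 3}
```

Writing results to a separate table keeps the loader thread and the update step from touching the same data.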
Load and Update v-shards
▪ Two I/O-efficient algorithms
• Algorithm 1: Extension of PSW (parallel sliding windows) in GraphChi (skip)
• Algorithm 2: Merge-Join
  • Load: merge-join between value table and v-shard
  • Update: write values of [1,4] back to value table
▪ Use value buffer to cache value table

Value table on disk:
ID    1  2  3  4  5  6  7  8  9  10  11  12
Data

Vertices in v-shard 1 on disk:
ID    1  2  3  4  6  7  9  10

Loaded v-shard 1:
ID    1  2  3  4  6  7  9  10
Data
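The merge-join load can be sketched as a single sequential pass over the value table that picks out the IDs listed in the v-shard, assuming both are sorted by vertex ID. The function names and the list-of-pairs representation are assumptions for illustration; the slides do not give the on-disk layout.

```python
def merge_join_load(value_table, v_shard_ids):
    """value_table: (id, value) pairs sorted by id; v_shard_ids: sorted ids."""
    loaded = []
    ids = iter(v_shard_ids)
    wanted = next(ids, None)
    for vid, value in value_table:      # one sequential pass
        while wanted is not None and wanted < vid:
            wanted = next(ids, None)    # id missing from the table: skip it
        if wanted is None:
            break                       # every requested id has been handled
        if vid == wanted:
            loaded.append((vid, value))
            wanted = next(ids, None)
    return loaded


def write_back(value_table, interval, updated):
    """Return the table with values of vertices in [lo, hi] replaced."""
    lo, hi = interval
    return [(vid, updated[vid]) if lo <= vid <= hi else (vid, val)
            for vid, val in value_table]


table = [(i, i * 10) for i in range(1, 13)]   # made-up values for IDs 1..12
v_shard_1 = [1, 2, 3, 4, 6, 7, 9, 10]
print(merge_join_load(table, v_shard_1))
# [(1, 10), (2, 20), (3, 30), (4, 40), (6, 60), (7, 70), (9, 90), (10, 100)]
```

Only the values of the interval [1,4] are written back, because the other vertices in v-shard 1 are updated by their own shards.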
Evaluation of VENUS
▪ Setup: a commodity PC
• Quad-core 3.4 GHz CPU
• 16 GB RAM and 4 TB hard disk
▪ Main competitors:
• GraphChi and X-Stream
▪ Applications:
• PageRank
• WCC: weakly connected components
• CD: community detection
• ALS: alternating least squares for collaborative filtering
• Shortest path, label propagation, etc.
PageRank on Twitter
▪ Twitter follow-graph: 41M nodes, 1.4B edges
Cost of Update Propagation: Data Write and Read
Applications: WCC, CD, ALS
CD could not be implemented on X-Stream because of its edge-centric programming model
Web-Scale Graph
▪ ClueWeb12: web-scale graph
• 978 million nodes, 42.5 billion edges
• 402 GB on disk
• 2 iterations of PageRank
▪ Computation time
• GraphChi: 4.3 hours
• X-Stream: 7.4 hours
• VENUS-I: 2 hours
• VENUS-II: 1.8 hours
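For reference, the relative speedups implied by these times can be computed directly. The snippet only divides the reported hours; the dictionary and function names are illustrative.

```python
# Reported ClueWeb12 PageRank times, in hours
hours = {"GraphChi": 4.3, "X-Stream": 7.4, "VENUS-I": 2.0, "VENUS-II": 1.8}


def speedup(baseline, system):
    # how many times faster `system` is than `baseline`
    return hours[baseline] / hours[system]


for baseline in ("GraphChi", "X-Stream"):
    for system in ("VENUS-I", "VENUS-II"):
        print(f"{system} vs {baseline}: {speedup(baseline, system):.2f}x")
# VENUS-I vs GraphChi: 2.15x
# VENUS-II vs GraphChi: 2.39x
# VENUS-I vs X-Stream: 3.70x
# VENUS-II vs X-Stream: 4.11x
```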
Conclusion
▪ Present a disk-based graph computation system, VENUS
▪ Our design of graph storage and execution reduces data access and I/O
▪ Evaluations show it outperforms GraphChi and X-Stream
▪ VENUS can also handle billion-scale problems
Thank you!
Q&A

