Dynamic graph / iterative computation on Apache Giraph. Apache Giraph is an iterative graph processing system inspired by Pregel that runs on Hadoop. It models problems as a series of messages passed between graph vertices. Giraph has been used for applications such as PageRank, affinity propagation, and mutual friends calculation. It provides faster computation than Hive and scales to trillions of edges. Future work includes automatic checkpointing and investigating alternative computing models.
3. Apache Giraph
• Inspired by Google’s Pregel but runs on Hadoop
• “Think like a vertex”
• Maximum value vertex example
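[Figure: maximum-value example. Vertices spread across Processor 1 and Processor 2 exchange values over time until every vertex holds the maximum value, 5.]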
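The maximum-value example can be sketched as a single-process simulation of supersteps. This is a minimal illustration in Python, not Giraph's Java API; the function name, the dict-based graph, and the sample values are assumptions made for the sketch.

```python
def max_value_supersteps(values, edges):
    """Simulate Pregel-style supersteps for the maximum-value example.

    values: dict mapping vertex -> initial value (hypothetical input format)
    edges:  dict mapping vertex -> list of neighbors it sends to
    Returns (final values, number of supersteps run).
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbors.
    inbox = {v: [] for v in values}
    for v, val in values.items():
        for n in edges.get(v, []):
            inbox[n].append(val)
    supersteps = 1
    # Keep running while any message is in flight.
    while any(inbox.values()):
        next_inbox = {v: [] for v in values}
        for v, msgs in inbox.items():
            # "Think like a vertex": adopt a larger value and tell the
            # neighbors; otherwise stay silent (vote to halt).
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                for n in edges.get(v, []):
                    next_inbox[n].append(values[v])
        inbox = next_inbox
        supersteps += 1
    return values, supersteps


# A four-vertex chain a - b - c - d with made-up starting values.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final, steps = max_value_supersteps({"a": 1, "b": 5, "c": 2, "d": 1}, chain)
print(final, steps)
```

A vertex that receives no larger value sends nothing, so the computation stops on its own once the maximum has reached every vertex.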
9. Affinity propagation
Frey and Dueck, “Clustering by passing messages between data points,” Science, 2007
Organically discover exemplars based on similarity
[Figure: three panels showing Initialization, Intermediate, and Convergence]
10. 3 stages
Responsibility r(i,k)
• How well suited is k to be an exemplar for i?
Availability a(i,k)
• How appropriate is it for point i to choose point k as an exemplar, given all of i’s responsibilities?
Update exemplars
• Based on known responsibilities/availabilities, which vertex should be my exemplar?
* Dampen responsibility, availability
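For reference, the three stages correspond to the update rules in the cited Frey and Dueck paper, restated below. The similarity s(i,k) and the damping factor λ are symbols from that paper rather than from these slides, and the slides do not say which λ is used.

```latex
\begin{align}
r(i,k) &\leftarrow s(i,k) - \max_{k' \neq k} \bigl\{ a(i,k') + s(i,k') \bigr\} \\
a(i,k) &\leftarrow \min\Bigl\{ 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0,\, r(i',k)\} \Bigr\}, \qquad i \neq k \\
a(k,k) &\leftarrow \sum_{i' \neq k} \max\{0,\, r(i',k)\} \\
\text{exemplar}(i) &= \arg\max_{k} \bigl\{ a(i,k) + r(i,k) \bigr\}
\end{align}
```

Damping replaces each message with a weighted mix of its previous and newly computed value, $x \leftarrow \lambda\, x_{\text{old}} + (1 - \lambda)\, x_{\text{new}}$, which helps avoid oscillation.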
11. Responsibility
Every vertex i with an edge to k maintains the responsibility r(i,k) of k for i
Vertex i sends it to k in a ResponsibilityMessage (senderid, responsibility(i,k))
[Figure: vertices A, B, C, D exchanging responsibility messages r(c,a), r(d,a), r(b,d), r(b,a)]
12. Availability
Vertex k sums the positive responsibility messages it receives
Vertex k sends availability to i in an AvailabilityMessage (senderid, availability(i,k))
[Figure: vertices A, B, C, D exchanging availability messages a(c,a), a(d,a), a(b,d), a(b,a)]
13. Update exemplars
Dampens availabilities and scans edges to find exemplar k
Updates self-exemplar
[Figure: vertices A, B, C, D each run an update; three end with exemplar=a and one with exemplar=d]
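The responsibility, availability, and update-exemplar stages can be sketched on a single machine as follows. This is a dense-matrix illustration in plain Python, not the Giraph implementation, which exchanges per-edge messages instead. The function name, the damping value of 0.5, the fixed iteration count, and the sample data are all assumptions.

```python
def affinity_propagation(S, damping=0.5, iterations=100):
    """Return, for each point i, the index of its chosen exemplar.

    S is a dense n x n similarity matrix (n >= 2) whose diagonal holds
    each point's "preference" for being an exemplar.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i,k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i,k)
    for _ in range(iterations):
        # Stage 1: responsibility. How well suited is k as exemplar for i,
        # compared with i's best alternative?
        for i in range(n):
            for k in range(n):
                best_other = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - best_other)
        # Stage 2: availability. Candidate k sums the positive
        # responsibilities it received and reports back to each i.
        for k in range(n):
            positive = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(positive)
            for i in range(n):
                if i == k:
                    new = total - positive[k]
                else:
                    new = min(0.0, R[k][k] + total - positive[i] - positive[k])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    # Stage 3: each point picks the k that maximizes a(i,k) + r(i,k).
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]


# Six points on a line forming two obvious groups (made-up data).
points = [0, 1, 2, 10, 11, 12]
S = [[-float((p - q) ** 2) for q in points] for p in points]
for i in range(len(points)):
    S[i][i] = -10.0  # assumed preference; lower values yield fewer exemplars
print(affinity_propagation(S))
```

Both responsibilities and availabilities are damped here, matching the "Dampen responsibility, availability" note earlier in the deck.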
25. Avoiding out-of-core
Example: Mutual friends calculation between neighbors
1. Send your friends a list of your friends
2. Intersect each received list with your own friend list
1.23B users (as of 1/2014)
200+ average friends (2011 S1)
8-byte ids (longs)
1.23B × 200 × 200 × 8 bytes = 394 TB of messages; at 100 GB per machine,
3,940 machines (not including the graph)
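The estimate above can be checked with a few lines of arithmetic. The variable names are illustrative, and the calculation assumes every user sends a full 200-id list to each of 200 friends in a single superstep.

```python
users = 1_230_000_000      # 1.23B
avg_friends = 200          # lower bound from "200+"
id_bytes = 8               # one long per friend id
machine_bytes = 100 * 10**9  # 100 GB of message memory per machine

# Each user sends avg_friends lists, each holding avg_friends ids.
message_bytes = users * avg_friends * avg_friends * id_bytes
machines = message_bytes / machine_bytes

print(f"{message_bytes / 1e12:.0f} TB, {machines:,.0f} machines")
# prints "394 TB, 3,936 machines"; the slide rounds this to 3,940
```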
[Figure: vertices A, B, C, D, E exchanging friend lists; each vertex is annotated with per-neighbor sets such as A:{D}, D:{A,E}, E:{D}]
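The two steps of the mutual friends calculation can be sketched as two simulated supersteps. This is a single-process Python illustration rather than Giraph code; the function name and the sample graph are assumptions.

```python
def mutual_friends(friends):
    """friends: dict mapping vertex -> set of friends (undirected graph).

    Returns, for each vertex v, a dict mapping each friend to the set of
    friends that v and that friend have in common.
    """
    # Superstep 1: send your friends a list of your friends.
    inbox = {v: [] for v in friends}
    for v, friend_set in friends.items():
        for f in friend_set:
            inbox[f].append((v, friend_set))
    # Superstep 2: intersect each received list with your own friend list.
    return {
        v: {sender: their & friends[v] for sender, their in msgs}
        for v, msgs in inbox.items()
    }


# Made-up four-vertex graph: a triangle A-B-C plus D attached to C.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(mutual_friends(graph)["A"])
```

Every message carries a full friend list, which is why the total message volume grows with the square of the average friend count.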
27. Superstep splitting
Process only a subset of source/destination edges per superstep
* Currently manual; making it automatic is future work
[Figure: four supersteps over vertex groups A and B]
Sources: A (on), B (off); Destinations: A (on), B (off)
Sources: A (on), B (off); Destinations: A (off), B (on)
Sources: A (off), B (on); Destinations: A (on), B (off)
Sources: A (off), B (on); Destinations: A (off), B (on)
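Superstep splitting can be sketched by running the mutual friends exchange once per (source group, destination group) pair, so that only part of the message volume is in memory at a time. This is an illustrative Python simulation; the function name, the group assignment, and the "ids in flight" counter are assumptions, not Giraph features.

```python
from itertools import product


def mutual_friends_split(friends, group_of, groups=("A", "B")):
    """Mutual friends with messages split across several supersteps.

    friends:  dict mapping vertex -> set of friends
    group_of: dict mapping vertex -> group label (hypothetical partition)
    Returns (result, peak number of ids in flight in any one superstep).
    """
    result = {v: {} for v in friends}
    peak_ids_in_flight = 0
    # One pass per (sources on, destinations on) combination:
    # (A, A), (A, B), (B, A), (B, B).
    for src_group, dst_group in product(groups, repeat=2):
        inbox = {}
        in_flight = 0
        for v, friend_set in friends.items():
            if group_of[v] != src_group:
                continue  # this source is "off" in this pass
            for f in friend_set:
                if group_of[f] == dst_group:  # destination is "on"
                    inbox.setdefault(f, []).append((v, friend_set))
                    in_flight += len(friend_set)
        peak_ids_in_flight = max(peak_ids_in_flight, in_flight)
        # Receivers intersect now, then the messages can be discarded.
        for v, msgs in inbox.items():
            for sender, their in msgs:
                result[v][sender] = their & friends[v]
    return result, peak_ids_in_flight


graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
groups = {"A": "A", "B": "A", "C": "B", "D": "B"}
result, peak = mutual_friends_split(graph, groups)
print(peak)
```

On this toy graph the unsplit exchange would hold 18 ids at once, while the split version peaks at 6. The results are the same, at the cost of four passes instead of one.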
28. Giraph in production
Over 1.5 years in production
Over 100 jobs processed a week
30+ applications in our internal application repository
Sample production job - 700B+ edges
32. Future work
Automatic checkpointing to make scheduling more fair
Investigate alternative computing models
• Giraph++ (IBM Research)
• Giraphx (University at Buffalo, SUNY)
Performance
Lower the barrier to entry
Applications