Graph500 and Green Graph500 Benchmarks on SGI UV2000
1. Graph500 and Green Graph500 Benchmarks on SGI UV2000
*Yuichiro Yasui & Katsuki Fujisawa
Kyushu University and JST CREST
SGI User Group Conference
Nov. 17, 2014
2. Outline
1. Graph processing for large-scale networks
2. Graph500 & Green Graph500 benchmarks
3. Our NUMA-optimized BFS algorithm
4. Numerical results on SGI UV 2000
3. Graph processing for large-scale networks
• Large-scale graphs arise in various fields
– US road network: 58 million edges
– Twitter follow-ship: 1.47 billion edges
– Neuronal network: 100 trillion edges
• Fast and scalable graph processing by using HPC
[Figure: example networks by scale. US road network: 24 million vertices, 58 million edges. Twitter social network: 61.6 million vertices, 1.47 billion edges. Cyber-security: 15 billion log entries per day. Neuronal network (Human Brain Project): 89 billion vertices, 100 trillion edges.]
4. Graph analysis and important kernel: BFS
• The cycle of graph analysis for understanding real networks
[Figure: the analysis cycle. Step 1: graph construction (input parameters SCALE and edgefactor; graph generation, then graph construction). Step 2: graph processing (BFS and validation, 64 iterations; results: BFS time, traversed edges, TEPS ratio). Step 3: understanding relationships in the application field.]
• Application fields: transportation, social networks, cyber-security, bioinformatics
• Representative kernels:
– concurrent search (breadth-first search)
– optimization (single-source shortest path)
– edge-oriented (maximal independent set)
5. Graph analysis and important kernel: BFS
(The analysis cycle and representative kernels are as on the previous slide.)
Breadth-first search (BFS)
• One of the most important and fundamental graph kernels
• Many algorithms and applications are built on it (e.g., max-flow and centrality)
• Low arithmetic intensity and irregular memory accesses
• Inputs: a graph and a source vertex
• Outputs: the distance (level) and the predecessor of each vertex from the source
[Figure: a BFS from the source vertex, expanding through levels 1, 2, and 3.]
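To make these inputs and outputs concrete, the following is a minimal sequential sketch of the kernel in C++ (our own illustration, not the benchmark code): it takes an adjacency-list graph and a source vertex and fills in the level and predecessor of every reachable vertex.

    // Minimal sequential BFS: level (distance) and predecessor from a source.
    #include <queue>
    #include <vector>

    void bfs(const std::vector<std::vector<int>>& adj, int source,
             std::vector<int>& level, std::vector<int>& pred) {
      level.assign(adj.size(), -1);   // -1 marks unvisited vertices
      pred.assign(adj.size(), -1);
      std::queue<int> q;
      level[source] = 0;
      pred[source] = source;          // the source is its own predecessor
      q.push(source);
      while (!q.empty()) {
        const int v = q.front(); q.pop();
        for (int w : adj[v]) {
          if (level[w] == -1) {       // first visit fixes level and predecessor
            level[w] = level[v] + 1;
            pred[w] = v;
            q.push(w);
          }
        }
      }
    }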
6. Twitter follow-ship network
Twitter2009
• follow-ship network
– #Users (#vertices) 41,652,230
– Follow-ships (#edges) 2,405,026,092
BFS result from User 21,804,357
Lv. #users ratio (%) percentile (%)
0 1 0.00 0.00
1 7 0.00 0.00
2 6,188 0.01 0.01
3 510,515 1.23 1.24
4 29,526,508 70.89 72.13
5 11,314,238 27.16 99.29
6 282,456 0.68 99.97
7 11,536 0.03 100.00
8 673 0.00 100.00
9 68 0.00 100.00
10 19 0.00 100.00
11 10 0.00 100.00
12 5 0.00 100.00
13 2 0.00 100.00
14 2 0.00 100.00
15 2 0.00 100.00
Total 41,652,230 100.00 -
(This network excludes unconnected users.)
The six degrees of separation: over 99% of users are reached within five hops.
Our algorithm computes a BFS on this network in only 60 ms.
7. Betweenness centrality (BC)
• Definition

    C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}

where σ_st is the number of shortest (s, t)-paths and σ_st(v) is the number of shortest (s, t)-paths passing through vertex v.
[Figure: Osaka road network, 13,076 vertices and 40,528 edges.]
• BC identifies important vertices and edges without using coordinates
– a high-scoring vertex or edge is an important place (e.g., a highway or a bridge)
• BC requires the all-to-all shortest paths
– one BFS gives the one-to-all shortest paths
– <#vertices> BFS runs give all-to-all
=> 13,076 BFS computations for the Osaka network (see the sketch below)
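The standard way to organize these per-source computations is Brandes' algorithm, which runs one BFS per source and accumulates each vertex's dependency. The sketch below is a sequential illustration for unweighted graphs (not our optimized code) and computes C_B(v) exactly as defined above.

    // Brandes' betweenness centrality for an unweighted graph:
    // one BFS per source, i.e., |V| BFS computations in total.
    #include <queue>
    #include <stack>
    #include <vector>

    std::vector<double> betweenness(const std::vector<std::vector<int>>& adj) {
      const int n = (int)adj.size();
      std::vector<double> cb(n, 0.0);
      for (int s = 0; s < n; ++s) {
        std::vector<std::vector<int>> pred(n);  // shortest-path predecessors
        std::vector<double> sigma(n, 0.0);      // sigma_st: # of shortest paths
        std::vector<int> dist(n, -1);
        std::stack<int> order;                  // vertices in BFS visit order
        std::queue<int> q;
        sigma[s] = 1.0; dist[s] = 0; q.push(s);
        while (!q.empty()) {                    // one-to-all BFS from s
          const int v = q.front(); q.pop();
          order.push(v);
          for (int w : adj[v]) {
            if (dist[w] < 0) { dist[w] = dist[v] + 1; q.push(w); }
            if (dist[w] == dist[v] + 1) {       // v lies on a shortest path to w
              sigma[w] += sigma[v];
              pred[w].push_back(v);
            }
          }
        }
        std::vector<double> delta(n, 0.0);      // dependency accumulation
        while (!order.empty()) {
          const int w = order.top(); order.pop();
          for (int v : pred[w])
            delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w]);
          if (w != s) cb[w] += delta[w];
        }
      }
      return cb;
    }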
8. Graph500 Benchmark
www.graph500.org
• Measures the performance of irregular memory accesses
• TEPS score (# of traversed edges per second) in a BFS
• Input parameters for the problem size: SCALE & edgefactor (= 16)
[Figure: benchmark flow. 1. Generation (SCALE, edgefactor). 2. Construction. 3. BFS x 64: each of the 64 iterations runs a BFS followed by a validation and records the BFS time, traversed edges, and TEPS ratio; the median TEPS is reported.]
• Generates a synthetic scale-free network with 2^SCALE vertices and 2^SCALE x edgefactor edges by applying the recursive Kronecker product SCALE times
[Figure: Kronecker graph construction, G1 -> G2 -> G3 -> G4.]
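The generator can be pictured as recursively choosing one of the four quadrants of the adjacency matrix, one choice per SCALE bit. Below is a simplified R-MAT-style sketch using the Graph500 initiator probabilities (A, B, C, D) = (0.57, 0.19, 0.19, 0.05); the official generator additionally shuffles vertex labels and the edge list, which this illustration omits.

    // Simplified Kronecker (R-MAT-style) edge generator: for each edge, one
    // quadrant choice per bit of SCALE selects one row bit and one column bit.
    #include <cstdint>
    #include <random>
    #include <utility>
    #include <vector>

    std::vector<std::pair<int64_t, int64_t>>
    kronecker_edges(int scale, int edgefactor, uint64_t seed) {
      const int64_t n = int64_t(1) << scale;      // 2^SCALE vertices
      const int64_t m = n * edgefactor;           // 2^SCALE x edgefactor edges
      const double A = 0.57, B = 0.19, C = 0.19;  // D = 1 - A - B - C = 0.05
      std::mt19937_64 rng(seed);
      std::uniform_real_distribution<double> uni(0.0, 1.0);
      std::vector<std::pair<int64_t, int64_t>> edges(m);
      for (int64_t e = 0; e < m; ++e) {
        int64_t u = 0, v = 0;
        for (int b = 0; b < scale; ++b) {
          const double r = uni(rng);
          const int i = r >= A + B;                                // quadrant C or D
          const int j = (r >= A && r < A + B) || (r >= A + B + C); // quadrant B or D
          u = (u << 1) | i;
          v = (v << 1) | j;
        }
        edges[e] = {u, v};
      }
      return edges;
    }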
9. Green Graph500 Benchmark
http://green.graph500.org
• Measures power efficiency using the TEPS/W score
• Our results cover various systems such as the SGI UV series, Xeon servers, and Android devices
[Figure: benchmark flow, as in Graph500 (1. Generation with SCALE and edgefactor, 2. Construction, 3. BFS phase x 64 with validation and TEPS ratio). Power is measured in watts during the BFS phase; the Graph500 score is the median TEPS, and the Green Graph500 score is TEPS/W.]
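As a worked example of the metric: TEPS/W is the TEPS score divided by the average power measured during the BFS phase. Our single-server numbers on slide 30 (29.03 GTEPS and 45.43 MTEPS/W) therefore imply an average draw of roughly

    29,030 MTEPS / 45.43 MTEPS/W ≈ 639 W

during the measured run (the wattage is derived here for illustration, not reported separately).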
15. Problem and Our motivation
• Can UV2000 achieve high performance without MPI?
[Figure: parallelization regimes by core count (1, 4, 32, 640, 1280, ..., 512K). The single-server UV2000 sits in the "thread > MPI" regime; the K computer sits in the "thread << MPI" regime.]
16. Problem and Our motivation
• Can UV2000 achieve high performance without MPI?
[Figure: the same core-count axis. "Thread ≈ MPI" at small core counts, "thread > MPI" on the single-server UV2000 (up to 1,280 cores), "thread << MPI" on the K computer (512K cores).]
• Exploiting the algorithm on NUMA and cc-NUMA systems
– automatic processor topology detection
[Figure: UV2000 topology. Node = 2 sockets (each CPU + RAM), Cube = 8 nodes, Rack = 32 nodes.]
18. Level-synchronized parallel BFS (Top-down)
• Starts from the source vertex and executes the following two phases at each level:
– Traversal finds the neighbors QN (the unvisited adjacent vertices) from the current frontier QF
– Swap exchanges the frontier QF and the neighbors QN for the next level (level k to level k+1)
• The input of a BFS is a graph G = (V, E) with a set of vertices V and a set of edges E; the set of edges corresponds to a set of adjacency lists, where an adjacency list A(v) contains the edges (v, w) ∈ E for each vertex v ∈ V. A BFS builds a tree spanning all vertices reachable from the source vertex s ∈ V and outputs the predecessor map π, which maps each vertex to its parent.
• The benchmark iterates the timed BFS-phase and the untimed verify-phase 64 times: the BFS-phase runs a BFS for each source, and the verify-phase checks the output of the BFS against the given graph. A submission must report five TEPS ratios: the minimum, first quartile, median, third quartile, and maximum.

Algorithm 1: Level-synchronized Parallel BFS
  Input    : G = (V, A) : unweighted directed graph; s : source vertex
  Variables: QF : frontier queue; QN : neighbor queue; visited : vertices already visited
  Output   : π(v) : predecessor map of the BFS tree

  π(v) ← −1, ∀v ∈ V
  π(s) ← s
  visited ← {s}
  QF ← {s}
  QN ← ∅
  while QF ≠ ∅ do
      for v ∈ QF in parallel do
          for w ∈ A(v) do
              if w ∉ visited (atomic) then
                  π(w) ← v
                  visited ← visited ∪ {w}
                  QN ← QN ∪ {w}
      QF ← QN; QN ← ∅
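A direct C++/OpenMP rendering of Algorithm 1 might look as follows (a minimal sketch over an assumed CSR graph type, not our NUMA-optimized implementation): visited becomes a per-vertex atomic flag, and each thread collects its new neighbors locally before merging them into QN.

    // Level-synchronized top-down BFS (Algorithm 1) with OpenMP.
    #include <atomic>
    #include <cstdint>
    #include <vector>
    #include <omp.h>

    // Graph in CSR form: neighbors of v are adj[offset[v] .. offset[v+1]).
    struct Graph {
      std::vector<int64_t> offset;
      std::vector<int64_t> adj;
    };

    std::vector<int64_t> bfs_topdown(const Graph& g, int64_t source) {
      const int64_t n = (int64_t)g.offset.size() - 1;
      std::vector<int64_t> pi(n, -1);             // pi(v) = -1 : unvisited
      std::vector<std::atomic<char>> visited(n);
      for (auto& f : visited) f.store(0);
      std::vector<int64_t> QF{source}, QN;        // frontier, neighbor queues
      pi[source] = source;
      visited[source].store(1);
      while (!QF.empty()) {
        #pragma omp parallel
        {
          std::vector<int64_t> local;             // per-thread neighbor buffer
          #pragma omp for nowait
          for (size_t i = 0; i < QF.size(); ++i) {
            const int64_t v = QF[i];
            for (int64_t e = g.offset[v]; e < g.offset[v + 1]; ++e) {
              const int64_t w = g.adj[e];
              char expected = 0;                  // atomic test-and-set of visited(w)
              if (visited[w].compare_exchange_strong(expected, 1)) {
                pi[w] = v;                        // π(w) ← v
                local.push_back(w);
              }
            }
          }
          #pragma omp critical                    // QN ← QN ∪ local
          QN.insert(QN.end(), local.begin(), local.end());
        }
        QF.swap(QN);                              // swap frontier and neighbors
        QN.clear();
      }
      return pi;
    }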
19. Direction-optimizing BFS (Beamer 2012 @ SC2012)
• Chooses between Top-down and Bottom-up at each level
• Observation of the data accesses in the forward (top-down) and backward (bottom-up) searches:

Top-down algorithm
• Efficient for a small frontier
• Uses outgoing edges
• Writes data during the forward search
[Figure: a frontier vertex v at level k follows an outgoing edge v → w to an unvisited neighbor w at level k+1.]

  Input : Directed graph G = (V, AF), Queue QF
  Data  : Queue QN, visited, Tree π(v)

  QN ← ∅
  for v ∈ QF in parallel do
      for w ∈ AF(v) do
          if w ∉ visited (atomic) then
              π(w) ← v
              visited ← visited ∪ {w}
              QN ← QN ∪ {w}
  QF ← QN

Bottom-up algorithm
• Efficient for a large frontier
• Uses incoming edges
• Writes data during the backward search
• Skips unnecessary edge traversals: each candidate stops as soon as a parent in the current frontier is found
[Figure: an unvisited vertex w at level k+1 scans its incoming edges for a parent v in the current frontier at level k.]

  Input : Directed graph G = (V, AB), Queue QF
  Data  : Queue QN, visited, Tree π(v)

  QN ← ∅
  for w ∈ V \ visited in parallel do
      for v ∈ AB(w) do
          if v ∈ QF then
              π(w) ← v
              visited ← visited ∪ {w}
              QN ← QN ∪ {w}
              break
  QF ← QN

• Both variants write only to the w-indexed variables π(w) and visited (v is referenced only as a vertex index)
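For contrast with the top-down sketch above, here is one bottom-up level in the same style (an illustration with byte-map frontiers, not our implementation). Note that no atomics are needed: each unvisited vertex w is examined by exactly one thread, matching the observation that all writes go to the w-indexed π(w) and visited.

    // One bottom-up BFS level: every unvisited vertex scans its incoming
    // edges and stops at the first parent found in the current frontier.
    #include <cstdint>
    #include <vector>

    void bottomup_step(const std::vector<int64_t>& offset, // CSR over incoming edges
                       const std::vector<int64_t>& adj,
                       const std::vector<char>& in_frontier,  // level k bitmap
                       std::vector<char>& next_frontier,      // level k+1 bitmap
                       std::vector<char>& visited,
                       std::vector<int64_t>& pi) {
      const int64_t n = (int64_t)offset.size() - 1;
      #pragma omp parallel for schedule(dynamic, 1024)
      for (int64_t w = 0; w < n; ++w) {
        if (visited[w]) continue;            // only unvisited vertices search
        for (int64_t e = offset[w]; e < offset[w + 1]; ++e) {
          const int64_t v = adj[e];
          if (in_frontier[v]) {              // parent found in current frontier
            pi[w] = v;
            visited[w] = 1;                  // only this thread touches w's slots
            next_frontier[w] = 1;
            break;                           // skip the remaining incoming edges
          }
        }
      }
    }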
21. NUMA-optimized Dir. Opt. BFS
• Manages memory accesses on a NUMA system
– Each NUMA node contains a CPU socket and its local memory
[Chart: GTEPS (0 to 50) of successive implementations:]
  Reference (SC10)                                   87 MTEPS  (x1)
  NUMA-aware (2011)                                 800 MTEPS  (x9)
  Dir.Opt. (SC12)                                     5 GTEPS  (x58)
  NUMA-Opt. (BigData13)                              11 GTEPS  (x125)
  NUMA-Opt. + Deg.aware (ISC14)                      29 GTEPS  (x334)
  NUMA-Opt. + Deg.aware + Vtx.Sort (G500, ISC14)     42 GTEPS  (x489)
System configuration: Intel Xeon CPUs, 4 sockets, 32 or 40 cores, 256 GB or 512 GB RAM
22. NUMA-optimized Dir. Opt. BFS (cont.)
• Our results add NUMA-aware data placement to the progression above
• Partitioning: the adjacency matrix is split into column blocks 0 to 3
• Binding on NUMA: each block is bound to one NUMA node, an 8-core Xeon E5-4640 socket with per-core L2 caches, a shared L3 cache, and local RAM (blocks 0th to 3rd map to sockets 0 to 3)
(Chart and system configuration as on the previous slide.)
23. NUMA architecture
• 4-way Intel Xeon E5-4640 (Sandy Bridge-EP)
– 4 CPU sockets
– 8 physical cores per socket
– 2 threads per core
– max. 4 x 8 x 2 = 64 threads
• NUMA node = a CPU socket (16 logical cores) + its local RAM
[Figure: a NUMA node, an 8-core Xeon E5-4640 with per-core L2 caches, a shared L3 cache, and local RAM.]
• Memory accesses to local RAM are fast; accesses to remote RAM are slow
• NUMA-aware (optimized) computation reduces and avoids memory accesses to remote RAM
24. Flow of affinities using ULIBC
ULIBC: Ubiquity Library for Intelligently Binding Cores (our library)
– provides APIs for utilizing the processor topology easily
25. Flow of affinities using ULIBC
1. Detects the entire topology:
   CPU 0: P0, P4, P8, P12
   CPU 1: P1, P5, P9, P13
   CPU 2: P2, P6, P10, P14
   CPU 3: P3, P7, P11, P15
26. Flow of affinities using ULIBC
1. Detects the entire topology (CPUs 0 to 3 as above; CPUs 0 and 3 are in use by other processes)
2. Detects the online (available) topology, restricted by a job manager (PBS) or by
   numactl --cpunodebind=1,2
   CPU 1: P1, P5, P9, P13
   CPU 2: P2, P6, P10, P14
27. Flow of affinities using ULIBC
1. Detects the entire topology
2. Detects the online (available) topology (CPU 1: P1, P5, P9, P13 and CPU 2: P2, P6, P10, P14, selected via a job manager (PBS) or numactl --cpunodebind=1,2)
3. Constructs the ULIBC affinity:
   ULIBC_set_affinity_policy(7, SCATTER_MAPPING, THREAD_TO_CORE)
   requests 7 threads with a scatter-type mapping, binding each thread to one logical core:
   NUMA 0: threads 0 (P1), 2 (P5), 4 (P9), 6 (P13)
   NUMA 1: threads 1 (P2), 3 (P6), 5 (P10)
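ULIBC itself is our library; as a stand-in, the sketch below shows what a scatter-type THREAD_TO_CORE mapping boils down to, using the standard Linux affinity call pthread_setaffinity_np and the online cores of the example above (P1, P5, P9, P13 on NUMA 0 and P2, P6, P10 on NUMA 1).

    // Illustration only (Linux, g++ -fopenmp): binds 7 OpenMP threads to
    // logical cores in scatter order, alternating between the two NUMA nodes.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <omp.h>
    #include <cstdio>

    int main() {
      const int scatter[7] = {1, 2, 5, 6, 9, 10, 13};  // thread i -> core scatter[i]
      omp_set_num_threads(7);
      #pragma omp parallel
      {
        const int tid = omp_get_thread_num();
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(scatter[tid], &set);            // bind this thread to one core
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        std::printf("thread %d -> P%d\n", tid, scatter[tid]);
      }
      return 0;
    }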
28. NUMA-optimized BFS
• 1-D column-wise partitioning of the adjacency matrix
[Figure: the adjacency matrix split into column blocks 0 to 3, each bound to one NUMA node (an 8-core Xeon E5-4640 with shared L3 cache and local RAM); inner-NUMA-node vs. inter-NUMA-node binding.]
• Local traversal and all-to-all communication for each level:
– Edge traversal on local RAM: each NUMA node searches for unvisited vertices from its duplicated frontier
– All-gathering of the next frontier: the duplicated frontiers are constructed from the partial local neighbors
[Figure: per-NUMA-node In/Out frontier queues 0 to 3 exchanged across the nodes.]
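Schematically, one level of this scheme can be written as follows (a compact illustration with assumed per-node CSR blocks and byte-map frontiers; the real implementation binds one thread team per NUMA node via ULIBC and uses bitmaps). The owned-range scan is the local traversal, and the copy loop at the end is the all-gather of the next frontier.

    // One BFS level under 1-D column-wise partitioning across NUMA nodes.
    // Each node owns the vertex range [lo, hi), a CSR block over the incoming
    // edges of those vertices, a duplicated frontier covering all vertices,
    // and a partial next frontier for its own range.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct NumaPart {
      int64_t lo, hi;                    // owned vertex range [lo, hi)
      std::vector<int64_t> offset, adj;  // local CSR over incoming edges
      std::vector<char> frontier;        // duplicated frontier (all vertices)
      std::vector<char> next_local;      // partial next frontier (owned range)
    };

    // Returns true if any vertex joined the next frontier.
    bool bfs_level(std::vector<NumaPart>& parts, std::vector<char>& visited,
                   std::vector<int64_t>& pi) {
      bool any = false;
      // Local traversal: reads of offset/adj/frontier stay in local RAM.
      #pragma omp parallel for reduction(||:any)
      for (size_t k = 0; k < parts.size(); ++k) {
        NumaPart& p = parts[k];
        for (int64_t w = p.lo; w < p.hi; ++w) {  // bottom-up over owned vertices
          if (visited[w]) continue;              // w is owned by one node: no races
          for (int64_t e = p.offset[w - p.lo]; e < p.offset[w - p.lo + 1]; ++e) {
            if (p.frontier[p.adj[e]]) {          // parent in duplicated frontier
              pi[w] = p.adj[e];
              visited[w] = 1;
              p.next_local[w - p.lo] = 1;
              any = true;
              break;                             // skip remaining incoming edges
            }
          }
        }
      }
      // All-gather: rebuild every node's duplicated frontier from the partials.
      for (auto& dst : parts)
        for (auto& src : parts)
          std::copy(src.next_local.begin(), src.next_local.end(),
                    dst.frontier.begin() + src.lo);
      for (auto& p : parts)
        std::fill(p.next_local.begin(), p.next_local.end(), 0);
      return any;
    }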
29. Degree-aware + NUMA-opt. + Dir. Opt. BFS
• Manages memory accesses on a NUMA system (chart and system configuration as on slide 21; the degree-aware variants reach 29 and 42 GTEPS)
• Our degree-aware techniques (see the sketch below):
1. Deleting isolated vertices
2. Sorting the adjacency lists A(v) by vertex degree
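A minimal sketch of technique 2 (our own illustration over an assumed CSR layout): each adjacency list is sorted by endpoint degree in descending order, so the bottom-up search tends to probe high-degree vertices first, and those are the most likely to already be in the frontier. Deleting isolated vertices (technique 1) is a relabeling pass that drops degree-0 vertices before this step.

    // Sort each adjacency list by the degree of the endpoint (descending).
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    void sort_adjacency_by_degree(const std::vector<int64_t>& offset,
                                  std::vector<int64_t>& adj) {
      const int64_t n = (int64_t)offset.size() - 1;
      std::vector<int64_t> degree(n);
      for (int64_t v = 0; v < n; ++v)
        degree[v] = offset[v + 1] - offset[v];
      for (int64_t v = 0; v < n; ++v)
        std::sort(adj.begin() + offset[v], adj.begin() + offset[v + 1],
                  [&](int64_t a, int64_t b) { return degree[a] > degree[b]; });
    }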
30. TEPS and TEPS/W on a single server
• Strong scaling for SCALE 27 on a 4-way Sandy Bridge-EP Xeon
[Figure: relative GTEPS (for Graph500) and relative MTEPS/W (for Green Graph500) vs. the number of threads; both axes log-scale from 1 to 64.]
• Peak results: 29.03 GTEPS and 45.43 MTEPS/W
• Relative improvements: x27.9 (GTEPS) and x12.6 (MTEPS/W)
31. SGI UV 2000 system
• Shared-memory supercomputer
– handles a large memory space using thread parallelism
– C/C++ with OpenMP/Pthreads (without MPI communication)
– cc-NUMA architecture based on Intel Xeon
• ISM has two full-spec UV 2000 systems
– 4 UV 2000 racks
– up to 2,560 cores and 64 TB of memory
• ISM, SGI, and our group collaborate on Graph500
– achieved the fastest single-node result in the current list
The Institute of Statistical Mathematics (ISM)
• Japan's national research institute for statistical science
[Photo: UV2000 racks, the #1 system and #2 system.]
32. SGI UV 2000 configuration
• UV2000 has a complex hardware topology
– Socket, Node, Cube, inner-rack, and inter-rack levels
[Figure: Node = 2 sockets; Cube = 8 nodes; Rack = 32 nodes (4 cubes), connected by NUMAlink at 6.7 GB/s.]
• We used NUMA-based flat parallelization
– Each NUMA node contains one "Xeon CPU E5-2470 v2" and "256 GB RAM"
[Figure: Node = 2 NUMA nodes (20 cores, 512 GB); Cube = 16 NUMA nodes (160 cores, 4 TB); Rack = 64 NUMA nodes (640 cores, 16 TB).]
34. The Graph500 List in June 2014
http://www.graph500.org
• Measures performance using TEPS (# of traversed edges per second) in graph traversal such as BFS
[Figure: the top of the June 2014 list, with entries labeled as distributed memory or shared memory. Our entries are the fastest single-node and the fastest single-server systems, alongside the fastest multi-node (distributed-memory) systems.]
35. The Green Graph500 List in June 2014
http://green.graph500.org
• Measures power efficiency using TEPS/W
• Big Data category (SCALE ≥ 30); Small Data category (SCALE ≤ 29)
[Certificates: George Washington University's Colonial is ranked No. 1 in the Small Data category with 445.92 MTEPS/W at SCALE 20; Kyushu University's GraphCREST-SandybridgeEP-2.4GHz is ranked No. 1 in the Big Data category with 59.12 MTEPS/W at SCALE 30. Both appear on the third Green Graph 500 list, published at the International Supercomputing Conference, June 23, 2014, with congratulations from the Green Graph 500 Chair.]
[Chart labels: our UV2000 and 4-way Xeon server entries, the SONY Xperia-Z1-SO-01F, and TSUBAME-KFC.]
36. Weak scaling on UV2000
[Figure: weak scaling on UV 2000. GTEPS (0 to 200) vs. SCALE 26 to 34, where ℓ = #sockets doubles with SCALE (SCALE 26: ℓ = 1; 27: ℓ = 2; 28: ℓ = 4; 29: ℓ = 8; 30: ℓ = 16; 31: ℓ = 32; 32: ℓ = 64; 33: ℓ = 128, two racks; 34: ℓ = 256). Inner-rack communication applies up to one rack; inter-rack communication beyond. The June 2014 curve peaks at 131 GTEPS, the fastest single-node entry in the June Graph500 list; the Nov. 2014 curve reaches a new result of 174 GTEPS at SCALE 33 on two racks.]
37. Conclusion
• UV 2000 with NUMA-based thread parallelization
– scalable for irregular memory-access computation
• Graph500/Green Graph500 on UV 2000
– 131 GTEPS with 640 threads
– the fastest of the single-node entries
– the most power-efficient among commercial supercomputers
– new result: 174 GTEPS for SCALE 33 with 1,280 threads
SGI UV2000: 64 CPUs and 16 TB of RAM per rack
• ULIBC will be available at https://bitbucket.org/yuichiro_yasui/ulibc