Graph500 and Green Graph500 
benchmarks on SGI UV2000 
*Yuichiro Yasui & Katsuki Fujisawa 
Kyushu University and JST CREST 
SGI User group conference 
Nov. 17, 2014
Outline 
1. Graph processing for large-scale networks 
2. Graph500 & Green Graph500 benchmarks 
3. Our NUMA-optimized BFS algorithm 
4. Numerical results on SGI UV 2000 
Graph processing for large-scale networks
• Large-scale graphs in various fields
– US road network: 24 million vertices & 58 million edges
– Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges
– Neuronal network @ Human Brain Project: 89 billion vertices & 100 trillion edges
– Cyber-security: 15 billion log entries / day
• Fast and scalable graph processing by using HPC
Graph analysis and important kernel BFS
• The cycle of graph analysis for understanding real networks
– Step 1: constructing a graph from an application field (transportation, social network, cyber-security, bioinformatics)
– Step 2: graph processing, which yields relationships
– Step 3: understanding, which feeds back into the application field
• Kernels for graph processing
– concurrent search (breadth-first search)
– optimization (single source shortest path)
– edge-oriented (maximal independent set)
Breadth-first search (BFS)
• One of the most important and fundamental graph processing kernels
• Many algorithms and applications are based on it (e.g., max-flow and centrality)
• Low arithmetic intensity & irregular memory accesses
• Inputs: graph and source vertex
• Outputs: distance (Lv.) from the source and predecessor for each vertex
[Figure: BFS tree rooted at the source, with levels Lv. 1, Lv. 2, and Lv. 3]
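The inputs and outputs listed above can be illustrated with a minimal sequential sketch. It is not the authors' implementation; the function name `bfs` and the toy graph are hypothetical, and the graph is assumed to be given as a dictionary of adjacency lists.

```python
from collections import deque

def bfs(adj, source):
    """Return (level, pred) for every vertex reachable from source."""
    level = {source: 0}
    pred = {source: source}          # the source is its own predecessor
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in level:       # first visit fixes level and parent
                level[w] = level[v] + 1
                pred[w] = v
                queue.append(w)
    return level, pred

# Hypothetical toy graph, not taken from the slides
adj = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
level, pred = bfs(adj, 0)
```

Here `level` plays the role of the distance (Lv.) output and `pred` the predecessor output.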
Twitter follow-ship network 
Twitter2009 
• follow-ship network 
– #Users (#vertices) 41,652,230 
– Follow-ships (#edges) 2,405,026,092 
BFS result from User 21,804,357 
Lv.     #users        ratio (%)   cumulative (%)
0       1             0.00        0.00
1       7             0.00        0.00
2       6,188         0.01        0.01
3       510,515       1.23        1.24
4       29,526,508    70.89       72.13
5       11,314,238    27.16       99.29
6       282,456       0.68        99.97
7       11,536        0.03        100.00
8       673           0.00        100.00
9       68            0.00        100.00
10      19            0.00        100.00
11      10            0.00        100.00
12      5             0.00        100.00
13      2             0.00        100.00
14      2             0.00        100.00
15      2             0.00        100.00
Total   41,652,230    100.00      -
This network excludes unconnected users.
• The result illustrates the six degrees of separation
• Our algorithm computes a BFS in only 60 ms
Betweenness centrality (BC) 
• Definition
$$C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$$
– $\sigma_{st}$ : number of shortest (s, t)-paths
– $\sigma_{st}(v)$ : number of shortest (s, t)-paths passing through vertex v
Osaka road network
13,076 vertices and 40,528 edges
• BC measures important vertices and edges without coordinates
– High-score vertex/edge = important place (e.g., highway, bridge)
• BC requires the all-to-all shortest paths
– BFS => one-to-all
– <#vertices> times BFS => all-to-all
=> 13,076 times BFS computations
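The "one BFS per vertex" structure of betweenness centrality can be sketched as follows. The slides do not say how the path counts are accumulated; this sketch uses Brandes' dependency accumulation, which is a swapped-in technique and an assumption here. It sums over ordered pairs (s, t), and the function name `betweenness` and the toy path graph are hypothetical.

```python
from collections import deque

def betweenness(adj):
    """Brandes-style BC for an unweighted graph given as {v: [neighbors]}."""
    cb = {v: 0.0 for v in adj}
    for s in adj:                          # one BFS per source vertex
        sigma = {v: 0 for v in adj}        # number of shortest s-v paths
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:            # w discovered for the first time
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1: # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):          # farthest vertices first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                cb[w] += delta[w]
    return cb

# Hypothetical undirected path graph 0 - 1 - 2 - 3 - 4
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = betweenness(path)
```

On the path graph the middle vertex receives the highest score, which matches the intuition that bridges and highways score high.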
Graph500 Benchmark
www.graph500.org
• Measures the performance of irregular memory accesses
• TEPS score (# of traversed edges per second) in a BFS
• Input parameters for problem size: SCALE & edgefactor (= 16)
• Benchmark flow
1. Generation: takes SCALE and edgefactor as input parameters
2. Construction: builds the graph
3. BFS × 64 and Validation × 64: each iteration reports BFS time, traversed edges, and TEPS
– Result: median TEPS ratio over the 64 iterations
• Generates a synthetic scale-free network with 2^SCALE vertices and 2^SCALE × edgefactor edges by applying the recursive Kronecker product SCALE times
[Figure: Kronecker graphs G1, G2, G3, G4]
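The generation step can be sketched as an R-MAT style sampler, in which each of the SCALE recursion levels picks one quadrant of the adjacency matrix. This is a simplified illustration, not the reference generator: the initiator probabilities (0.57, 0.19, 0.19, 0.05) are the values commonly quoted for Graph500 and should be treated as an assumption, the function name `kronecker_edges` is hypothetical, and the vertex permutation done by real generators is omitted.

```python
import random

def kronecker_edges(scale, edgefactor=16, a=0.57, b=0.19, c=0.19, seed=1):
    """Sample edgefactor * 2**scale edges of an R-MAT / Kronecker graph."""
    rng = random.Random(seed)
    n = 1 << scale                        # 2^SCALE vertices
    edges = []
    for _ in range(edgefactor * n):       # 2^SCALE x edgefactor edges
        u = v = 0
        for _ in range(scale):            # one quadrant choice per bit
            r = rng.random()
            u <<= 1
            v <<= 1
            if r < a:
                pass                      # top-left quadrant
            elif r < a + b:
                v |= 1                    # top-right quadrant
            elif r < a + b + c:
                u |= 1                    # bottom-left quadrant
            else:
                u |= 1                    # bottom-right quadrant
                v |= 1
        edges.append((u, v))
    return n, edges

n, edges = kronecker_edges(scale=10)
```

Because the top-left quadrant is favoured at every level, low-numbered vertices collect many more edges than the rest, which gives the skewed degree distribution of a scale-free network.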
Green Graph500 Benchmark
http://green.graph500.org
• Measures power efficiency using the TEPS/W score
• Our results on various systems such as SGI UV series, Xeon servers, and Android devices
• Benchmark flow (same as Graph500, plus power measurement)
1. Generation: takes SCALE and edgefactor as input parameters
2. Construction
3. BFS phase × 64 with validation: each iteration reports BFS time, traversed edges, and TEPS
– Graph500: median TEPS ratio
– Green Graph500: power measurement (Watt) gives TEPS/W
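The TEPS/W arithmetic can be sketched in a few lines. This assumes the score is the median TEPS divided by the average power in watts, which is how the flow above reads; the function names and the sample timings are hypothetical.

```python
from statistics import median

def teps(traversed_edges, bfs_seconds):
    """Traversed edges per second for one BFS run."""
    return traversed_edges / bfs_seconds

def green_score(teps_samples, average_watts):
    """Median TEPS divided by average power, in TEPS/W."""
    return median(teps_samples) / average_watts

# Hypothetical measurements: three BFS runs (the benchmark uses 64)
samples = [teps(2**30, t) for t in (0.040, 0.050, 0.045)]
score = green_score(samples, average_watts=700.0)
```

For example, `green_score([131e9], 10530.0)` evaluates to roughly 12.44e6, that is, about 12.4 MTEPS/W for 131 GTEPS at 10.53 kW.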
Target networks
[Figure: log-log scatter plot of the target networks. x-axis: # of vertices, log2(n), from 15 to 45; y-axis: # of edges, log2(m), from 20 to 45; guide lines at 1 billion and 1 trillion. Plotted: Graph500 problem classes (Toy, Mini, Small, Medium, Large, Huge), USA-road-d.NY.gr, USA-road-d.LKS.gr, USA-road-d.USA.gr (US road network), cit-Patents, soc-LiveJournal1, twitter-rv (Twitter2009), and the Human Brain Project network (89 B vertices, 100 T edges).]
Target networks on Smartphone
[Figure: the target-network plot (# of vertices vs. # of edges, log scale) with the smartphone range marked]
• Smartphone (4 cores): Graph500 (SCALE20)
Target networks on Single-server
[Figure: the target-network plot (# of vertices vs. # of edges, log scale) with the smartphone and single-server ranges marked]
• Smartphone (4 cores): Graph500 (SCALE20)
• Single server, 4-way Intel Xeon (64 cores): Graph500 (SCALE29)
Target networks on UV2000
[Figure: the target-network plot (# of vertices vs. # of edges, log scale) with the smartphone, single-server, and UV 2000 ranges marked]
• Smartphone (4 cores): Graph500 (SCALE20)
• Single server, 4-way Intel Xeon (64 cores): Graph500 (SCALE29)
• UV2000 (1 rack, 640 cores): Graph500 (SCALE32)
Target networks on Supercomputer
[Figure: the target-network plot (# of vertices vs. # of edges, log scale) with the smartphone, single-server, UV 2000, and supercomputer (K and Sequoia) ranges marked]
• Smartphone (4 cores): Graph500 (SCALE20)
• Single server, 4-way Intel Xeon (64 cores): Graph500 (SCALE29)
• UV2000 (1 rack, 640 cores): Graph500 (SCALE32)
• BlueGene/Q (64K nodes) and K computer (64K nodes): Graph500 (SCALE40)
Problem and Our motivation
• Does UV2000 obtain high performance without MPI? YES
[Figure: thread-parallel vs. MPI performance by # of cores (axis: 1, 4, 32, 640, 1280, 512K). Single server: Thread > MPI; UV2000: Thread ≈ MPI; K computer: Thread << MPI]
• Exploiting algorithm on NUMA and cc-NUMA systems
– Automatic processor topology detection: Node = 2 sockets; Cube = 8 nodes; Rack = 32 nodes (4 cubes)
– Affinity configurations for running threads and allocating memory
[Figure: partitioning and binding. The adjacency matrix is split into parts 0 to 3, and each part is bound to one of four NUMA nodes (0th to 3rd), each an 8-core Xeon E5 4640 socket with per-core L2 caches, a shared L3 cache, and local RAM]
Level-synchronized parallel BFS (Top-down)
• Starts from the source vertex and executes the following two phases for each level
– Traversal finds neighbors QN (unvisited adjacent vertices) from the current frontier QF
– Swap exchanges the frontier QF and the neighbors QN for the next level
Algorithm 1: Level-synchronized Parallel BFS.
Input : G = (V, A) : unweighted directed graph.
        s : source vertex.
Variables: QF : frontier queue.
           QN : neighbor queue.
           visited : vertices already visited.
Output : π(v) : predecessor map of BFS tree.
π(v) ← −1, ∀v ∈ V
π(s) ← s
visited ← {s}
QF ← {s}
QN ← ∅
while QF ≠ ∅ do
    for v ∈ QF in parallel do            (Traversal)
        for w ∈ A(v) do
            if w ∉ visited atomic then
                π(w) ← v
                visited ← visited ∪ {w}
                QN ← QN ∪ {w}
    QF ← QN                              (Swap)
    QN ← ∅
[Figure: frontier at level k and neighbors at level k+1]
Direction-optimizing BFS
Chooses one from Top-down or Bottom-up (Beamer2012 @ SC2012)
Observation of data accesses in forward (top-down) and backward (bottom-up) search
Top-down algorithm
• Efficient for small frontier
• Uses outgoing edges v → w, from the current frontier (level k) to unvisited neighbors (level k+1)
Input : Directed graph G = (V, AF), Queue QF
Data : Queue QN, visited, Tree π(v)
QN ← ∅
for v ∈ QF in parallel do
    for w ∈ AF(v) do
        if w ∉ visited atomic then
            π(w) ← v
            visited ← visited ∪ {w}
            QN ← QN ∪ {w}
QF ← QN
Bottom-up algorithm
• Efficient for large frontier
• Uses incoming edges: each unvisited vertex w (candidate neighbor, level k+1) looks for a parent v in the current frontier (level k)
• Skips unnecessary edge traversal
Input : Directed graph G = (V, AB), Queue QF
Data : Queue QN, visited, Tree π(v)
QN ← ∅
for w ∈ V \ visited in parallel do
    for v ∈ AB(w) do
        if v ∈ QF then
            π(w) ← v
            visited ← visited ∪ {w}
            QN ← QN ∪ {w}
            break
QF ← QN
Data writes in forward and backward search
• Both algorithms write to the variables π(w) and visited, which are associated with w (v is only referenced as a vertex index)
Direction-optimizing BFS
Chooses one from Top-down or Bottom-up (Beamer2012 @ SC2012)
• Top-down for small frontier, Bottom-up for large frontier
# of traversed edges of Kronecker graph with SCALE 26 (|V| = 2^26, |E| = 2^30), forward search (Top-down) vs. backward search (Bottom-up)
Level   Top-down        Bottom-up       Hybrid
0       2               2,103,840,895   2
1       66,206          1,766,587,029   66,206
2       346,918,235     52,677,691      52,677,691
3       1,727,195,615   12,820,854      12,820,854
4       29,557,400      103,184         103,184
5       82,357          21,467          21,467
6       221             21,240          227
Total   2,103,820,036   3,936,072,360   65,689,631
Ratio   100.00%         187.09%         3.12%
(Level = distance from source; the Top-down total is ≈ |E|)
• Hybrid-BFS reduces unnecessary edge traversals
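How the per-level choice cuts edge traversals can be sketched with a small sequential emulation. This is not the authors' code and not Beamer's switching rule: the frontier-size `threshold` is a placeholder heuristic and an assumption, the graph is assumed undirected so one adjacency list serves both directions, and `hybrid_bfs` is a hypothetical name.

```python
def hybrid_bfs(adj, source, threshold=0.05):
    """Direction-optimizing BFS on an undirected graph {v: [neighbors]}.

    Returns (pred, scanned): the predecessor map and the number of
    adjacency entries scanned at each level.
    """
    n = len(adj)
    pred = {source: source}
    frontier = {source}
    scanned = []
    while frontier:
        nxt, count = set(), 0
        if len(frontier) < threshold * n:       # small frontier: top-down
            for v in frontier:
                for w in adj[v]:
                    count += 1
                    if w not in pred:
                        pred[w] = v
                        nxt.add(w)
        else:                                   # large frontier: bottom-up
            for w in adj:
                if w in pred:                   # already visited
                    continue
                for v in adj[w]:
                    count += 1
                    if v in frontier:
                        pred[w] = v
                        nxt.add(w)
                        break                   # skip the remaining edges
        frontier = nxt
        scanned.append(count)
    return pred, scanned

# Hypothetical star graph: hub 0 with five leaves
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
_, hybrid_counts = hybrid_bfs(star, 0, threshold=0.5)
_, topdown_counts = hybrid_bfs(star, 0, threshold=2.0)   # never switches
```

On the star graph the hybrid run scans no edges at the last level, while the pure top-down run rescans every leaf edge, which is a tiny version of the effect shown in the table.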
NUMA-optimized Dir. Opt. BFS
• Manages memory accesses on NUMA system
– Each NUMA node contains CPU socket and local memory
[Figure: bar chart of GTEPS (0 to 50) by implementation]
Implementation                        Label         TEPS     Speedup
Reference                             2011          87M      ×1
NUMA-aware                            SC10          800M     ×9
Dir. Opt.                             SC12          5G       ×58
NUMA-Opt.                             BigData13     11G      ×125
NUMA-Opt. + Deg.aware                 ISC14         29G      ×334
NUMA-Opt. + Deg.aware + Vtx.Sort      G500, ISC14   42G      ×489
System configuration
• CPU: Intel Xeon
• #sockets: 4
• #cores: 32 or 40
• RAM: 256GB or 512GB
NUMA-optimized Dir. Opt. BFS
• Our results: partitioning and binding on NUMA
– Partitioning: the adjacency matrix is split into parts 0 to 3
– Binding: each part is bound to one NUMA node (0th to 3rd), an 8-core Xeon E5 4640 socket with per-core L2 caches, a shared L3 cache, and local RAM
[Figure: the GTEPS chart and system configuration are repeated from the previous slide]
NUMA architecture
• 4-way Intel Xeon E5-4640 (Sandy Bridge-EP)
– 4 (# of CPU sockets)
– 8 (# of physical cores per socket)
– 2 (# of threads per core)
– Max. 4 x 8 x 2 = 64 threads
• NUMA node = CPU socket (16 logical cores) + local RAM
– Memory access to local RAM: fast
– Memory access to remote RAM: slow
• NUMA-aware (optimized) computation
– Reduces and avoids memory accesses to remote RAM
Flow of affinities using ULIBC
ULIBC : Ubiquity Library for Intelligently Binding Cores (our library)
– provides APIs for utilizing the processor topology easily
1. Detects entire topology
– CPU 0: P0, P4, P8, P12
– CPU 1: P1, P5, P9, P13
– CPU 2: P2, P6, P10, P14
– CPU 3: P3, P7, P11, P15
2. Detects online (available) topology
– Restricted by the job manager (PBS) or by numactl --cpunodebind=1,2
– Online cores: CPU 1 (P1, P5, P9, P13) and CPU 2 (P2, P6, P10, P14)
– The remaining CPUs 0 and 3 are left to other processes
3. Constructs ULIBC affinity
ULIBC_set_affinity_policy(7, SCATTER_MAPPING, THREAD_TO_CORE)
– 7: # of threads
– SCATTER_MAPPING: scatter-type mapping
– THREAD_TO_CORE: each thread is bound to one logical core
– NUMA 0 (cores and local RAM): threads 0(P1), 2(P5), 4(P9), 6(P13)
– NUMA 1 (cores and local RAM): threads 1(P2), 3(P6), 5(P10)
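The scatter-type placement in the example (7 threads over the two online CPUs) can be reproduced with a short emulation. It does not call ULIBC and says nothing about how the library works internally; `scatter_mapping` is a hypothetical helper that assigns thread ids round-robin over the online NUMA nodes.

```python
def scatter_mapping(online_cpus, num_threads):
    """Assign thread ids round-robin over the online NUMA nodes.

    online_cpus: one list of logical cores per online socket.
    Returns one list of (thread_id, core) pairs per NUMA node.
    """
    placement = [[] for _ in online_cpus]
    for tid in range(num_threads):
        node = tid % len(online_cpus)       # scatter across sockets first
        slot = tid // len(online_cpus)      # then fill cores within a socket
        placement[node].append((tid, online_cpus[node][slot]))
    return placement

# Online topology from the example: CPU 1 and CPU 2 only
online = [["P1", "P5", "P9", "P13"], ["P2", "P6", "P10", "P14"]]
placement = scatter_mapping(online, 7)
```

With these inputs, NUMA 0 receives threads 0, 2, 4, 6 and NUMA 1 receives threads 1, 3, 5, as in the mapping listed for the example. The sketch raises an IndexError if more threads than online cores are requested.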
NUMA-optimized BFS
• The 1-D column-wise partitioning for adjacency matrix
– Partitioning: the adjacency matrix is split into parts 0 to 3
– Binding: each part is bound to one NUMA node (0th to 3rd), an 8-core Xeon E5 4640 socket with per-core L2 caches, a shared L3 cache, and local RAM
• Local traversal and all-to-all comm. for each level
– Edge traversal on local RAM (inner-NUMA-node): each NUMA node searches unvisited vertices from the duplicated frontier
– All-gathering of next frontier (inter-NUMA-node): constructs duplicated frontiers from partial (local) neighbors
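The two phases per level can be emulated sequentially as follows. This is a rough sketch under several assumptions: the partitions are contiguous vertex blocks (the slides do not give the layout), the graph is undirected, only a bottom-up style local search is shown, and the parallel NUMA nodes are replaced by a plain loop. The name `partitioned_bfs` is hypothetical.

```python
def partitioned_bfs(adj, source, num_nodes=4):
    """Emulate a level-synchronized BFS over num_nodes vertex partitions."""
    vertices = sorted(adj)
    # 1-D partitioning: partition k owns a contiguous block of vertices
    # together with their adjacency lists (assumed block layout).
    block = -(-len(vertices) // num_nodes)      # ceiling division
    owned = [vertices[k * block:(k + 1) * block] for k in range(num_nodes)]
    pred = {source: source}
    frontier = {source}                         # duplicated on every node
    while frontier:
        partial = []
        for k in range(num_nodes):              # would run in parallel
            local_next = set()
            for w in owned[k]:                  # only local vertices are written
                if w in pred:
                    continue
                for v in adj[w]:
                    if v in frontier:           # read-only duplicated frontier
                        pred[w] = v
                        local_next.add(w)
                        break
            partial.append(local_next)
        frontier = set().union(*partial)        # all-gather of the next frontier
    return pred

# Hypothetical undirected path graph 0 - 1 - 2 - 3 - 4 - 5
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
tree = partitioned_bfs(path, 0)
```

Each partition writes only to the vertices it owns and reads the shared frontier, so the only cross-partition step is the union at the end of the level.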
Degree-aware + NUMA-opt. + Dir. Opt. BFS
• Our results: degree-aware preprocessing
1. Deleting isolated vertices
2. Sorting adjacency vertices: each adjacency list A(va) is sorted by degree
[Figure: the GTEPS chart and system configuration are repeated from the "NUMA-optimized Dir. Opt. BFS" slide]
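The two preprocessing steps can be sketched as below. The slide only says "sorted by degree"; descending order is an assumption here (so that high-degree neighbors, which are more likely to be in the frontier, are checked first in a bottom-up search), and ties are broken by vertex id. The graph is assumed undirected, and `degree_aware` is a hypothetical name.

```python
def degree_aware(adj):
    """Drop isolated vertices and sort each adjacency list by degree."""
    degree = {v: len(ws) for v, ws in adj.items()}
    # Step 1: delete isolated (degree-0) vertices
    kept = {v: ws for v, ws in adj.items() if degree[v] > 0}
    # Step 2: sort neighbors, highest degree first (assumed order)
    return {v: sorted(ws, key=lambda w: (-degree[w], w))
            for v, ws in kept.items()}

# Hypothetical undirected graph; vertex 4 is isolated
g = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [2, 0], 4: []}
g_sorted = degree_aware(g)
```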
TEPS and TEPS/W on single-server
• Strong scaling for SCALE 27 on a 4-way SB-EP based Xeon
[Figure: two log-log panels vs. number of threads (1 to 64): relative GTEPS (for Graph500) and relative MTEPS/W (for Green Graph500), both on a 1 to 64 axis]
• Peak values: 29.03 GTEPS and 45.43 MTEPS/W
• Relative improvements: ×27.9 (GTEPS) and ×12.6 (MTEPS/W)
SGI UV 2000 system
• Shared-memory supercomputer
– Handles a large memory space using thread parallelism
– C/C++ with OpenMP/Pthreads (w/o MPI comm.)
– cc-NUMA architecture system based on Intel Xeon
• ISM has two full-spec. UV 2000 systems
– 4 UV 2000 racks
– Up to 2,560 cores and 64 TB memory
• ISM, SGI, and we collaborate on Graph500
– Achieved the fastest single-node result in the current list
The Institute of Statistical Mathematics (ISM)
• Japan's national research institute for statistical science
[Photo: UV2000 racks, #1 system and #2 system]
SGI UV 2000 configuration
• UV2000 has complex hardware topologies
– Socket, Node, Cube, Inner-rack, and Inter-rack
– Node = 2 sockets; Cube = 8 nodes; Rack = 32 nodes (4 cubes)
• We used NUMA-based flat parallelization
– Each NUMA node contains a "Xeon CPU E5-2470 v2" and "256 GB RAM"
– Node = 2 NUMA nodes (20 cores, 512 GB)
– Cube = 16 NUMA nodes (160 cores, 4 TB)
– Rack = 64 NUMA nodes (640 cores, 16 TB)
– Interconnect: NUMAlink, 6.7 GB/s
Weak scaling on UV2000
[Figure: GTEPS (0 to 200) vs. SCALE, where ℓ = #sockets: SCALE 26 (ℓ = 1), 27 (ℓ = 2), 28 (ℓ = 4), 29 (ℓ = 8), 30 (ℓ = 16), 31 (ℓ = 32), 32 (ℓ = 64), 33 (ℓ = 128), 34 (ℓ = 256); the plot distinguishes inner-rack comm. from inter-rack comm.]
• Graph500 June 2014 list: fastest of single node, 131 GTEPS
• Most power-efficient commercial supercomputer: 12.481 MTEPS/W = 131 GTEPS / 10.53 kW
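The weak-scaling setup on the axis (SCALE 26 on one socket, one more SCALE per doubling of sockets) keeps the problem size per socket constant. A small sketch of that arithmetic, with hypothetical helper names and edgefactor 16 as in the Graph500 slide:

```python
from math import log2

def weak_scaling_scale(sockets, base_scale=26):
    """SCALE for a socket count: SCALE 26 on one socket, +1 per doubling."""
    return base_scale + int(log2(sockets))

def problem_size(scale, edgefactor=16):
    """Return (#vertices, #edges) of the Graph500 problem at this SCALE."""
    return 2**scale, edgefactor * 2**scale

for sockets in (1, 2, 4, 8, 16, 32, 64, 128, 256):
    scale = weak_scaling_scale(sockets)
    n, m = problem_size(scale)
    print(f"sockets={sockets:3d} SCALE={scale} vertices={n} edges={m}")
```

Every row works out to 2^26 vertices and 2^30 edges per socket.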
The Graph500 List in June 2014
http://www.graph500.org
• Measures performance using TEPS (# of traversed edges per second) in graph traversal such as BFS
[Figure: screenshot of the Graph500 list; entries are labeled Distributed Memory or Shared Memory, with callouts "Fastest of single-node", "Fastest of single-server", and "Fastest of multi-node"]
The Green Graph500 List in June 2014 
http://green.graph500.org 
Big Data category ( SCALE 30) 
Small data category ( SCALE 29) 
Measures power-efficiency using TEPS/W
• Small Data category certificate: "George Washington University's Colonial is ranked No.1 in the Small Data category of the Green Graph 500 Ranking of Supercomputers with 445.92 MTEPS/W on Scale 20 on the third Green Graph 500 list published at the International Supercomputing Conference, June 23, 2014. Congratulations from the Green Graph 500 Chair"
• Big Data category certificate: "Kyushu University's GraphCREST-SandybridgeEP-2.4GHz is ranked No.1 in the Big Data category of the Green Graph 500 Ranking of Supercomputers with 59.12 MTEPS/W on Scale 30 on the third Green Graph 500 list published at the International Supercomputing Conference, June 23, 2014. Congratulations from the Green Graph 500 Chair"
• Systems labeled on the lists: SONY Xperia-Z1-SO-01F, TSUBAME-KFC, UV2000 (ours), and 4-way Xeon server (ours)
Weak scaling on UV2000
[Figure: GTEPS (0 to 200) vs. SCALE, where ℓ = #sockets, from SCALE 26 (ℓ = 1) to SCALE 34 (ℓ = 256), comparing the June 2014 and Nov. 2014 results; the plot distinguishes inner-rack comm. from inter-rack comm.]
• June 2014: 131 GTEPS, fastest of single node in the Graph500 June list
• Nov. 2014 (new result): 174 GTEPS on two racks

 
CPqD’s optical network - Miquel Garrich
CPqD’s optical network - Miquel GarrichCPqD’s optical network - Miquel Garrich
CPqD’s optical network - Miquel Garrich
 
cisco-n9k-c93108tc-ex-datasheet.pdf
cisco-n9k-c93108tc-ex-datasheet.pdfcisco-n9k-c93108tc-ex-datasheet.pdf
cisco-n9k-c93108tc-ex-datasheet.pdf
 
cisco-n9k-c93180yc-ex-datasheet.pdf
cisco-n9k-c93180yc-ex-datasheet.pdfcisco-n9k-c93180yc-ex-datasheet.pdf
cisco-n9k-c93180yc-ex-datasheet.pdf
 
Xbfs HPDC'2019
Xbfs HPDC'2019Xbfs HPDC'2019
Xbfs HPDC'2019
 
computer architecture.
computer architecture.computer architecture.
computer architecture.
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)
 
Co315 part 1
Co315   part 1Co315   part 1
Co315 part 1
 
Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas...
Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas...Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas...
Analyzing 2TB of Raw Trace Data from a Manufacturing Process: A First Use Cas...
 
Big Data LDN 2017: Big Data Analytics with MariaDB ColumnStore
Big Data LDN 2017: Big Data Analytics with MariaDB ColumnStoreBig Data LDN 2017: Big Data Analytics with MariaDB ColumnStore
Big Data LDN 2017: Big Data Analytics with MariaDB ColumnStore
 
E0364025031
E0364025031E0364025031
E0364025031
 
9.atmel
9.atmel9.atmel
9.atmel
 
P9 addressing signal_integrity_ in_ew_2015_final
P9 addressing signal_integrity_ in_ew_2015_finalP9 addressing signal_integrity_ in_ew_2015_final
P9 addressing signal_integrity_ in_ew_2015_final
 
Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners Short.course.introduction.to.vhdl for beginners
Short.course.introduction.to.vhdl for beginners
 
Binary Analysis - Luxembourg
Binary Analysis - LuxembourgBinary Analysis - Luxembourg
Binary Analysis - Luxembourg
 
C-SAW: A Framework for Graph Sampling and Random Walk on GPUs
C-SAW: A Framework for Graph Sampling and Random Walk on GPUsC-SAW: A Framework for Graph Sampling and Random Walk on GPUs
C-SAW: A Framework for Graph Sampling and Random Walk on GPUs
 
Short.course.introduction.to.vhdl
Short.course.introduction.to.vhdlShort.course.introduction.to.vhdl
Short.course.introduction.to.vhdl
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 

Kürzlich hochgeladen

Chizaram's Women Tech Makers Deck. .pptx
Chizaram's Women Tech Makers Deck.  .pptxChizaram's Women Tech Makers Deck.  .pptx
Chizaram's Women Tech Makers Deck. .pptxogubuikealex
 
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...漢銘 謝
 
proposal kumeneger edited.docx A kumeeger
proposal kumeneger edited.docx A kumeegerproposal kumeneger edited.docx A kumeeger
proposal kumeneger edited.docx A kumeegerkumenegertelayegrama
 
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATIONRACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATIONRachelAnnTenibroAmaz
 
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptxerickamwana1
 
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...Sebastiano Panichella
 
Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...Sebastiano Panichella
 
cse-csp batch4 review-1.1.pptx cyber security
cse-csp batch4 review-1.1.pptx cyber securitycse-csp batch4 review-1.1.pptx cyber security
cse-csp batch4 review-1.1.pptx cyber securitysandeepnani2260
 
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRRINDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRRsarwankumar4524
 
General Elections Final Press Noteas per M
General Elections Final Press Noteas per MGeneral Elections Final Press Noteas per M
General Elections Final Press Noteas per MVidyaAdsule1
 
Internship Presentation | PPT | CSE | SE
Internship Presentation | PPT | CSE | SEInternship Presentation | PPT | CSE | SE
Internship Presentation | PPT | CSE | SESaleh Ibne Omar
 
A Guide to Choosing the Ideal Air Cooler
A Guide to Choosing the Ideal Air CoolerA Guide to Choosing the Ideal Air Cooler
A Guide to Choosing the Ideal Air Coolerenquirieskenstar
 
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunityDon't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunityApp Ethena
 
Quality by design.. ppt for RA (1ST SEM
Quality by design.. ppt for  RA (1ST SEMQuality by design.. ppt for  RA (1ST SEM
Quality by design.. ppt for RA (1ST SEMCharmi13
 
Engaging Eid Ul Fitr Presentation for Kindergartners.pptx
Engaging Eid Ul Fitr Presentation for Kindergartners.pptxEngaging Eid Ul Fitr Presentation for Kindergartners.pptx
Engaging Eid Ul Fitr Presentation for Kindergartners.pptxAsifArshad8
 
GESCO SE Press and Analyst Conference on Financial Results 2024
GESCO SE Press and Analyst Conference on Financial Results 2024GESCO SE Press and Analyst Conference on Financial Results 2024
GESCO SE Press and Analyst Conference on Financial Results 2024GESCO SE
 
Application of GIS in Landslide Disaster Response.pptx
Application of GIS in Landslide Disaster Response.pptxApplication of GIS in Landslide Disaster Response.pptx
Application of GIS in Landslide Disaster Response.pptxRoquia Salam
 

Kürzlich hochgeladen (17)

Chizaram's Women Tech Makers Deck. .pptx
Chizaram's Women Tech Makers Deck.  .pptxChizaram's Women Tech Makers Deck.  .pptx
Chizaram's Women Tech Makers Deck. .pptx
 
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
THE COUNTRY WHO SOLVED THE WORLD_HOW CHINA LAUNCHED THE CIVILIZATION REVOLUTI...
 
proposal kumeneger edited.docx A kumeeger
proposal kumeneger edited.docx A kumeegerproposal kumeneger edited.docx A kumeeger
proposal kumeneger edited.docx A kumeeger
 
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATIONRACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
RACHEL-ANN M. TENIBRO PRODUCT RESEARCH PRESENTATION
 
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
05.02 MMC - Assignment 4 - Image Attribution Lovepreet.pptx
 
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
 
Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...Testing with Fewer Resources:  Toward Adaptive Approaches for Cost-effective ...
Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective ...
 
cse-csp batch4 review-1.1.pptx cyber security
cse-csp batch4 review-1.1.pptx cyber securitycse-csp batch4 review-1.1.pptx cyber security
cse-csp batch4 review-1.1.pptx cyber security
 
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRRINDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRR
 
General Elections Final Press Noteas per M
General Elections Final Press Noteas per MGeneral Elections Final Press Noteas per M
General Elections Final Press Noteas per M
 
Internship Presentation | PPT | CSE | SE
Internship Presentation | PPT | CSE | SEInternship Presentation | PPT | CSE | SE
Internship Presentation | PPT | CSE | SE
 
A Guide to Choosing the Ideal Air Cooler
A Guide to Choosing the Ideal Air CoolerA Guide to Choosing the Ideal Air Cooler
A Guide to Choosing the Ideal Air Cooler
 
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunityDon't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
Don't Miss Out: Strategies for Making the Most of the Ethena DigitalOpportunity
 
Quality by design.. ppt for RA (1ST SEM
Quality by design.. ppt for  RA (1ST SEMQuality by design.. ppt for  RA (1ST SEM
Quality by design.. ppt for RA (1ST SEM
 
Engaging Eid Ul Fitr Presentation for Kindergartners.pptx
Engaging Eid Ul Fitr Presentation for Kindergartners.pptxEngaging Eid Ul Fitr Presentation for Kindergartners.pptx
Engaging Eid Ul Fitr Presentation for Kindergartners.pptx
 
GESCO SE Press and Analyst Conference on Financial Results 2024
GESCO SE Press and Analyst Conference on Financial Results 2024GESCO SE Press and Analyst Conference on Financial Results 2024
GESCO SE Press and Analyst Conference on Financial Results 2024
 
Application of GIS in Landslide Disaster Response.pptx
Application of GIS in Landslide Disaster Response.pptxApplication of GIS in Landslide Disaster Response.pptx
Application of GIS in Landslide Disaster Response.pptx
 

Graph500 and Green Graph500 Benchmarks on SGI UV2000

  • 1. Graph500 and Green Graph500 benchmarks on SGI UV2000. *Yuichiro Yasui & Katsuki Fujisawa, Kyushu University and JST CREST. SGI User Group Conference, Nov. 17, 2014
  • 2. Outline
      1. Graph processing for large-scale networks
      2. Graph500 & Green Graph500 benchmarks
      3. Our NUMA-optimized BFS algorithm
      4. Numerical results on SGI UV 2000
  • 3. Graph processing for large-scale networks
      • Large-scale graphs in various fields
        – US road network: 24 million vertices & 58 million edges
        – Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges
        – Neuronal network @ Human Brain Project: 89 billion vertices & 100 trillion edges
        – Cyber-security: 15 billion log entries / day
      • Fast and scalable graph processing by using HPC
  • 4. Graph analysis and important kernel BFS
      • The cycle of graph analysis for understanding real networks
        [Figure: three-step cycle (Step 1 to Step 3) linking application field, relationships, graph, and results through constructing, graph processing, and understanding]
      • Application fields: transportation, social network, cyber-security, bioinformatics
      • Graph processing kernels
        – concurrent search (breadth-first search)
        – optimization (single-source shortest path)
        – edge-oriented (maximal independent set)
  • 5. Graph analysis and important kernel BFS (same cycle as slide 4, adding breadth-first search)
      • Breadth-first search (BFS)
        – One of the most important and fundamental graph processing kernels
        – Many algorithms and applications are based on it (max-flow and centrality)
        – Low arithmetic intensity & irregular memory accesses
      • Inputs: graph and source vertex
      • Outputs: distance (level) from the source and predecessor for each vertex
        [Figure: BFS from a source vertex with levels Lv. 1 to Lv. 3]
  • 6. Twitter follow-ship network (Twitter2009)
      • Follow-ship network (this network excludes unconnected users)
        – #Users (#vertices): 41,652,230
        – Follow-ships (#edges): 2,405,026,092
      • BFS result from user 21,804,357
          Lv.     #users        ratio (%)   percentile (%)
          0       1             0.00        0.00
          1       7             0.00        0.00
          2       6,188         0.01        0.01
          3       510,515       1.23        1.24
          4       29,526,508    70.89       72.13
          5       11,314,238    27.16       99.29
          6       282,456       0.68        99.97
          7       11,536        0.03        100.00
          8       673           0.00        100.00
          9       68            0.00        100.00
          10      19            0.00        100.00
          11      10            0.00        100.00
          12      5             0.00        100.00
          13      2             0.00        100.00
          14      2             0.00        100.00
          15      2             0.00        100.00
          Total   41,652,230    100.00      -
      • The six degrees of separation
      • Our algorithm computes a BFS in only 60 ms
  • 7. Betweenness centrality (BC)
      • Definition: $C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$
        – $\sigma_{st}$: number of shortest (s, t)-paths
        – $\sigma_{st}(v)$: number of shortest (s, t)-paths passing through vertex v
      • BC measures important vertices and edges without coordinates
        – High-score vertex/edge = important place (e.g., highway, bridge)
      • BC requires the all-to-all shortest paths
        – BFS => one-to-all
        – <#vertices> times BFS => all-to-all
      • Example: Osaka road network, 13,076 vertices and 40,528 edges => 13,076 BFS computations
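The "one BFS per source vertex" structure of BC can be sketched as follows. This is a minimal illustration for unweighted graphs, not the authors' implementation: it uses Brandes-style dependency accumulation (a technique the slide does not name), and the function name `betweenness_centrality` and the adjacency-list input are assumptions made for the example. Sums run over ordered (s, t) pairs, as in the definition above.

```python
from collections import deque

def betweenness_centrality(adj):
    """BC for an unweighted graph given as adjacency lists (hypothetical helper)."""
    n = len(adj)
    bc = [0.0] * n
    for s in range(n):                      # one BFS per source: all-to-all
        sigma = [0] * n                     # sigma[w]: number of shortest s-w paths
        dist = [-1] * n
        preds = [[] for _ in range(n)]      # predecessors on shortest paths
        sigma[s], dist[s] = 1, 0
        order = []                          # vertices in non-decreasing distance
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:             # first visit
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = [0.0] * n                   # dependency of s on each vertex
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# Path graph 0 - 1 - 2: only the middle vertex lies between other vertices.
path_bc = betweenness_centrality([[1], [0, 2], [1]])
```

For the Osaka road network on the slide, the outer loop would run 13,076 times, which is why a fast BFS kernel matters for this computation.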
  • 8. Graph500 Benchmark (www.graph500.org)
      • Measures the performance of irregular memory accesses
      • TEPS score (# of traversed edges per second) in a BFS
      • Input parameters for problem size: SCALE & edgefactor (= 16)
      • Generates a synthetic scale-free network (Kronecker graph) with 2^SCALE vertices and 2^SCALE × edgefactor edges by applying the recursive Kronecker product SCALE times
      • Benchmark flow
        1. Generation (input: SCALE, edgefactor)
        2. Construction
        3. BFS and validation × 64 iterations (per iteration: BFS time, traversed edges, TEPS)
      • Result: median TEPS over the 64 iterations
  • 9. Green Graph500 Benchmark (http://green.graph500.org)
      • Measures power efficiency using the TEPS/W score
      • Our results cover various systems such as the SGI UV series, Xeon servers, and Android devices
      • Same flow as Graph500 (1. Generation, 2. Construction, 3. BFS phase × 64), with a power measurement (watts) added to obtain TEPS/W from the median TEPS
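As a small worked example of the quantities on this slide, the sketch below computes the problem size from SCALE and edgefactor and the median TEPS over a set of BFS runs. The function names and the `(traversed_edges, seconds)` tuple format are assumptions for illustration; they are not part of the benchmark's reference code.

```python
import statistics

def graph500_size(scale, edgefactor=16):
    """Return (#vertices, #edges) for a Kronecker graph of the given SCALE."""
    n = 1 << scale            # 2^SCALE vertices
    m = n * edgefactor        # 2^SCALE * edgefactor edges
    return n, m

def teps(traversed_edges, bfs_seconds):
    """Traversed edges per second for one BFS run."""
    return traversed_edges / bfs_seconds

def median_teps(runs):
    """Median TEPS over runs given as (traversed_edges, bfs_seconds) pairs."""
    return statistics.median(teps(e, t) for e, t in runs)

# SCALE 26 with the default edgefactor gives 2^26 vertices and 2^30 edges.
n_vertices, n_edges = graph500_size(26)
```

In the benchmark itself `runs` would hold the 64 timed BFS iterations; the three-element lists used for testing here are only placeholders.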
  • 10. Target networks [Figure: scatter plot with axes log2(n) (# of vertices, log scale) and log2(m) (# of edges, log scale), ticks from 15 to 45, with 1 billion and 1 trillion marked. Plotted networks: USA-road-d.NY.gr, USA-road-d.LKS.gr, USA-road-d.USA.gr (US road network), cit-Patents, soc-LiveJournal1, twitter-rv (Twitter2009), Graph500 (Toy, Mini, Small, Medium, Large, Huge), and Human Brain Project (89 B vertices, 100 T edges)]
  • 11. Target networks on smartphone [Figure of slide 10, adding: Graph500 (SCALE20) on a smartphone (4 cores)]
  • 12. Target networks on single server [Figure of slide 11, adding: Graph500 (SCALE29) on a single server, 4-way Intel Xeon (64 cores)]
  • 13. Target networks on UV2000 [Figure of slide 12, adding: Graph500 (SCALE32) on UV2000 (1 rack, 640 cores)]
  • 14. Target networks on supercomputer [Figure of slide 13, adding: Graph500 (SCALE40) on BlueGene/Q (64K nodes) and K computer (64K nodes), labeled "K and Sequoia"]
  • 15. Problem and our motivation
      • Does UV2000 obtain high performance without MPI?
        [Figure: # of cores axis (1, 4, 32, 640, 1280, 512K). Single server: Thread > MPI; UV2000: Thread ? MPI; K computer: Thread << MPI]
  • 16. Problem and our motivation
      • Does UV2000 obtain high performance without MPI?
        [Figure of slide 15, with UV2000 now marked Thread ≈ MPI]
      • Exploiting the algorithm on NUMA and cc-NUMA systems
        – Automatic processor topology detection (Node = 2 sockets, Cube = 8 nodes, Rack = 32 nodes)
  • 17. (continued from slide 16)
        – Affinity configurations for running threads and allocating memory
        [Figure: adjacency matrix partitioned into blocks 0 to 3, each block bound to the 0th to 3rd NUMA node; each node is an 8-core Xeon E5 4640 with per-core L2 cache, shared L3 cache, and local RAM]
  • 18. Level-synchronized parallel BFS (Top-down)
      • Starts from the source vertex and executes the following two phases at each level
        – Traversal: finds neighbors QN (unvisited adjacent vertices) from the current frontier QF (level k to level k+1)
        – Swap: exchanges the frontier QF and the neighbors QN for the next level
      • Algorithm 1: Level-synchronized Parallel BFS
          Input: G = (V, A): unweighted directed graph; s: source vertex
          Variables: QF: frontier queue; QN: neighbor queue; visited: vertices already visited
          Output: π(v): predecessor map of BFS tree
          π(v) ← −1, ∀v ∈ V
          π(s) ← s
          visited ← {s}
          QF ← {s}
          QN ← ∅
          while QF ≠ ∅ do
            for v ∈ QF in parallel do
              for w ∈ A(v) do
                if w ∉ visited (atomic) then
                  π(w) ← v
                  visited ← visited ∪ {w}
                  QN ← QN ∪ {w}
            QF ← QN
            QN ← ∅
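A sequential Python rendering of Algorithm 1 may help in reading the pseudocode. It is only a sketch: the parallel loop and the atomic visited check are replaced by ordinary sequential code, `parent` plays the role of π, and the extra `level` array (not in Algorithm 1) is added to show the distance output mentioned on slide 5.

```python
def level_synchronized_bfs(adj, source):
    """Top-down BFS over adjacency lists; returns (parent, level)."""
    n = len(adj)
    parent = [-1] * n          # pi(v); -1 doubles as the "unvisited" marker
    level = [-1] * n
    parent[source] = source
    level[source] = 0
    frontier = [source]        # QF
    depth = 0
    while frontier:
        depth += 1
        neighbors = []         # QN
        # Traversal: scan the out-edges of every frontier vertex.
        for v in frontier:
            for w in adj[v]:
                if parent[w] == -1:
                    parent[w] = v
                    level[w] = depth
                    neighbors.append(w)
        # Swap: the neighbors become the frontier of the next level.
        frontier = neighbors
    return parent, level

# Directed diamond: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3
bfs_parent, bfs_level = level_synchronized_bfs([[1, 2], [3], [3], []], 0)
```

In a threaded version the test-and-set on `parent[w]` is the step that needs to be atomic, since two frontier vertices may reach the same `w` in the same level.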
  • 19. Direction-optimizing BFS (Beamer2012 @ SC2012): chooses either top-down or bottom-up at each level
      • Top-down algorithm
        – Efficient for a small frontier
        – Uses outgoing edges (v → w) from the current frontier v at level k to candidate neighbors w at level k+1
          Input: directed graph G = (V, AF), queue QF
          Data: queue QN, visited, tree π(v)
          QN ← ∅
          for v ∈ QF in parallel do
            for w ∈ AF(v) do
              if w ∉ visited (atomic) then
                π(w) ← v
                visited ← visited ∪ {w}
                QN ← QN ∪ {w}
          QF ← QN
      • Bottom-up algorithm
        – Efficient for a large frontier
        – Uses incoming edges: each unvisited vertex w looks for a parent v in the current frontier
        – Skips unnecessary edge traversal (stops scanning once a parent is found)
          Input: directed graph G = (V, AB), queue QF
          Data: queue QN, visited, tree π(v)
          QN ← ∅
          for w ∈ V \ visited in parallel do
            for v ∈ AB(w) do
              if v ∈ QF then
                π(w) ← v
                visited ← visited ∪ {w}
                QN ← QN ∪ {w}
                break
          QF ← QN
      • Observation of data accesses in the forward (top-down) and backward (bottom-up) search: both write to the variables π(w) and visited associated with w (v is only referenced as a vertex index)
  • 20. Direction-optimizing BFS (Beamer2012 @ SC2012): chooses either top-down or bottom-up at each level
      • # of traversed edges in the top-down and bottom-up search of a Kronecker graph with SCALE 26 (|V| = 2^26, |E| = 2^30)
          Level   Top-down        Bottom-up       Hybrid
          0       2               2,103,840,895   2
          1       66,206          1,766,587,029   66,206
          2       346,918,235     52,677,691      52,677,691
          3       1,727,195,615   12,820,854      12,820,854
          4       29,557,400      103,184         103,184
          5       82,357          21,467          21,467
          6       221             21,240          227
          Total   2,103,820,036   3,936,072,360   65,689,631
          Ratio   100.00%         187.09%         3.12%
      • Level = distance from the source; top-down is used while the frontier is small and bottom-up while it is large
      • Hybrid BFS reduces unnecessary edge traversals
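The per-level choice between the two directions can be sketched as below. This is a simplified sequential illustration: the switching rule (go bottom-up when the frontier holds more than a fixed fraction of the vertices) and the `threshold` parameter are assumptions made for the example, not the heuristic of Beamer et al. or of the authors, and the function name is hypothetical.

```python
def hybrid_bfs(out_adj, in_adj, source, threshold=0.05):
    """Direction-optimizing BFS sketch; returns (parent, directions per level)."""
    n = len(out_adj)
    parent = [-1] * n
    parent[source] = source
    frontier = {source}
    directions = []
    while frontier:
        next_frontier = set()
        if len(frontier) > threshold * n:
            # Bottom-up: every unvisited vertex searches its in-edges for a
            # parent in the frontier and stops at the first hit.
            directions.append("bottom-up")
            for w in range(n):
                if parent[w] != -1:
                    continue
                for v in in_adj[w]:
                    if v in frontier:
                        parent[w] = v
                        next_frontier.add(w)
                        break
        else:
            # Top-down: every frontier vertex scans all of its out-edges.
            directions.append("top-down")
            for v in frontier:
                for w in out_adj[v]:
                    if parent[w] == -1:
                        parent[w] = v
                        next_frontier.add(w)
        frontier = next_frontier
    return parent, directions

# Undirected star: vertex 0 is adjacent to vertices 1..9.
star = [list(range(1, 10))] + [[0] for _ in range(9)]
star_parent, star_directions = hybrid_bfs(star, star, 0, threshold=0.2)
```

The `break` in the bottom-up branch is what produces the savings in the table: once a vertex has found one parent, its remaining in-edges are never touched.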
  • 21. NUMA-optimized Dir. Opt. BFS
      • Manages memory accesses on a NUMA system
        – Each NUMA node contains a CPU socket and local memory
      • System configuration: CPU: Intel Xeon; #sockets: 4; #cores: 32 or 40; RAM: 256 GB or 512 GB
      • [Figure: bar chart of GTEPS (0 to 50) by implementation]
          Implementation                      Label         TEPS    Speedup
          Reference                           2011          87M     ×1
          NUMA-aware                          SC10          800M    ×9
          Dir. Opt.                           SC12          5G      ×58
          NUMA-Opt.                           BigData13     11G     ×125
          NUMA-Opt. + Deg.aware               ISC14         29G     ×334
          NUMA-Opt. + Deg.aware + Vtx.Sort    G500, ISC14   42G     ×489
  • 22. NUMA-optimized Dir. Opt. BFS (chart of slide 21, with our results marked)
      • Partitioning: the adjacency matrix is split into blocks 0 to 3
      • Binding on NUMA: each block is bound to one of the 0th to 3rd NUMA nodes (8-core Xeon E5 4640 with per-core L2 cache, shared L3 cache, and local RAM)
  • 23. NUMA architecture
      • 4-way Intel Xeon E5-4640 (Sandy Bridge-EP)
        – 4 (# of CPU sockets)
        – 8 (# of physical cores per socket)
        – 2 (# of threads per core)
        – Max. 4 × 8 × 2 = 64 threads
      • NUMA node = CPU socket (16 logical cores) + local RAM
        – Memory access to local RAM: fast
        – Memory access to remote RAM: slow
      • NUMA-aware (optimized) computation reduces and avoids memory accesses to remote RAM
  • 24. Flow of affinities using ULIBC
      • ULIBC (our library): Ubiquity Library for Intelligently Binding Cores
        – provides APIs for utilizing the processor topology easily
  • 25. Flow of affinities using ULIBC
      1. Detects the entire topology
          CPU     Cores
          CPU 0   P0, P4, P8, P12
          CPU 1   P1, P5, P9, P13
          CPU 2   P2, P6, P10, P14
          CPU 3   P3, P7, P11, P15
  • 26. Flow of affinities using ULIBC
      2. Detects the online (available) topology, as restricted by a job manager (PBS) or by numactl --cpunodebind=1,2; the remaining CPUs are used by other processes
          CPU     Cores
          CPU 1   P1, P5, P9, P13
          CPU 2   P2, P6, P10, P14
  • 27. Flow of affinities using ULIBC
      3. Constructs the ULIBC affinity
        – ULIBC_set_affinity_policy(7, SCATTER_MAPPING, THREAD_TO_CORE): 7 = # of threads, scatter-type mapping, each thread is bound to one logical core
          NUMA node   Threads (core)
          NUMA 0      0 (P1), 2 (P5), 4 (P9), 6 (P13)
          NUMA 1      1 (P2), 3 (P6), 5 (P10)
        – Each NUMA node (NUMA 0, NUMA 1) pairs its cores with local RAM
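The scatter-type mapping in step 3 can be reproduced with a few lines of Python. This is a sketch of the mapping rule only, inferred from the table on the slide (thread t goes to online NUMA node t mod #nodes); it is not ULIBC's code or API, and the function name `scatter_mapping` and its input format are hypothetical.

```python
def scatter_mapping(online_cores, num_threads):
    """Assign threads round-robin across NUMA nodes.

    online_cores: one list of logical core IDs per online NUMA node.
    Returns a list of (thread_id, numa_node, core_id) tuples.
    """
    num_nodes = len(online_cores)
    mapping = []
    for t in range(num_threads):
        node = t % num_nodes                 # spread consecutive threads over nodes
        cores = online_cores[node]
        core = cores[(t // num_nodes) % len(cores)]
        mapping.append((t, node, core))
    return mapping

# Online topology from step 2: CPU 1 and CPU 2 become NUMA 0 and NUMA 1.
threads = scatter_mapping([[1, 5, 9, 13], [2, 6, 10, 14]], 7)
```

With seven threads this yields threads 0, 2, 4, 6 on P1, P5, P9, P13 (NUMA 0) and threads 1, 3, 5 on P2, P6, P10 (NUMA 1), matching the table above.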
  • 28. NUMA-optimized BFS
      • 1-D column-wise partitioning of the adjacency matrix
        – The matrix is split into blocks 0 to 3, and each block is bound to one of the 0th to 3rd NUMA nodes (8-core Xeon E5 4640 with per-core L2 cache, shared L3 cache, and local RAM)
      • Local traversal and all-to-all communication at each level
        – Edge traversal on local RAM (inner-NUMA-node): each NUMA node searches its unvisited vertices from the duplicated frontier
        – All-gathering of the next frontier (inter-NUMA-node): the duplicated frontiers are constructed from the partial (local) neighbors of every node
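The two phases per level can be simulated in plain Python as follows. This is a sequential stand-in under several assumptions: vertices are split into contiguous ranges (one per NUMA node), each "node" owns the incoming-edge lists of its range, and the loop over ranges replaces real threads bound to sockets. The function names are hypothetical and the code does not reproduce the authors' data layout.

```python
def partition_ranges(n, parts):
    """Split vertex IDs 0..n-1 into `parts` contiguous ranges."""
    size = (n + parts - 1) // parts
    return [range(k * size, min((k + 1) * size, n)) for k in range(parts)]

def partitioned_bfs(in_adj, source, parts):
    """BFS where each partition only writes the vertices it owns."""
    n = len(in_adj)
    parent = [-1] * n
    parent[source] = source
    frontier = {source}                 # duplicated frontier, readable by all nodes
    ranges = partition_ranges(n, parts)
    while frontier:
        local_neighbors = []
        for owned in ranges:            # one iteration stands in for one NUMA node
            found = []
            for w in owned:             # local traversal over owned vertices
                if parent[w] != -1:
                    continue
                for v in in_adj[w]:
                    if v in frontier:
                        parent[w] = v   # write stays inside the owned range
                        found.append(w)
                        break
            local_neighbors.append(found)
        # All-gather: merge the partial neighbor lists into the next frontier.
        frontier = {w for found in local_neighbors for w in found}
    return parent

# Edges 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3, stored as incoming-edge lists.
part_parent = partitioned_bfs([[], [0], [0], [1, 2]], 0, parts=2)
```

The point of the layout is that all writes during traversal touch only memory owned by the local node, so remote accesses are confined to the all-gather step.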
  • 29. Degree-aware + NUMA-opt. + Dir. Opt. BFS (chart of slide 21, with our results marked)
      1. Deleting isolated vertices
      2. Sorting adjacency vertices: each adjacency list A(v) is sorted by degree
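The two preprocessing steps can be sketched as below. The slide only says "sorted by degree"; sorting each adjacency list in descending degree order is an assumption made here, as is treating the graph as undirected so that a vertex with an empty adjacency list is isolated. The function name is hypothetical.

```python
def degree_aware_preprocess(adj):
    """Drop isolated vertices, relabel the rest, and sort adjacency lists.

    Returns (compact_adj, kept) where kept[new_id] is the original vertex ID.
    """
    degree = [len(neighbors) for neighbors in adj]
    # Step 1: delete isolated vertices and assign compact new IDs.
    kept = [v for v in range(len(adj)) if degree[v] > 0]
    new_id = {old: new for new, old in enumerate(kept)}
    # Step 2: sort each adjacency list so high-degree neighbors come first.
    compact_adj = []
    for old in kept:
        ordered = sorted(adj[old], key=lambda w: -degree[w])
        compact_adj.append([new_id[w] for w in ordered])
    return compact_adj, kept

# Undirected edges 0-2, 0-3, 3-4; vertex 1 is isolated.
compact, kept_ids = degree_aware_preprocess([[2, 3], [], [0], [0, 4], [3]])
```

One plausible motivation for the descending order is the bottom-up step: a high-degree neighbor is more likely to be in the frontier, so placing it first lets the parent search stop earlier.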
  • 30. TEPS and TEPS/W on single server
      • Strong scaling for SCALE 27 on a 4-way SB-EP based Xeon, 1 to 64 threads
      • For Graph500: 29.03 GTEPS
      • For Green Graph500: 45.43 MTEPS/W
      • [Figure: relative GTEPS and relative MTEPS/W vs. number of threads (1, 2, 4, 8, 16, 32, 64); relative improvements annotated as × 27.9 and × 12.6]
  • 31. SGI UV 2000 system
      • Shared-memory supercomputer
        – handles a large memory space using thread parallelism
        – C/C++ with OpenMP/Pthreads (w/o MPI comm.)
        – cc-NUMA architecture system based on Intel Xeon
      • ISM has two full-spec UV 2000 systems (#1 and #2)
        – 4 UV 2000 racks
        – Up to 2,560 cores and 64 TB memory
      • ISM, SGI, and we collaborate on Graph500
        – achieves the fastest single-node result in the current list
      • ISM: The Institute of Statistical Mathematics, Japan's national research institute for statistical science
  • 32. SGI UV 2000 configuration • UV 2000 has a complex hardware topology – Levels: Socket, Node, Cube, Inner-rack, and Inter-rack – Node = 2 sockets; Cube = 8 nodes; Rack = 4 cubes = 32 nodes • We used NUMA-based flat parallelization – Each NUMA node contains a "Xeon CPU E5-2470 v2" and "256 GB RAM" – Node = 2 NUMA nodes (20 cores, 512 GB) – Cube = 16 NUMA nodes (160 cores, 4 TB) – Rack = 64 NUMA nodes (640 cores, 16 TB) • NUMAlink: 6.7 GB/s
  • 33. Weak scaling on UV 2000 • Graph500 June 2014 list: fastest single node, 131 GTEPS • Most power-efficient commercial supercomputer: 12.481 MTEPS/W = 131 GTEPS / 10.53 kW [Figure: GTEPS (0 to 200) versus SCALE, where ℓ = #sockets: 26 (ℓ = 1), 27 (ℓ = 2), 28 (ℓ = 4), 29 (ℓ = 8), 30 (ℓ = 16), 31 (ℓ = 32), 32 (ℓ = 64), 33 (ℓ = 128), 34 (ℓ = 256); regions marked inner-rack comm. and inter-rack comm.]
  • 34. The Graph500 List in June 2014 (http://www.graph500.org) • Measures performance using TEPS (number of traversed edges per second) in graph traversal such as BFS [Figure: list excerpt with entries labeled Distributed Memory or Shared Memory; annotations: fastest of single-node, fastest of single-server, fastest of multi-node]
  • 35. The Green Graph500 List in June 2014 (http://green.graph500.org) • Measures power efficiency using TEPS/W • Big Data category (SCALE 30 or larger) and Small Data category (SCALE 29 or smaller) • Small Data certificate: "George Washington University's Colonial is ranked No.1 in the Small Data category of the Green Graph 500 Ranking of Supercomputers with 445.92 MTEPS/W on Scale 20 on the third Green Graph 500 list published at the International Supercomputing Conference, June 23, 2014. Congratulations from the Green Graph 500 Chair" • Big Data certificate: "Kyushu University's GraphCREST-SandybridgeEP-2.4GHz is ranked No.1 in the Big Data category of the Green Graph 500 Ranking of Supercomputers with 59.12 MTEPS/W on Scale 30 on the third Green Graph 500 list published at the International Supercomputing Conference, June 23, 2014. Congratulations from the Green Graph 500 Chair" [Figure: list excerpts with highlighted entries: our UV 2000, our 4-way Xeon server, TSUBAME-KFC, SONY Xperia-Z1-SO-01F]
  • 36. Weak scaling on UV 2000 • June 2014: fastest single node in the Graph500 June list, 131 GTEPS • Nov. 2014: new result, 174 GTEPS at SCALE 33 on two racks (ℓ = 128) [Figure: GTEPS (0 to 200) versus SCALE, where ℓ = #sockets: 26 (ℓ = 1), 27 (ℓ = 2), 28 (ℓ = 4), 29 (ℓ = 8), 30 (ℓ = 16), 31 (ℓ = 32), 32 (ℓ = 64), 33 (ℓ = 128), 34 (ℓ = 256); regions marked inner-rack comm. and inter-rack comm.]
  • 37. Conclusion • UV 2000 with NUMA-based thread parallelization – Scalable for computations with irregular memory access • Graph500/Green Graph500 on UV 2000 – 131 GTEPS with 640 threads – The fastest of the single-node entries – The most power-efficient of the commercial supercomputers – 174 GTEPS for SCALE 33 with 1,280 threads • SGI UV 2000 – CPU: 64 CPUs per rack – RAM: 16 TB per rack • ULIBC will be available at https://bitbucket.org/yuichiro_yasui/ulibc
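The level-synchronous scheme described on slide 28 (local edge traversal on each NUMA node, followed by an all-gather of the next frontier) can be illustrated with a small Python sketch. This is not the authors' code, which is written in C/C++ with OpenMP/Pthreads; the names `partition_edges` and `partitioned_bfs` are hypothetical, the loop over partitions runs sequentially and only stands in for threads bound to NUMA nodes, and only the top-down direction is shown (no direction optimization or degree-aware preprocessing). The sketch assumes an undirected graph given as an edge list, with vertices split into contiguous blocks so that each partition owns the edges whose destination falls in its block.

```python
from typing import Dict, List, Set, Tuple


def partition_edges(num_vertices: int,
                    edges: List[Tuple[int, int]],
                    num_nodes: int) -> List[Dict[int, List[int]]]:
    """Split edges by destination vertex (1-D column-wise partitioning).

    Partition k stores, for every source v, only the neighbors w that
    fall in block k. This models adjacency data kept in local RAM.
    """
    block = (num_vertices + num_nodes - 1) // num_nodes  # vertices per block
    parts: List[Dict[int, List[int]]] = [dict() for _ in range(num_nodes)]
    for v, w in edges:
        # The graph is undirected, so store both directions.
        for src, dst in ((v, w), (w, v)):
            parts[dst // block].setdefault(src, []).append(dst)
    return parts


def partitioned_bfs(num_vertices: int,
                    edges: List[Tuple[int, int]],
                    root: int,
                    num_nodes: int = 4) -> List[int]:
    """Return the BFS parent array (-1 marks unreached vertices)."""
    parts = partition_edges(num_vertices, edges, num_nodes)
    parent = [-1] * num_vertices
    parent[root] = root
    frontier: Set[int] = {root}  # duplicated frontier, read by every partition

    while frontier:
        partial: List[Set[int]] = []
        # Assumption: in a real run each iteration of this loop would be
        # executed by threads pinned to NUMA node k.
        for k in range(num_nodes):
            local_next: Set[int] = set()
            for v in frontier:
                for w in parts[k].get(v, ()):
                    # w is owned by partition k, so no other partition
                    # writes parent[w] during this level.
                    if parent[w] == -1:
                        parent[w] = v
                        local_next.add(w)
            partial.append(local_next)
        # All-gather: merge the partial neighbor sets into the next
        # duplicated frontier.
        frontier = set().union(*partial)
    return parent


if __name__ == "__main__":
    tree_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 5), (5, 6)]
    print(partitioned_bfs(8, tree_edges, root=0, num_nodes=4))
```

Because each partition writes only to the vertices it owns, the per-level traversal needs no synchronization between partitions; the only shared step is the merge that rebuilds the duplicated frontier.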