Graph500 and Green Graph500 Benchmarks on SGI UV2000
1. Graph500 and Green Graph500 Benchmarks on SGI UV2000
*Yuichiro Yasui & Katsuki Fujisawa
Kyushu University and JST CREST
SGI User Group Conference
Nov. 17, 2014
2. Outline
1. Graph processing for large-scale networks
2. Graph500 & Green Graph500 benchmarks
3. Our NUMA-optimized BFS algorithm
4. Numerical results on SGI UV 2000
3. Graph processing for large-scale networks
• Large-scale graphs arise in various fields
– US road network: 58 million edges
– Twitter follow-ship: 1.47 billion edges
– Neuronal network: 100 trillion edges
• Fast and scalable graph processing by using HPC
[Figure: example networks by scale. US road network: 24 million vertices, 58 million edges. Twitter social network: 61.6 million vertices, 1.47 billion edges. Cyber-security: 15 billion log entries per day. Neuronal network (Human Brain Project): 89 billion vertices, 100 trillion edges.]
4. Graph analysis and important kernel: BFS
• The cycle of graph analysis for understanding real networks
[Figure: the analysis cycle. Step 1: graph construction (input parameters SCALE and edgefactor; graph generation, then graph construction). Step 2: graph processing (BFS and validation, 64 iterations; results: BFS time, traversed edges, TEPS ratio). Step 3: understanding relationships in the application field.]
• Application fields: transportation, social networks, cyber-security, bioinformatics
• Representative kernels:
– concurrent search (breadth-first search)
– optimization (single-source shortest path)
– edge-oriented (maximal independent set)
5. Graph analysis and important kernel: BFS
(The analysis cycle and representative kernels are as on the previous slide.)
Breadth-first search (BFS)
• One of the most important and fundamental graph kernels
• Many algorithms and applications are built on it (e.g., max-flow and centrality)
• Low arithmetic intensity and irregular memory accesses
• Inputs: a graph and a source vertex
• Outputs: the distance (level) and the predecessor of each vertex from the source
[Figure: a BFS from the source vertex, expanding through levels 1, 2, and 3.]
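To make these inputs and outputs concrete, the following is a minimal sequential sketch of the kernel in C++ (our own illustration, not the benchmark code): it takes an adjacency-list graph and a source vertex and fills in the level and predecessor of every reachable vertex.

    // Minimal sequential BFS: level (distance) and predecessor from a source.
    #include <queue>
    #include <vector>

    void bfs(const std::vector<std::vector<int>>& adj, int source,
             std::vector<int>& level, std::vector<int>& pred) {
      level.assign(adj.size(), -1);   // -1 marks unvisited vertices
      pred.assign(adj.size(), -1);
      std::queue<int> q;
      level[source] = 0;
      pred[source] = source;          // the source is its own predecessor
      q.push(source);
      while (!q.empty()) {
        const int v = q.front(); q.pop();
        for (int w : adj[v]) {
          if (level[w] == -1) {       // first visit fixes level and predecessor
            level[w] = level[v] + 1;
            pred[w] = v;
            q.push(w);
          }
        }
      }
    }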
6. Twitter follow-ship network
Twitter2009
• follow-ship network
– #Users (#vertices) 41,652,230
– Follow-ships (#edges) 2,405,026,092
BFS result from User 21,804,357
Lv. #users ratio (%) percentile (%)
0 1 0.00 0.00
1 7 0.00 0.00
2 6,188 0.01 0.01
3 510,515 1.23 1.24
4 29,526,508 70.89 72.13
5 11,314,238 27.16 99.29
6 282,456 0.68 99.97
7 11,536 0.03 100.00
8 673 0.00 100.00
9 68 0.00 100.00
10 19 0.00 100.00
11 10 0.00 100.00
12 5 0.00 100.00
13 2 0.00 100.00
14 2 0.00 100.00
15 2 0.00 100.00
Total 41,652,230 100.00 -
(This network excludes unconnected users.)
The six degrees of separation: over 99% of users are reached within five hops.
Our algorithm computes a BFS on this network in only 60 ms.
7. Betweenness centrality (BC)
• Definition

    C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}

where σ_st is the number of shortest (s, t)-paths and σ_st(v) is the number of shortest (s, t)-paths passing through vertex v.
[Figure: Osaka road network, 13,076 vertices and 40,528 edges.]
• BC identifies important vertices and edges without using coordinates
– a high-scoring vertex or edge is an important place (e.g., a highway or a bridge)
• BC requires the all-to-all shortest paths
– one BFS gives the one-to-all shortest paths
– <#vertices> BFS runs give all-to-all
=> 13,076 BFS computations for the Osaka network (see the sketch below)
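The standard way to organize these per-source computations is Brandes' algorithm, which runs one BFS per source and accumulates each vertex's dependency. The sketch below is a sequential illustration for unweighted graphs (not our optimized code) and computes C_B(v) exactly as defined above.

    // Brandes' betweenness centrality for an unweighted graph:
    // one BFS per source, i.e., |V| BFS computations in total.
    #include <queue>
    #include <stack>
    #include <vector>

    std::vector<double> betweenness(const std::vector<std::vector<int>>& adj) {
      const int n = (int)adj.size();
      std::vector<double> cb(n, 0.0);
      for (int s = 0; s < n; ++s) {
        std::vector<std::vector<int>> pred(n);  // shortest-path predecessors
        std::vector<double> sigma(n, 0.0);      // sigma_st: # of shortest paths
        std::vector<int> dist(n, -1);
        std::stack<int> order;                  // vertices in BFS visit order
        std::queue<int> q;
        sigma[s] = 1.0; dist[s] = 0; q.push(s);
        while (!q.empty()) {                    // one-to-all BFS from s
          const int v = q.front(); q.pop();
          order.push(v);
          for (int w : adj[v]) {
            if (dist[w] < 0) { dist[w] = dist[v] + 1; q.push(w); }
            if (dist[w] == dist[v] + 1) {       // v lies on a shortest path to w
              sigma[w] += sigma[v];
              pred[w].push_back(v);
            }
          }
        }
        std::vector<double> delta(n, 0.0);      // dependency accumulation
        while (!order.empty()) {
          const int w = order.top(); order.pop();
          for (int v : pred[w])
            delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w]);
          if (w != s) cb[w] += delta[w];
        }
      }
      return cb;
    }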
8. Graph500 Benchmark
www.graph500.org
• Measures the performance of irregular memory accesses
• TEPS score (# of traversed edges per second) in a BFS
• Input parameters for the problem size: SCALE & edgefactor (= 16)
[Figure: benchmark flow. 1. Generation (SCALE, edgefactor). 2. Construction. 3. BFS x 64: each of the 64 iterations runs a BFS followed by a validation and records the BFS time, traversed edges, and TEPS ratio; the median TEPS is reported.]
• Generates a synthetic scale-free network with 2^SCALE vertices and 2^SCALE x edgefactor edges by applying the recursive Kronecker product SCALE times
[Figure: Kronecker graph construction, G1 -> G2 -> G3 -> G4.]
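The generator can be pictured as recursively choosing one of the four quadrants of the adjacency matrix, one choice per SCALE bit. Below is a simplified R-MAT-style sketch using the Graph500 initiator probabilities (A, B, C, D) = (0.57, 0.19, 0.19, 0.05); the official generator additionally shuffles vertex labels and the edge list, which this illustration omits.

    // Simplified Kronecker (R-MAT-style) edge generator: for each edge, one
    // quadrant choice per bit of SCALE selects one row bit and one column bit.
    #include <cstdint>
    #include <random>
    #include <utility>
    #include <vector>

    std::vector<std::pair<int64_t, int64_t>>
    kronecker_edges(int scale, int edgefactor, uint64_t seed) {
      const int64_t n = int64_t(1) << scale;      // 2^SCALE vertices
      const int64_t m = n * edgefactor;           // 2^SCALE x edgefactor edges
      const double A = 0.57, B = 0.19, C = 0.19;  // D = 1 - A - B - C = 0.05
      std::mt19937_64 rng(seed);
      std::uniform_real_distribution<double> uni(0.0, 1.0);
      std::vector<std::pair<int64_t, int64_t>> edges(m);
      for (int64_t e = 0; e < m; ++e) {
        int64_t u = 0, v = 0;
        for (int b = 0; b < scale; ++b) {
          const double r = uni(rng);
          const int i = r >= A + B;                                // quadrant C or D
          const int j = (r >= A && r < A + B) || (r >= A + B + C); // quadrant B or D
          u = (u << 1) | i;
          v = (v << 1) | j;
        }
        edges[e] = {u, v};
      }
      return edges;
    }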
9. Green Graph500 Benchmark
http://green.graph500.org
• Measures power efficiency using the TEPS/W score
• Our results cover various systems such as the SGI UV series, Xeon servers, and Android devices
[Figure: benchmark flow, as in Graph500 (1. Generation with SCALE and edgefactor, 2. Construction, 3. BFS phase x 64 with validation and TEPS ratio). Power is measured in watts during the BFS phase; the Graph500 score is the median TEPS, and the Green Graph500 score is TEPS/W.]
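As a worked example of the metric: TEPS/W is the TEPS score divided by the average power measured during the BFS phase. Our single-server numbers on slide 30 (29.03 GTEPS and 45.43 MTEPS/W) therefore imply an average draw of roughly

    29,030 MTEPS / 45.43 MTEPS/W ≈ 639 W

during the measured run (the wattage is derived here for illustration, not reported separately).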
15. Problem and Our motivation
• Can UV2000 achieve high performance without MPI?
[Figure: parallelization regimes by core count (1, 4, 32, 640, 1280, ..., 512K). The single-server UV2000 sits in the "thread > MPI" regime; the K computer sits in the "thread << MPI" regime.]
16. Problem and Our motivation
• Can UV2000 achieve high performance without MPI?
[Figure: the same core-count axis. "Thread ≈ MPI" at small core counts, "thread > MPI" on the single-server UV2000 (up to 1,280 cores), "thread << MPI" on the K computer (512K cores).]
• Exploiting the algorithm on NUMA and cc-NUMA systems
– automatic processor topology detection
[Figure: UV2000 topology. Node = 2 sockets (each CPU + RAM), Cube = 8 nodes, Rack = 32 nodes.]
18. Level-synchronized parallel BFS (Top-down)
• Starts from the source vertex and executes the following two phases at each level:
– Traversal finds the neighbors QN (the unvisited adjacent vertices) from the current frontier QF
– Swap exchanges the frontier QF and the neighbors QN for the next level (level k to level k+1)
• The input of a BFS is a graph G = (V, E) with a set of vertices V and a set of edges E; the set of edges corresponds to a set of adjacency lists, where an adjacency list A(v) contains the edges (v, w) ∈ E for each vertex v ∈ V. A BFS builds a tree spanning all vertices reachable from the source vertex s ∈ V and outputs the predecessor map π, which maps each vertex to its parent.
• The benchmark iterates the timed BFS-phase and the untimed verify-phase 64 times: the BFS-phase runs a BFS for each source, and the verify-phase checks the output of the BFS against the given graph. A submission must report five TEPS ratios: the minimum, first quartile, median, third quartile, and maximum.

Algorithm 1: Level-synchronized Parallel BFS
  Input    : G = (V, A) : unweighted directed graph; s : source vertex
  Variables: QF : frontier queue; QN : neighbor queue; visited : vertices already visited
  Output   : π(v) : predecessor map of the BFS tree

  π(v) ← −1, ∀v ∈ V
  π(s) ← s
  visited ← {s}
  QF ← {s}
  QN ← ∅
  while QF ≠ ∅ do
      for v ∈ QF in parallel do
          for w ∈ A(v) do
              if w ∉ visited (atomic) then
                  π(w) ← v
                  visited ← visited ∪ {w}
                  QN ← QN ∪ {w}
      QF ← QN; QN ← ∅
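A direct C++/OpenMP rendering of Algorithm 1 might look as follows (a minimal sketch over an assumed CSR graph type, not our NUMA-optimized implementation): visited becomes a per-vertex atomic flag, and each thread collects its new neighbors locally before merging them into QN.

    // Level-synchronized top-down BFS (Algorithm 1) with OpenMP.
    #include <atomic>
    #include <cstdint>
    #include <vector>
    #include <omp.h>

    // Graph in CSR form: neighbors of v are adj[offset[v] .. offset[v+1]).
    struct Graph {
      std::vector<int64_t> offset;
      std::vector<int64_t> adj;
    };

    std::vector<int64_t> bfs_topdown(const Graph& g, int64_t source) {
      const int64_t n = (int64_t)g.offset.size() - 1;
      std::vector<int64_t> pi(n, -1);             // pi(v) = -1 : unvisited
      std::vector<std::atomic<char>> visited(n);
      for (auto& f : visited) f.store(0);
      std::vector<int64_t> QF{source}, QN;        // frontier, neighbor queues
      pi[source] = source;
      visited[source].store(1);
      while (!QF.empty()) {
        #pragma omp parallel
        {
          std::vector<int64_t> local;             // per-thread neighbor buffer
          #pragma omp for nowait
          for (size_t i = 0; i < QF.size(); ++i) {
            const int64_t v = QF[i];
            for (int64_t e = g.offset[v]; e < g.offset[v + 1]; ++e) {
              const int64_t w = g.adj[e];
              char expected = 0;                  // atomic test-and-set of visited(w)
              if (visited[w].compare_exchange_strong(expected, 1)) {
                pi[w] = v;                        // π(w) ← v
                local.push_back(w);
              }
            }
          }
          #pragma omp critical                    // QN ← QN ∪ local
          QN.insert(QN.end(), local.begin(), local.end());
        }
        QF.swap(QN);                              // swap frontier and neighbors
        QN.clear();
      }
      return pi;
    }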
19. Direction-optimizing BFS (Beamer 2012 @ SC2012)
• Chooses between Top-down and Bottom-up at each level
• Observation of the data accesses in the forward (top-down) and backward (bottom-up) searches:

Top-down algorithm
• Efficient for a small frontier
• Uses outgoing edges
• Writes data during the forward search
[Figure: a frontier vertex v at level k follows an outgoing edge v → w to an unvisited neighbor w at level k+1.]

  Input : Directed graph G = (V, AF), Queue QF
  Data  : Queue QN, visited, Tree π(v)

  QN ← ∅
  for v ∈ QF in parallel do
      for w ∈ AF(v) do
          if w ∉ visited (atomic) then
              π(w) ← v
              visited ← visited ∪ {w}
              QN ← QN ∪ {w}
  QF ← QN

Bottom-up algorithm
• Efficient for a large frontier
• Uses incoming edges
• Writes data during the backward search
• Skips unnecessary edge traversals: each candidate stops as soon as a parent in the current frontier is found
[Figure: an unvisited vertex w at level k+1 scans its incoming edges for a parent v in the current frontier at level k.]

  Input : Directed graph G = (V, AB), Queue QF
  Data  : Queue QN, visited, Tree π(v)

  QN ← ∅
  for w ∈ V \ visited in parallel do
      for v ∈ AB(w) do
          if v ∈ QF then
              π(w) ← v
              visited ← visited ∪ {w}
              QN ← QN ∪ {w}
              break
  QF ← QN

• Both variants write only to the w-indexed variables π(w) and visited (v is referenced only as a vertex index)
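For contrast with the top-down sketch above, here is one bottom-up level in the same style (an illustration with byte-map frontiers, not our implementation). Note that no atomics are needed: each unvisited vertex w is examined by exactly one thread, matching the observation that all writes go to the w-indexed π(w) and visited.

    // One bottom-up BFS level: every unvisited vertex scans its incoming
    // edges and stops at the first parent found in the current frontier.
    #include <cstdint>
    #include <vector>

    void bottomup_step(const std::vector<int64_t>& offset, // CSR over incoming edges
                       const std::vector<int64_t>& adj,
                       const std::vector<char>& in_frontier,  // level k bitmap
                       std::vector<char>& next_frontier,      // level k+1 bitmap
                       std::vector<char>& visited,
                       std::vector<int64_t>& pi) {
      const int64_t n = (int64_t)offset.size() - 1;
      #pragma omp parallel for schedule(dynamic, 1024)
      for (int64_t w = 0; w < n; ++w) {
        if (visited[w]) continue;            // only unvisited vertices search
        for (int64_t e = offset[w]; e < offset[w + 1]; ++e) {
          const int64_t v = adj[e];
          if (in_frontier[v]) {              // parent found in current frontier
            pi[w] = v;
            visited[w] = 1;                  // only this thread touches w's slots
            next_frontier[w] = 1;
            break;                           // skip the remaining incoming edges
          }
        }
      }
    }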
21. NUMA-optimized Dir. Opt. BFS
• Manages memory accesses on a NUMA system
– Each NUMA node contains a CPU socket and its local memory
[Chart: GTEPS (0 to 50) of successive implementations:]
  Reference (SC10)                                   87 MTEPS  (x1)
  NUMA-aware (2011)                                 800 MTEPS  (x9)
  Dir.Opt. (SC12)                                     5 GTEPS  (x58)
  NUMA-Opt. (BigData13)                              11 GTEPS  (x125)
  NUMA-Opt. + Deg.aware (ISC14)                      29 GTEPS  (x334)
  NUMA-Opt. + Deg.aware + Vtx.Sort (G500, ISC14)     42 GTEPS  (x489)
System configuration: Intel Xeon CPUs, 4 sockets, 32 or 40 cores, 256 GB or 512 GB RAM
22. NUMA-optimized Dir. Opt. BFS (cont.)
• Our results add NUMA-aware data placement to the progression above
• Partitioning: the adjacency matrix is split into column blocks 0 to 3
• Binding on NUMA: each block is bound to one NUMA node, an 8-core Xeon E5-4640 socket with per-core L2 caches, a shared L3 cache, and local RAM (blocks 0th to 3rd map to sockets 0 to 3)
(Chart and system configuration as on the previous slide.)
23. NUMA architecture
• 4-way Intel Xeon E5-4640 (Sandy Bridge-EP)
– 4 CPU sockets
– 8 physical cores per socket
– 2 threads per core
– max. 4 x 8 x 2 = 64 threads
• NUMA node = a CPU socket (16 logical cores) + its local RAM
[Figure: a NUMA node, an 8-core Xeon E5-4640 with per-core L2 caches, a shared L3 cache, and local RAM.]
• Memory accesses to local RAM are fast; accesses to remote RAM are slow
• NUMA-aware (optimized) computation reduces and avoids memory accesses to remote RAM
24. Flow of affinities using ULIBC
ULIBC: Ubiquity Library for Intelligently Binding Cores (our library)
– provides APIs for utilizing the processor topology easily
25. Flow of affinities using ULIBC
1. Detects the entire topology:
   CPU 0: P0, P4, P8, P12
   CPU 1: P1, P5, P9, P13
   CPU 2: P2, P6, P10, P14
   CPU 3: P3, P7, P11, P15
26. Flow of affinities using ULIBC
1. Detects the entire topology (CPUs 0 to 3 as above; CPUs 0 and 3 are in use by other processes)
2. Detects the online (available) topology, restricted by a job manager (PBS) or by
   numactl --cpunodebind=1,2
   CPU 1: P1, P5, P9, P13
   CPU 2: P2, P6, P10, P14
27. Flow of affinities using ULIBC
1. Detects the entire topology
2. Detects the online (available) topology (CPU 1: P1, P5, P9, P13 and CPU 2: P2, P6, P10, P14, selected via a job manager (PBS) or numactl --cpunodebind=1,2)
3. Constructs the ULIBC affinity:
   ULIBC_set_affinity_policy(7, SCATTER_MAPPING, THREAD_TO_CORE)
   requests 7 threads with a scatter-type mapping, binding each thread to one logical core:
   NUMA 0: threads 0 (P1), 2 (P5), 4 (P9), 6 (P13)
   NUMA 1: threads 1 (P2), 3 (P6), 5 (P10)
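ULIBC itself is our library; as a stand-in, the sketch below shows what a scatter-type THREAD_TO_CORE mapping boils down to, using the standard Linux affinity call pthread_setaffinity_np and the online cores of the example above (P1, P5, P9, P13 on NUMA 0 and P2, P6, P10 on NUMA 1).

    // Illustration only (Linux, g++ -fopenmp): binds 7 OpenMP threads to
    // logical cores in scatter order, alternating between the two NUMA nodes.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <omp.h>
    #include <cstdio>

    int main() {
      const int scatter[7] = {1, 2, 5, 6, 9, 10, 13};  // thread i -> core scatter[i]
      omp_set_num_threads(7);
      #pragma omp parallel
      {
        const int tid = omp_get_thread_num();
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(scatter[tid], &set);            // bind this thread to one core
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        std::printf("thread %d -> P%d\n", tid, scatter[tid]);
      }
      return 0;
    }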
28. NUMA-optimized BFS
• 1-D column-wise partitioning of the adjacency matrix
[Figure: the adjacency matrix split into column blocks 0 to 3, each bound to one NUMA node (an 8-core Xeon E5-4640 with shared L3 cache and local RAM); inner-NUMA-node vs. inter-NUMA-node binding.]
• Local traversal and all-to-all communication for each level:
– Edge traversal on local RAM: each NUMA node searches for unvisited vertices from its duplicated frontier
– All-gathering of the next frontier: the duplicated frontiers are constructed from the partial local neighbors
[Figure: per-NUMA-node In/Out frontier queues 0 to 3 exchanged across the nodes.]
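Schematically, one level of this scheme can be written as follows (a compact illustration with assumed per-node CSR blocks and byte-map frontiers; the real implementation binds one thread team per NUMA node via ULIBC and uses bitmaps). The owned-range scan is the local traversal, and the copy loop at the end is the all-gather of the next frontier.

    // One BFS level under 1-D column-wise partitioning across NUMA nodes.
    // Each node owns the vertex range [lo, hi), a CSR block over the incoming
    // edges of those vertices, a duplicated frontier covering all vertices,
    // and a partial next frontier for its own range.
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct NumaPart {
      int64_t lo, hi;                    // owned vertex range [lo, hi)
      std::vector<int64_t> offset, adj;  // local CSR over incoming edges
      std::vector<char> frontier;        // duplicated frontier (all vertices)
      std::vector<char> next_local;      // partial next frontier (owned range)
    };

    // Returns true if any vertex joined the next frontier.
    bool bfs_level(std::vector<NumaPart>& parts, std::vector<char>& visited,
                   std::vector<int64_t>& pi) {
      bool any = false;
      // Local traversal: reads of offset/adj/frontier stay in local RAM.
      #pragma omp parallel for reduction(||:any)
      for (size_t k = 0; k < parts.size(); ++k) {
        NumaPart& p = parts[k];
        for (int64_t w = p.lo; w < p.hi; ++w) {  // bottom-up over owned vertices
          if (visited[w]) continue;              // w is owned by one node: no races
          for (int64_t e = p.offset[w - p.lo]; e < p.offset[w - p.lo + 1]; ++e) {
            if (p.frontier[p.adj[e]]) {          // parent in duplicated frontier
              pi[w] = p.adj[e];
              visited[w] = 1;
              p.next_local[w - p.lo] = 1;
              any = true;
              break;                             // skip remaining incoming edges
            }
          }
        }
      }
      // All-gather: rebuild every node's duplicated frontier from the partials.
      for (auto& dst : parts)
        for (auto& src : parts)
          std::copy(src.next_local.begin(), src.next_local.end(),
                    dst.frontier.begin() + src.lo);
      for (auto& p : parts)
        std::fill(p.next_local.begin(), p.next_local.end(), 0);
      return any;
    }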
29. Degree-aware + NUMA-opt. + Dir. Opt. BFS
• Manages memory accesses on a NUMA system (chart and system configuration as on slide 21; the degree-aware variants reach 29 and 42 GTEPS)
• Our degree-aware techniques (see the sketch below):
1. Deleting isolated vertices
2. Sorting the adjacency lists A(v) by vertex degree
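A minimal sketch of technique 2 (our own illustration over an assumed CSR layout): each adjacency list is sorted by endpoint degree in descending order, so the bottom-up search tends to probe high-degree vertices first, and those are the most likely to already be in the frontier. Deleting isolated vertices (technique 1) is a relabeling pass that drops degree-0 vertices before this step.

    // Sort each adjacency list by the degree of the endpoint (descending).
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    void sort_adjacency_by_degree(const std::vector<int64_t>& offset,
                                  std::vector<int64_t>& adj) {
      const int64_t n = (int64_t)offset.size() - 1;
      std::vector<int64_t> degree(n);
      for (int64_t v = 0; v < n; ++v)
        degree[v] = offset[v + 1] - offset[v];
      for (int64_t v = 0; v < n; ++v)
        std::sort(adj.begin() + offset[v], adj.begin() + offset[v + 1],
                  [&](int64_t a, int64_t b) { return degree[a] > degree[b]; });
    }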
30. TEPS and TEPS/W on a single server
• Strong scaling for SCALE 27 on a 4-way Sandy Bridge-EP Xeon
[Figure: relative GTEPS (for Graph500) and relative MTEPS/W (for Green Graph500) vs. the number of threads; both axes log-scale from 1 to 64.]
• Peak results: 29.03 GTEPS and 45.43 MTEPS/W
• Relative improvements: x27.9 (GTEPS) and x12.6 (MTEPS/W)
31. SGI UV 2000 system
• Shared-memory supercomputer
– handles a large memory space using thread parallelism
– C/C++ with OpenMP/Pthreads (without MPI communication)
– cc-NUMA architecture based on Intel Xeon
• ISM has two full-spec UV 2000 systems
– 4 UV 2000 racks
– up to 2,560 cores and 64 TB of memory
• ISM, SGI, and our group collaborate on Graph500
– achieved the fastest single-node result in the current list
The Institute of Statistical Mathematics (ISM)
• Japan's national research institute for statistical science
[Photo: UV2000 racks, the #1 system and #2 system.]
32. SGI UV 2000 configuration
• UV2000 has a complex hardware topology
– Socket, Node, Cube, inner-rack, and inter-rack levels
[Figure: Node = 2 sockets; Cube = 8 nodes; Rack = 32 nodes (4 cubes), connected by NUMAlink at 6.7 GB/s.]
• We used NUMA-based flat parallelization
– Each NUMA node contains one "Xeon CPU E5-2470 v2" and "256 GB RAM"
[Figure: Node = 2 NUMA nodes (20 cores, 512 GB); Cube = 16 NUMA nodes (160 cores, 4 TB); Rack = 64 NUMA nodes (640 cores, 16 TB).]
34. The Graph500 List in June 2014
http://www.graph500.org
• Measures performance using TEPS (# of traversed edges per second) in graph traversal such as BFS
[Figure: the top of the June 2014 list, with entries labeled as distributed memory or shared memory. Our entries are the fastest single-node and the fastest single-server systems, alongside the fastest multi-node (distributed-memory) systems.]
35. The Green Graph500 List in June 2014
http://green.graph500.org
• Measures power efficiency using TEPS/W
• Big Data category (SCALE ≥ 30); Small Data category (SCALE ≤ 29)
[Certificates: George Washington University's Colonial is ranked No. 1 in the Small Data category with 445.92 MTEPS/W at SCALE 20; Kyushu University's GraphCREST-SandybridgeEP-2.4GHz is ranked No. 1 in the Big Data category with 59.12 MTEPS/W at SCALE 30. Both appear on the third Green Graph 500 list, published at the International Supercomputing Conference, June 23, 2014, with congratulations from the Green Graph 500 Chair.]
[Chart labels: our UV2000 and 4-way Xeon server entries, the SONY Xperia-Z1-SO-01F, and TSUBAME-KFC.]
36. Weak scaling on UV2000
[Figure: weak scaling on UV 2000. GTEPS (0 to 200) vs. SCALE 26 to 34, where ℓ = #sockets doubles with SCALE (SCALE 26: ℓ = 1; 27: ℓ = 2; 28: ℓ = 4; 29: ℓ = 8; 30: ℓ = 16; 31: ℓ = 32; 32: ℓ = 64; 33: ℓ = 128, two racks; 34: ℓ = 256). Inner-rack communication applies up to one rack; inter-rack communication beyond. The June 2014 curve peaks at 131 GTEPS, the fastest single-node entry in the June Graph500 list; the Nov. 2014 curve reaches a new result of 174 GTEPS at SCALE 33 on two racks.]
37. Conclusion
• UV 2000 with NUMA-based thread parallelization
– scalable for irregular memory-access computation
• Graph500/Green Graph500 on UV 2000
– 131 GTEPS with 640 threads
– the fastest of the single-node entries
– the most power-efficient among commercial supercomputers
– new result: 174 GTEPS for SCALE 33 with 1,280 threads
SGI UV2000: 64 CPUs and 16 TB of RAM per rack
• ULIBC will be available at https://bitbucket.org/yuichiro_yasui/ulibc