TSUBAME2.5 to 3.0 and
Convergence with Extreme
Big Data
Satoshi Matsuoka
Professor
Global Scientific Information and Computing (GSIC) Center
Tokyo Institute of Technology
Fellow, Association for Computing Machinery (ACM)
Rakuten Technology Conference 2013
2013/10/26
Tokyo, Japan
Supercomputers from the Past

Fast, Big, Special, Inefficient,
Evil device to conquer the world…
Let us go back to the mid ’70s
Birth of “microcomputers” and arrival of commodity computing (start of my career)
• Commodity 8-bit CPUs…
  – Intel 4004/8008/8080/8085, Zilog Z-80, Motorola 6800, MOS Tech. 6502, …
• Lead to hobbyist computing…
  – Evaluation boards: Intel SDK-80, Motorola MEK6800D2, MOS Tech. KIM-1, (in Japan) NEC TK-80, Fujitsu Lkit-8, …
  – System kits: MITS Altair 8800/680b, IMSAI 8080, Proc. Tech. SOL-20, SWTPC 6800, …
• & Lead to early personal computers
  – Commodore PET, Tandy TRS-80, Apple II
  – (in Japan): Hitachi Basic Master, NEC CompoBS / PC8001, Fujitsu FM-8, …
Supercomputing vs. Personal Computing in the late 1970s
• Hitachi Basic Master (1978)
  – “The first PC in Japan”
  – Motorola 6802, 1MHz, 16KB ROM, 16KB RAM
  – Linpack in BASIC: approx. 70-80 FLOPS (1/1,000,000)
• We got “simulation” done (in assembly language)
  – Nintendo NES (1982): MOS Technology 6502, 1MHz (same as the Apple II)
  – “Pinball” by Matsuoka & Iwata (now CEO of Nintendo)
    • Realtime dynamics + collision + lots of shortcuts
    • Average ~a few KFLOPS
Cf. Cray-1 (1976) running Linpack 10: 80-90 MFlops (est.)
Then things got accelerated
around the mid 80s to mid 90s
(rapid commoditization towards what we use now)
• PC CPUs: Intel 8086/286/386/486/Pentium (superscalar & fast FP x86), Motorola 68000/020/030/040, … to Xeons, GPUs, Xeon Phis
  – C.f. RISCs: SPARC, MIPS, PA-RISC, IBM Power, DEC Alpha, …
• Storage evolution: cassettes and floppies to HDDs, optical disks to Flash
• Network evolution: RS-232C to Ethernet, now to FDR InfiniBand
• PC (incl. I/O): IBM PC “clones” and Macintoshes: ISA to VLB to PCIe
• Software evolution: CP/M to MS-DOS to Windows, Linux, …
• WAN evolution: RS-232+modem+BBS to modem+Internet to ISDN/ADSL/FTTH broadband, DWDM backbone, LTE, …
• Internet evolution: email + ftp to Web, Java, Ruby, …
• Then clusters, Grid/Clouds, 3-D gaming, and the Top500 all started in the mid 90s(!), and commoditized supercomputing
Modern Day Supercomputers
• Now supercomputers “look like” IDC servers
• High-end COTS dominate
• Linux-based machines with a standard + HPC OSS software stack

[Timeline figure: 1957 … 2010 “Reclaimed No.1 Supercomputer Rank in the World” … 2011 … 2012]
Top Supercomputers vs. Global IDC
• K Computer (#1 2011-12), Riken AICS
  – Fujitsu SPARC64 VIIIfx (Venus) CPU
  – 88,000 nodes, 800,000 CPU cores, ~11 Petaflops (10^16)
  – 1.4 Petabytes memory, 13 MW power, 864 racks, 3000 m2
• Tianhe-2 (#1 2013), Guangzhou, China
  – 48,000 KNC Xeon Phi + 36,000 Ivy Bridge Xeon
  – 18,000 nodes, >3 million CPU cores, 54 Petaflops (10^16)
  – 0.8 Petabytes memory, 20 MW power, ??? racks, ??? m2
• #1 2012: IBM BlueGene/Q “Sequoia”, Lawrence Livermore National Lab
  – IBM PowerPC System-on-Chip
  – 98,000 nodes, 1.57 million cores, ~20 Petaflops
  – 1.6 Petabytes, 8 MW, 96 racks
• C.f. Amazon ~= 450,000 nodes, ~3 million cores
• DARPA study: a 2020 Exaflop (10^18) machine would need on the order of 100 million to 1 billion cores
Scalability and Massive Parallelism
• More nodes & cores => massive increase in parallelism => faster, “bigger” simulation; a qualitative difference
• Ideal linear scaling of performance with CPU cores (~= parallelism) is difficult to achieve
• Limitations in power, cost, and reliability; limitations in scaling
[Chart: performance vs. CPU cores, contrasting ideal linear scaling (GOOD!) with power/cost/reliability limits and scaling limits (BAD!)]
2006: TSUBAME1.0 as No.1 in Japan
• Total 85 TeraFlops, #7 on the Top500, June 2006
• > all university centers COMBINED (45 TeraFlops)
• > the Earth Simulator (40 TeraFlops, #1 2002~2004)
TSUBAME2.0, Nov. 1, 2010
“The Greenest Production Supercomputer in the World”
• TSUBAME 2.0: new development (32nm / 40nm silicon)
• Node level: >400GB/s Mem BW, 80Gbps NW BW, ~1KW max
• Intermediate (chassis/rack) level: >1.6TB/s Mem BW; >12TB/s Mem BW, 35KW max
• System level: >600TB/s Mem BW, 220Tbps NW bisection BW, 1.4MW max
Performance Comparison of CPU vs. GPU
[Charts: peak performance (GFLOPS, 0-1750) and memory bandwidth (GByte/s, 0-200), CPU vs. GPU]
x5-6 socket-to-socket advantage in both compute and memory bandwidth, at the same power
(200W GPU vs. 200W CPU+memory+NW+…)
TSUBAME2.0 Compute Node (Thin Node)
HP SL390G7 (developed for TSUBAME 2.0), productized as HP ProLiant SL390s
• GPU: NVIDIA Fermi M2050 x 3, 515 GFlops and 3 GByte memory per GPU
• CPU: Intel Westmere-EP 2.93GHz x 2 (12 cores/node)
• Multi I/O chips, 72 PCI-e lanes (16 x 4 + 4 x 2) --- 3 GPUs + 2 IB QDR
• Memory: 54 or 96 GB DDR3-1333; SSD: 60GB x 2 or 120GB x 2
• Network: InfiniBand QDR x 2 (80Gbps)
Per node: 1.6 Tflops, 400GB/s Mem BW, 80Gbps NW, ~1KW max
Total system: 2.4 PFlops, memory ~100TB, SSD ~200TB
TSUBAME2.0 Storage Overview
TSUBAME2.0 storage: 11PB total (7PB HDD, 4PB tape), on an InfiniBand QDR network for LNET and other services (QDR IB (x4) x 8 and x 20 links)
• Parallel file system volumes — “Global Work Space” #1-#3 (/work0, /work9, /work19) and “Scratch” (/gscr0): Lustre and GPFS with HSM (GPFS #1-#4) on SFA10k #1-#5; 3.6 PB at 30~60GB/s plus 2.4 PB HDD + ~4PB tape
• Home volumes: 1.2PB on SFA10k #6 (10GbE x 2), served as cNFS/Clustered Samba w/ GPFS and NFS/CIFS/iSCSI by BlueARC, for home directories, system applications, and iSCSI
• Node-local SSDs: “thin node SSD” and “fat/medium node SSD”, 250 TB, 300~500GB/s scratch
• Grid storage: 130 TB => 500TB~1PB
TSUBAME2.0 Storage Overview (usage view)
TSUBAME2.0 storage: 11PB (7PB HDD, 4PB tape), InfiniBand QDR network for LNET and other services
• Concurrent parallel I/O (e.g. MPI-IO) and read-mostly I/O (data-intensive apps, parallel workflows, parameter surveys): the “Global Work Space” #1-#3 parallel file system volumes (/work0, /work9, /work19)
• Fine-grained R/W I/O (checkpoints, temporary files, Big Data processing): the “Scratch” volume (/gscr0, Lustre, 3.6 PB) plus node-local SSDs (“thin node SSD”, “fat/medium node SSD”; 250 TB, 300GB/s scratch)
• Home storage for compute nodes and cloud-based campus storage services: home volumes, 1.2PB (cNFS/Clustered Samba w/ GPFS, NFS/CIFS/iSCSI by BlueARC, iSCSI)
• HPCI storage (130 TB => 500TB~1PB): data transfer service between SCs/CCs; long-term backup on 2.4 PB HDD + ~4PB tape (GPFS with HSM)
3500 Fiber Cables > 100Km
w/DFB Silicon Photonics
End-to-End 7.5GB/s, > 2us
Non-Blocking 200Tbps Bisection

2010: TSUBAME2.0 as No.1 in Japan
• Total 2.4 Petaflops, #4 on the Top500, Nov. 2010
• > all other Japanese centers on the Top500 COMBINED (2.3 PetaFlops)
TSUBAME Wins Awards…
“Greenest Production Supercomputer in the World”, the Green 500, Nov. 2010 and June 2011 (#4 Top500 Nov. 2010)
3 times more power efficient than a laptop!
TSUBAME Wins Awards…
ACM Gordon Bell Prize 2011: 2.0 Petaflops dendrite simulation
Special Achievements in Scalability and Time-to-Solution
“Peta-Scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer”

TSUBAME Wins Awards…
Commendation for Science & Technology by the Minister of Education 2012 (文部科学大臣表彰)
Prize for Science & Technology, Development Category
Development of the Greenest Production Peta-scale Supercomputer
Satoshi Matsuoka, Toshio Endo, Takayuki Aoki
Precise blood-flow simulation of an artery on TSUBAME2.0 (Bernaschi et al., IAC-CNR, Italy)
Personal CT scan + simulation => accurate diagnostics of cardiac illness
5 billion red blood cells + 10 billion degrees of freedom
MUPHY: multiphysics simulation of blood flow (Melchionna, Bernaschi et al.)
• Combined Lattice-Boltzmann (LB) simulation for plasma and Molecular Dynamics (MD) for red blood cells
• Realistic geometry (from a CAT scan); multiphysics simulation with the MUPHY software
  – Fluid: blood plasma, Lattice Boltzmann
  – Body: red blood cells, coupled via extended MD; RBCs are represented as ellipsoidal particles
  – The irregular mesh is partitioned with the PT-SCOTCH tool, considering the cutoff distance
• Two levels of parallelism: CUDA (on GPU) + MPI
  – 1 billion mesh nodes for the LB component
  – 100 million RBCs on 4000 GPUs
• 0.6 Petaflops; ACM Gordon Bell Prize 2011 Honorable Mention
Lattice-Boltzmann LES with a coherent-structure SGS model [Onodera & Aoki 2013]
• Coherent-structure Smagorinsky model: the model parameter is locally determined from the second invariant of the velocity gradient tensor (Q) and the energy dissipation (ε)
• Suited to turbulent flow around complex objects and to large-scale parallel computation
Computational Area – Entire Downtown Tokyo
• Major part of Tokyo, 10km x 10km, including Shinjuku-ku, Chiyoda-ku, Minato-ku, Meguro-ku, Chuo-ku (Shinjuku, Tokyo, Shibuya, Shinagawa)
• Building data: Pasco Co. Ltd. TDM 3D
• Achieved 0.592 Petaflops using over 4000 GPUs (15% efficiency)
(Map ©2012 Google, ZENRIN)
Area around the Metropolitan Government Building
• Flow profile at 25m height above the ground; wind over a 640 m x 960 m area
(Map data ©2012 Google, ZENRIN)
Current weather forecast: 5km resolution (inaccurate cloud simulation)
ASUCA typhoon simulation on TSUBAME2.0: 500m resolution, 4792 x 4696 x 48 grid on 437 GPUs (x1000 the resolution)
CFD analysis over a car body
Calculation conditions:
• Number of grid points: 3,623,878,656 (3,072 x 1,536 x 768)
• Grid resolution: 4.2mm (13m x 6.5m x 3.25m domain)
• Number of GPUs: 288 (96 nodes)
• Inflow velocity: 60 km/h

LBM, DrivAer model (BMW-Audi), Lehrstuhl für Aerodynamik und Strömungsmechanik, Technische Universität München
3,000 x 1,500 x 1,500 grid, Re = 1,000,000

Industry program: TOTO Inc. — TSUBAME with 150 GPUs vs. an in-house cluster
Drug discovery with Astellas Pharma for drugs against tropical diseases such as dengue fever
• Accelerate in-silico screening and data mining
• 100-million-atom MD simulation — M. Sekijima (Tokyo Tech), Jim Phillips (UIUC)
• Mixed-precision Amber on TSUBAME2.0 for industrial drug discovery: x10 faster and 75% energy efficient (nucleosome, 25,095 particles)
• Development cost is $500mil~$1bil per drug; even a 5-10% improvement of the process will more than pay for TSUBAME
Towards TSUBAME 3.0
Interim upgrade of TSUBAME2.0 to 2.5 (early fall 2013)
• Upgrade the TSUBAME2.0 GPUs from NVIDIA Fermi M2050 to Kepler K20X (3 x 1408 = 4224 GPUs in the TSUBAME2.0 compute nodes)
• SFP/DFP peak rises from 4.8PF/2.4PF => 17PF/5.7PF (c.f. the K Computer: 11.2/11.2)
• Acceleration of important apps — considerable improvement, summer 2013
• Significant capacity improvement at low cost and without a power increase
• TSUBAME3.0 to follow in 2H2015
TSUBAME2.0 ⇒ 2.5 Thin Node Upgrade
HP SL390G7 (developed for TSUBAME 2.0, modified for 2.5); productized as HP ProLiant SL390s, modified for TSUBAME2.5
• GPU: NVIDIA Kepler K20X x 3 (3950/1310 GFlops SFP/DFP, 6 GByte memory per GPU), replacing NVIDIA Fermi M2050 (1039/515 GFlops)
• CPU: Intel Westmere-EP 2.93GHz x 2
• Multi I/O chips, 72 PCI-e lanes (16 x 4 + 4 x 2) --- 3 GPUs + 2 IB QDR
• Memory: 54 or 96 GB DDR3-1333; SSD: 60GB x 2 or 120GB x 2
• Network: InfiniBand QDR x 2 (80Gbps)
Peak performance per node: 4.08 Tflops, ~800GB/s Mem BW, ~1KW max
2013: TSUBAME2.5 No.1 in Japan in single-precision FP, 17 Petaflops
• Total: 17.1 Petaflops SFP, 5.76 Petaflops DFP
• ~= all university centers COMBINED (9 Petaflops SFP)
• c.f. the K Computer: 11.4 Petaflops SFP/DFP
TSUBAME2.0 vs. TSUBAME2.5 — Thin Node x 1408 units
• Node machine: HP ProLiant SL390s (no change)
• CPU: Intel Xeon X5670 (6-core 2.93GHz, Westmere) x 2 (no change)
• GPU:
  – TSUBAME2.0: NVIDIA Tesla M2050 x 3 — 448 CUDA cores (Fermi), SFP 1.03 TFlops, DFP 0.515 TFlops, 3GiB GDDR5 memory, 150GB/s peak / ~90GB/s STREAM memory BW
  – TSUBAME2.5: NVIDIA Tesla K20X x 3 — 2688 CUDA cores (Kepler), SFP 3.95 TFlops, DFP 1.31 TFlops, 6GiB GDDR5 memory, 250GB/s peak / ~180GB/s STREAM memory BW
• Node performance (incl. CPU turbo boost):
  – TSUBAME2.0: SFP 3.40 TFlops, DFP 1.70 TFlops, ~500GB/s peak / ~300GB/s STREAM memory BW
  – TSUBAME2.5: SFP 12.2 TFlops, DFP 4.08 TFlops, ~800GB/s peak / ~570GB/s STREAM memory BW
• Total system performance:
  – TSUBAME2.0: SFP 4.80 PFlops, DFP 2.40 PFlops, peak ~0.70PB/s / STREAM ~0.440PB/s memory BW
  – TSUBAME2.5: SFP 17.1 PFlops (x3.6), DFP 5.76 PFlops (x2.4), peak ~1.16PB/s / STREAM ~0.804PB/s memory BW (x1.8)
Phase-field simulation for dendritic solidification [Shimokawabe, Aoki et al.]
• Developing lightweight strengthening materials by controlling the microstructure (towards a low-carbon society)
• Weak scaling on TSUBAME (single precision), mesh size per GPU (+4 CPU cores): 4096 x 162 x 130
  – TSUBAME 2.5: 3.444 PFlops (3,968 GPUs + 15,872 CPU cores), 4,096 x 5,022 x 16,640 mesh
  – TSUBAME 2.0: 2.000 PFlops (4,000 GPUs + 16,000 CPU cores), 4,096 x 6,480 x 13,000 mesh
• Peta-scale phase-field simulations can simulate the multiple dendritic growth during solidification required for the evaluation of new materials
• 2011 ACM Gordon Bell Prize, Special Achievements in Scalability and Time-to-Solution
Peta-scale stencil application: a large-scale LES wind simulation using the Lattice Boltzmann Method [Onodera, Aoki et al.]
• Large-scale wind simulation for a 10km x 10km area of metropolitan Tokyo: 10,080 x 10,240 x 512 mesh (4,032 GPUs)
• Weak scalability in single precision (N = 192 x 256 x 256 per GPU, with overlap):
  – TSUBAME 2.5: 1142 TFlops (3968 GPUs), 288 GFlops/GPU
  – TSUBAME 2.0: 149 TFlops (1000 GPUs), 149 GFlops/GPU — x1.93 per GPU
• The LES wind simulation of a 10km x 10km area at 1-m resolution has never been done before in the world
• We achieved 1.14 PFLOPS using 3968 GPUs on the TSUBAME 2.5 supercomputer
• The above peta-scale simulations were executed under the TSUBAME Grand Challenge Program, Category A, in fall 2012
AMBER pmemd benchmark (Dr. Sekijima @ Tokyo Tech)
Nucleosome = 25,095 atoms; throughput in ns/day (TSUBAME2.5 K20X vs. TSUBAME2.0 M2050):
• K20X x 8: 11.39    M2050 x 8: 3.44
• K20X x 4: 6.66     M2050 x 4: 2.22
• K20X x 2: 4.04     M2050 x 2: 1.85
• K20X x 1: 3.11     M2050 x 1: 0.99
• CPU-only MPI: 4 nodes 0.31, 2 nodes 0.15, 1 node (12 cores) 0.11
Application performance: TSUBAME2.0 vs. TSUBAME2.5 (boost ratio)
• Top500/Linpack (PFlops): 1.192 → 2.843 (x2.39)
• Green500/Linpack (GFlops/W): 0.958 → >2.400 (>x2.50)
• Semi-definite programming, nonlinear optimization (PFlops): 1.019 → 1.713 (x1.68)
• Gordon Bell dendrite stencil (PFlops): 2.000 → 3.444 (x1.72)
• LBM LES whole-city airflow (PFlops): 0.600 → 1.142 (x1.90)
• Amber 12 pmemd, 4 nodes / 8 GPUs (nsec/day): 3.44 → 11.39 (x3.31)
• GHOSTM genome homology search (sec): 19361 → 10785 (x1.80)
• MEGADOC protein docking (vs. 1 CPU core): 37.11 → 83.49 (x2.25)
TSUBAME Evolution: Towards Exascale and Extreme Big Data
[Roadmap figure: TSUBAME2.5 (5.7PF; fast I/O 250TB at 300GB/s; awards incl. Graph 500 No. 3 in 2011) ⇒ TSUBAME3.0 in 2015H2 (25-30PF; Phase 1 fast I/O 5~10PB at 1TB/s, >100 million IOPS; Phase 2 fast I/O 10TB/s; data growing from 30PB/day towards 1 ExaB/day)]


DoE Exascale Parameters — x1000 power efficiency in 10 years

System attributes            "2010" Jaguar   "2010" TSUBAME2.0   "2015"               "2020"
System peak                  2 PetaFlops     2.4 PetaFlops       100-200 PetaFlops    1 ExaFlop
Power                        6 MW            1.3 MW              15 MW                20 MW
System memory                0.3 PB          0.1 PB              5 PB                 32-64 PB
Node performance             125 GF          1.6 TF              0.5 TF or 7 TF       1 TF or 10 TF
Node memory BW               25 GB/s         0.5 TB/s            0.1 TB/s or 1 TB/s   0.4 TB/s or 4 TB/s
Node concurrency             12              O(1000)             O(100) or O(1000)    O(1000) or O(10000)
# Nodes                      18,700          1,442               50,000 or 5,000      1 million or 100,000
Total node interconnect BW   1.5 GB/s        8 GB/s              20 GB/s              200 GB/s
MTTI                         O(days)         —                   O(1 day)             O(1 day)

Exascale means on the order of a billion cores in total.
Challenges of Exascale (FLOPS, Bytes, …) (10^18)!
Various physical limitations surface all at once:
• # CPU cores: ~1 billion, at low power
  – c.f. total # of smartphones sold globally = 400 million; the K Computer ~100K nodes; Google ~1 million servers
• # Nodes: 100K ~ a few million
• Memory: x00 PB ~ ExaB
  – c.f. total memory of all PCs (300 million) shipped globally in 2011 ~ 1 ExaB; BTW 2^64 ~= 1.8x10^19 = 18 ExaB
• Storage: x ExaB — c.f. Google storage ~2 Exabytes (200 million users x 7GB+)
• All of this at 20MW (50 GFlops/W), with reliability (MTTI = days), ease of programming (a billion cores?), and acceptable cost… in 2020?!
Focused Research Towards TSUBAME 3.0 and Beyond, Towards Exa
• Green computing: ultra power-efficient HPC
• High-radix bisection networks – HW, topology, routing algorithms, placement…
• Fault tolerance – group-based hierarchical checkpointing, fault prediction, hybrid algorithms
• Scientific “extreme” Big Data – ultra-fast I/O, Hadoop acceleration, large graphs
• New memory systems – pushing the envelope of low power vs. capacity vs. BW, exploiting the deep hierarchy with new algorithms to decrease Bytes/Flop
• Post-petascale programming – OpenACC and other manycore programming substrates, task parallelism
• Scalable algorithms for many-core – apps/system/HW co-design
JST-CREST “Ultra Low Power (ULP)-HPC” Project, 2007-2012
• Low-power, high-performance model: ultra multi-core (slow & parallel, ULP), ULP-HPC SIMD/vector (GPGPU, etc.), ULP-HPC networks, and novel memory devices (MRAM, PRAM, Flash, etc.)
• Power optimization using novel components in HPC; power-aware and optimizable applications
• Auto-tuning for performance & power: x10 power efficiency at the optimization point between power and performance, towards x1000 improvement in 10 years

ABCLibScript: algorithm selection (auto-tuning specified before execution; the algorithm-selection regions and the input variables used in the cost-definition functions are annotated):

!ABCLib$ static select region start
!ABCLib$ parameter (in CacheS, in NB, in NPrc)
!ABCLib$ select sub region start
!ABCLib$ according estimated
!ABCLib$ (2.0d0*CacheS*NB)/(3.0d0*NPrc)
      [target 1 (algorithm 1)]
!ABCLib$ select sub region end
!ABCLib$ select sub region start
!ABCLib$ according estimated
!ABCLib$ (4.0d0*CacheS*dlog(NB))/(2.0d0*NPrc)
      [target 2 (algorithm 2)]
!ABCLib$ select sub region end
!ABCLib$ static select region end

Bayesian fusion of the cost model and measurements:
• Bayes model and prior distribution — the execution time estimated by the cost-definition function is fused with measured execution times:
  $y_i \sim N(\mu_i, \sigma_i^2)$, $\mu_i \mid \beta, \sigma_i^2 \sim N(x_i^T \beta, \sigma_i^2 / \kappa_0)$, $\sigma_i^2 \sim \mathrm{Inv}\text{-}\chi^2(\nu_0, \sigma_0^2)$
• Posterior predictive distribution after n measurements:
  $y_i \mid (y_{i1}, y_{i2}, \dots, y_{in}) \sim t_{\nu_n}\!\left(\mu_{in}, \sigma_{in}^2 (\kappa_n + 1)/\kappa_n\right)$, with
  $\kappa_n = \kappa_0 + n$, $\nu_n = \nu_0 + n$, $\mu_n = (\kappa_0 x_i^T \beta + n \bar{y}_i)/\kappa_n$,
  $\nu_n \sigma_n^2 = \nu_0 \sigma_0^2 + \sum_m (y_{im} - \bar{y}_i)^2 + \kappa_0 n (\bar{y}_i - x_i^T \beta)^2 / \kappa_n$, and $\bar{y}_i = \frac{1}{n}\sum_m y_{im}$
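
Below is a minimal Python sketch (an illustration, not the project's ABCLibScript/auto-tuner code) of the conjugate Normal-Inverse-χ² update sketched above: the cost-model estimate acts as the prior mean for an algorithm's runtime, measurements update it, and the algorithm with the lower posterior-predictive mean is selected. All names, priors, and numbers here are hypothetical.

def posterior_predictive(prior_mean, kappa0, nu0, sigma0_sq, measurements):
    # Conjugate update for a Normal likelihood with unknown mean and variance:
    # returns (mean, scale^2, dof) of the Student-t posterior predictive
    # for one algorithm's execution time.
    n = len(measurements)
    ybar = sum(measurements) / n
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * prior_mean + n * ybar) / kappa_n
    ss = sum((y - ybar) ** 2 for y in measurements)
    nu_sigma_sq = (nu0 * sigma0_sq + ss
                   + kappa0 * n * (ybar - prior_mean) ** 2 / kappa_n)
    sigma_n_sq = nu_sigma_sq / nu_n
    return mu_n, sigma_n_sq * (kappa_n + 1) / kappa_n, nu_n

# Hypothetical example: two algorithms, prior means taken from cost-definition
# functions such as (2.0*CacheS*NB)/(3.0*NPrc); runtimes in seconds.
candidates = {
    "algorithm1": posterior_predictive(0.82, 1.0, 1.0, 0.04, [0.90, 0.88, 0.95]),
    "algorithm2": posterior_predictive(1.10, 1.0, 1.0, 0.04, [0.70, 0.75]),
}
best = min(candidates, key=lambda name: candidates[name][0])
print("selected:", best, candidates[best])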
Aggressive Power Saving in HPC: Methodologies (Enterprise/Business Clouds vs. HPC)
• Server consolidation: Good for clouds, not applicable (NG!) for HPC
• DVFS (dynamic voltage/frequency scaling): Good for clouds, Poor for HPC
• New devices: Poor for clouds (cost & continuity), Good for HPC
• New HW & SW architecture: Poor for clouds (cost & continuity), Good for HPC
• Novel cooling: Limited for clouds (cost & continuity), Good for HPC (high thermal density)
How do we achieve x1000?
Process shrink x100
  x Many-core GPU usage x5
  x DVFS & other low-power SW x1.5
  x Efficient cooling x1.4
  = x1000 !!!
(ULP-HPC Project 2007-12; Ultra Green Supercomputing Project 2011-15)
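
Multiplying the factors above confirms the arithmetic behind the x1000 goal:

$$100 \times 5 \times 1.5 \times 1.4 = 1050 \approx 1000$$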
Statistical Power Modeling of GPUs [IEEE IGCC10]
• Estimates GPU power consumption statistically from GPU performance counters
• Linear regression model using the performance counters $c_i$ as explanatory variables:
  $p = \sum_{i=1}^{n} \alpha_i c_i + \beta$
• High accuracy (average error 4.7%) against average power consumption measured with a high-resolution power meter; accurate even with DVFS
• Prevents overfitting by ridge regression; determines optimal parameters by cross validation
• Future: model-based power optimization — a linear model shows sufficient accuracy, opening the possibility of optimizing exascale systems with O(10^8) processors
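
As a hedged illustration of this kind of counter-based power model (not the authors' actual tooling; the counters and wattages below are synthetic), a ridge-regularized least-squares fit in Python looks like this:

import numpy as np

# Rows: kernel runs; columns: GPU performance counters (e.g. instructions,
# global loads/stores, ...). All data here are synthetic placeholders.
rng = np.random.default_rng(0)
C = rng.random((50, 4))
true_alpha = np.array([30.0, 55.0, 20.0, 5.0])                 # per-counter weights (W)
p_measured = C @ true_alpha + 40.0 + rng.normal(0.0, 2.0, 50)  # 40 W static power + noise

# Ridge regression: minimize ||X w - p||^2 + lam ||w||^2;
# the trailing column of ones models the intercept (static power).
X = np.hstack([C, np.ones((50, 1))])
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ p_measured)
alpha_hat, beta_hat = w[:-1], w[-1]

pred = X @ w
avg_err = np.mean(np.abs(pred - p_measured) / p_measured) * 100
print("counter weights:", alpha_hat, "static power:", beta_hat)
print(f"average relative error: {avg_err:.1f}%")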
Power efficiency in dendrite applications was measured on TSUBAME1.0 through the JST-CREST ULP-HPC prototype running the Gordon Bell dendrite app.
TSUBAME-KFC: Ultra-Green Supercomputer Testbed [2011-2015]
Fluid submersion cooling + outdoor air cooling + high-density GPU supercomputing in a 20-foot container
• Compute nodes: NEC/SMC 1U server x 40; per node: Intel IvyBridge 2.1GHz 6-core x 2, NVIDIA Tesla K20X GPU x 4, DDR3 memory 64GB, SSD 120GB, 4x FDR InfiniBand 56Gbps
• Total peak: 210 TFlops (DP), 630 TFlops (SP)
• Heat dissipation: GRC submersion rack with heat exchanger — processors 80~90°C ⇒ coolant oil (Spectrasyn8) 35~45°C ⇒ water 25~35°C ⇒ cooling tower to outdoor air
• Facility: 20-foot container (16 m²)
• Targets: world’s top power efficiency (>3 GFlops/Watt), average PUE 1.05, lower component power, field-testing of ULP-HPC results
TSUBAME-KFC: Towards TSUBAME3.0 and Beyond
Shooting for #1 on the Nov. 2013 Green 500!

Machine                       Power    Linpack    Linpack    Factor           Total Mem BW    Mem BW
                                       Perf (PF)  MFLOPS/W   (incl. cooling)  TB/s (STREAM)   MByte/s/W
Earth Simulator 1             10MW     0.036      3.6        13,400           160             16
Tsubame1.0 (2006Q1)           1.8MW    0.038      21         2,368            13              7.2
ORNL Jaguar (XT5, 2009Q4)     ~9MW     1.76       196        256              432             48
Tsubame2.0 (2010Q4)           1.8MW    1.2        667        75               440             244
K Computer (2011Q2)           ~16MW    10         625        80               3300            206
BlueGene/Q (2012Q1)           ~12MW?   17         ~1400      ~35              3000            250
TSUBAME2.5 (2013Q3)           1.4MW    ~3         ~2100      ~24              802             572
Tsubame3.0 (2015Q4~2016Q1)    1.5MW    ~20        ~13,000    ~4               6000            4000
EXA (2019~20)                 20MW     1000       50,000     1                100K            5000

(Generational gains noted on the slide: roughly x31.6 in Linpack MFLOPS/W and x34 in memory-BW per watt from Tsubame1.0 to 2.0, and ~x20 in MFLOPS/W and ~x13.7 in total memory BW from Tsubame2.0 to 3.0.)
Extreme Big Data (EBD): Next-Generation Big Data Infrastructure Technologies Towards Yottabytes/Year
Principal Investigator: Satoshi Matsuoka
Global Scientific Information and Computing Center, Tokyo Institute of Technology
The current “Big Data” is not really that big…
• Typical “real” definition: “mining people’s privacy data to make money”
• Corporate data usually sit in data-warehoused silos -> limited volume, gigabytes~terabytes, seldom petabytes
• Processing involves simple O(n) algorithms, or those that can be accelerated with DB-inherited indexing algorithms
• Executed on re-purposed commodity “web” servers linked with 1Gbps networks running Hadoop/HDFS
• Vicious cycle of stagnation in innovation…
• NEW: breaking down of silos ⇒ convergence with supercomputing with Extreme Big Data
But “Extreme Big Data” will change everything
• “Breaking down of silos” (Rajeeb Hazra, Intel VP of Technical Computing)
• Already happening in science & engineering due to the Open Data movement
• More complex analysis algorithms: O(n log n), O(m x n), …
• Will become the NORM for competitiveness reasons
We will have tons of unknown genes [slide courtesy Yutaka Akiyama @ Tokyo Tech]
Metagenome analysis
• Directly sequencing uncultured microbiomes obtained from the target environment and analyzing the sequence data
  – Finding novel genes from unculturable microorganisms
  – Elucidating the composition of species/genes of environments
• Examples of microbiomes: gut microbiome, human body, soil, sea
Results from the Akiyama group @ Tokyo Tech
Ultra-high-sensitivity “big data” metagenome sequence analysis of the human oral microbiome
• Required > 1 million node*hours on the K computer (572.8 M reads/hour on 82,944 nodes, 663,552 cores, 2012)
• World’s most sensitive sequence analysis (based on an amino-acid similarity matrix)
• Discovered at least three microbiome clusters with functional differences (integrated 422 experiment samples taken from 9 different oral parts)
• Mapped onto the metabolic pathway map: inside of the dental arch, outside of the dental arch, dental plaque
Extreme Big Data in Genomics: impact of new-generation sequencers [slide courtesy Yutaka Akiyama @ Tokyo Tech]
• Sequencing data (bp) per $ grows x4000 every 5 years — c.f. HPC improves x33 in 5 years
(Lincoln Stein, Genome Biology, vol. 11(5), 2010)
Extremely “Big” Graphs
• Large-scale graphs in various fields:
  – US road network: 24 million vertices, 58 million edges
  – Twitter follow-ship (2009): 61.6 million vertices, 1.47 billion edges
  – Neuronal network @ Human Brain Project: 89 billion vertices, 100 trillion edges
  – Cyber-security: 15 billion log entries / day
• Fast and scalable graph processing by using HPC
  – Graph500 on the K computer: 65,536 nodes, 5524 GTEPS
  – c.f. an Android tablet (Tegra3 1.7GHz, 1GB RAM): 0.15 GTEPS, 64.12 MTEPS/W
[Chart: problem scale in log2(n) vertices vs. log2(m) edges, from the USA road networks (NY, LKS, USA) and Twitter tweets/day up through the Graph500 classes Toy/Mini/Small/Medium/Large/Huge (~1 billion to ~1 trillion vertices and edges) and the Human Brain Project]
Towards Continuous Billion-Scale Social Simulation with Real-Time Streaming Data (Toyotaro Suzumura, IBM / Tokyo Tech)
• Application: target area = the planet (OpenStreetMap), 7 billion people
• Input data:
  – Road network (OpenStreetMap) for the planet: 300 GB (XML)
  – Trip data for 7 billion people: 10 KB (1 trip) x 7 billion = 70 TB
  – Real-time streaming data (e.g. social sensors, physical data)
• Simulated output for 1 iteration: 700 TB
Graph500 “Big Data” Benchmark
• Kronecker graph, BSP problem (BFS kernel)
• Edge-quadrant probabilities: A: 0.57, B: 0.19, C: 0.19, D: 0.05
• November 15, 2010 — “Graph 500 Takes Aim at a New Kind of HPC”, Richard Murphy (Sandia NL => Micron): “I expect that this ranking may at times look very different from the TOP500 list. Cloud architectures will almost certainly dominate a major chunk of part of the list.”
• Reality: Top500 supercomputers dominate, and no cloud IDCs appear at all; TSUBAME2.0 was #3 (Nov. 2011) and #4 (Jun. 2012)
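
For reference, a minimal Python sketch (not the Graph500 reference code) of generating a Kronecker/R-MAT edge list with the A/B/C/D probabilities quoted above; the scale and edge factor are illustrative:

import random

def kronecker_edges(scale, edgefactor=16, A=0.57, B=0.19, C=0.19, D=0.05, seed=1):
    # Each edge picks one of four quadrants per bit with probabilities A, B, C, D,
    # building the endpoints of a 2^scale-vertex R-MAT/Kronecker graph.
    rng = random.Random(seed)
    num_edges = edgefactor * (1 << scale)
    edges = []
    for _ in range(num_edges):
        u = v = 0
        for _bit in range(scale):
            r = rng.random()
            if r < A:
                quad = (0, 0)
            elif r < A + B:
                quad = (0, 1)
            elif r < A + B + C:
                quad = (1, 0)
            else:
                quad = (1, 1)
            u = (u << 1) | quad[0]
            v = (v << 1) | quad[1]
        edges.append((u, v))
    return edges

# Tiny example: scale 10 -> 1024 vertices, ~16K edges
print(len(kronecker_edges(10)), "edges generated")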
Tokyo Tech supercomputer TSUBAME 2.0 (#4 Top500, 2010) vs. a major northern-Japanese cloud datacenter (2013):
• TSUBAME 2.0: ~1500 nodes (compute & storage) on a full-bisection, multi-rail optical network — injection 80Gbps/node, bisection 220 Terabps; advanced silicon photonics (40G on a single CMOS die, 1490nm DFB, 100km fiber)
• The cloud datacenter: 8 zones of ~700 nodes each (2 Juniper EX4200 zone switches per zone as a virtual chassis, EX8208 and MX480 aggregation, 10GbE links with LACP out to the Internet) — total 5600 nodes, injection 1Gbps/node, bisection 160 Gigabps
• Roughly x1000 difference in bisection bandwidth (220 Tbps vs. 160 Gbps)
But what does “220Tbps” mean?
Global IP Traffic, 2011-2016 (source: Cisco), by type — PB per month / average bitrate in Tbps:

Type              2011            2012            2013            2014            2015            2016            CAGR 2011-2016
Fixed Internet    23,288 / 71.9   32,990 / 101.8  40,587 / 125.3  50,888 / 157.1  64,349 / 198.6  81,347 / 251.1  28%
Managed IP        6,849 / 21.1    9,199 / 28.4    11,846 / 36.6   13,925 / 43.0   16,085 / 49.6   18,131 / 56.0   21%
Mobile data       597 / 1.8       1,252 / 3.9     2,379 / 7.3     4,215 / 13.0    6,896 / 21.3    10,804 / 33.3   78%
Total IP traffic  30,734 / 94.9   43,441 / 134.1  54,812 / 169.2  69,028 / 213.0  87,331 / 269.5  110,282 / 340.4 29%

The TSUBAME2.0 network has TWICE the capacity of the global Internet, which is used by 2.1 billion users.
“Convergence” at future extreme scale for computing and data (in clouds?)
• HPC: x1000 in 10 years (CAGR ~= 100%)
• IDC: x30 in 10 years; server unit sales are flat (replacement demand) (CAGR ~= 30-40%)
(Source: Assessing trends over time in performance, costs, and energy use for servers, Intel, 2009)
What does this all mean?
• “Leveraging of mainframe technologies in HPC has
been dead for some time.”
• But will leveraging Cloud/Mobile be sufficient?
• NO! They are already falling behind, and will be
perpetually behind
– CAGR of Clouds 30%, HPC 100%: all data supports it
– Stagnation in network, storage, scaling, …

• Rather, HPC will be the technology driver for
future Big Data, for Cloud/Mobile to leverage!
– Rather than repurposed standard servers
Future “Extreme Big Data”
• NOT mining terabytes of silo data
• Peta~zettabytes of data, ultra-high-BW data streams
• Highly unstructured, irregular; complex correlations between data from multiple sources
• Extreme capacity, bandwidth, and compute all required

Extreme Big Data is not just traditional HPC — analysis of required system properties [slide courtesy Alok Choudhary, Northwestern U.]

[Radar chart: “Extreme-Scale Computing” vs. “Big Data Analytics” vs. a “BDEC Knowledge Discovery Engine”, rated 0-1 on the axes processor speed, OPS, memory/ops, algorithmic variety, power-optimization opportunities, communication-pattern variability, approximate computations, communication latency tolerance, write performance, read performance, and local persistent storage]
EBD Research Scheme
• Future non-silo Extreme Big Data apps: ultra-large-scale graphs and social infrastructures, large-scale metagenomics, massive sensors and data assimilation in weather prediction
• Co-design between these apps and the EBD system software (incl. the EBD object system): EBD Bag, Cartesian-plane KVS, EBD KVS, graph store
• Exascale Big Data HPC: a convergent architecture (Phases 1~4) with large-capacity NVM and a high-bisection network — in contrast to cloud IDCs (very low BW & efficiency) and today’s supercomputers (compute- & batch-oriented)
• Target node: NVM/Flash and DRAM stacked over a TSV interposer and PCB, with low-power CPUs beside a high-powered main CPU; 2Tbps HBM with 4~6 HBM channels, 1.5TB/s DRAM & NVM BW; 30PB/s I/O BW possible, 1 Yottabyte/year
Phase 4: 2019-20, DRAM + NVM + CPU with 3D/2.5D die stacking
— the ultimate convergence of Big Data and Extreme Computing —
• NVM/Flash and DRAM stacked with low-power CPUs and a high-powered main CPU on a TSV interposer and PCB
• 2Tbps HBM, 4~6 HBM channels, 1.5TB/s DRAM & NVM BW
• 30PB/s I/O BW possible — 1 Yottabyte/year
Preliminary I/O Performance Evaluation on GPU and NVRAM
How to design local storage for next-generation supercomputers?
• Designed a local I/O prototype using 16 mSATA SSDs on a RAID card on the motherboard
  – Capacity: 4TB; read bandwidth: 8 GB/s
• I/O performance of multiple mSATA SSDs (raw mSATA 4KB, RAID0 1MB, RAID0 64KB): ~7.39 GB/s from 16 mSATA SSDs with RAID0 enabled
• I/O performance from GPU to multiple mSATA SSDs: ~3.06 GB/s from 8 mSATA SSDs to the GPU (matrix sizes 0.274 GB to 140 GB)
Algorithm Kernels on EBD: Large-Scale BFS Using NVRAM

1. Introduction
• Large-scale graph processing appears in various domains, and the required DRAM resources have increased
• Spread of flash devices — pros: price per bit, energy consumption; cons: latency, throughput
• Using NVRAM for large-scale graph processing has the potential for minimal performance degradation

2. Hybrid BFS: switch between two approaches, top-down and bottom-up, based on the number of frontier vertices n_frontier relative to the number of all vertices n_all, with switching parameters α, β

3. Proposal
• (1) Offload small-access data to NVRAM
• (2) BFS with reading data from NVRAM

4. Evaluation
• DRAM only (β=10α): 5.2 GTEPS; DRAM+SSD (β=0.1α): 2.8 GTEPS (47.1% down), with the switching parameter α swept from 1e4 to 1e7
• We could reduce the DRAM size by half with 47.1% performance degradation (130M vertices, 2.1G edges)
• c.f. Pearce et al.: 13 times larger datasets at 52 MTEPS (DRAM 1TB, 12TB NVRAM)
• We are working on multiplexed I/O → multiplexed I/O improves NVRAM I/O performance
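
A minimal Python sketch (illustrative only, not the authors' implementation) of the direction-switching idea in hybrid BFS: expand top-down while the frontier is small, and switch to bottom-up once it grows past a threshold controlled by a parameter in the spirit of α above:

def hybrid_bfs(adj, source, alpha=14.0):
    # adj: dict vertex -> list of neighbours (undirected graph).
    # Returns BFS levels; switches between top-down and bottom-up sweeps
    # in the spirit of direction-optimizing BFS.
    n = len(adj)
    level = {v: -1 for v in adj}
    level[source] = 0
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        next_frontier = set()
        if len(frontier) * alpha < n:
            # top-down: frontier vertices push to unvisited neighbours
            for u in frontier:
                for w in adj[u]:
                    if level[w] == -1:
                        level[w] = depth
                        next_frontier.add(w)
        else:
            # bottom-up: unvisited vertices look for a parent in the frontier
            for w in adj:
                if level[w] == -1 and any(level[u] == depth - 1 for u in adj[w]):
                    level[w] = depth
                    next_frontier.add(w)
        frontier = next_frontier
    return level

# Toy graph; both directions give the same levels here.
g = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(hybrid_bfs(g, 0))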
High-Performance Sorting
Fast algorithms: distribution-based vs. comparison-based
• Comparison of keys: N log(N) classical sorts (quick, merge, etc.) and bitonic sort; handle variable-length / short / long keys such as alphabetic strings (apple, apricot, banana, kiwi)
• Integer (distribution) sorts:
  – MSD radix sort: doesn’t have to examine all characters — useful in computational genomics (A,C,G,T)
  – LSD radix sort (e.g. THRUST): high efficiency on small fixed-length keys
• GPUs are good at counting numbers, so efficient implementations are good for GPU nodes; hybrid approaches — the best mix is still to be found; scalability requires balancing I/O and computation
• Map-Reduce (Hadoop): easy to use but not that efficient
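
As a small illustration of the LSD radix sort mentioned above (a sketch; the byte-wise passes and key width are arbitrary choices):

def lsd_radix_sort(keys, key_bytes=4):
    # Stable counting-sort passes from the least significant byte upward,
    # for non-negative integers that fit in key_bytes bytes.
    for shift in range(0, 8 * key_bytes, 8):
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[(k >> shift) & 0xFF].append(k)            # distribute by current byte
        keys = [k for bucket in buckets for k in bucket]      # stable gather
    return keys

data = [0x1A2B, 7, 0xFFFF, 300, 42, 300]
print(lsd_radix_sort(data))   # -> [7, 42, 300, 300, 6699, 65535]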
Twitter network (application of the Graph500 benchmark)
• Follow-ship network 2009: 41 million vertices and 2.47 billion edges (user i → user j (i,j)-edges)
• Our NUMA-optimized BFS on a 4-way Xeon system: 69 ms/BFS ⇒ 21.28 GTEPS
• Six degrees of separation: frontier size per BFS level, with source user 21,804,357:

Level   Frontier size   Freq. (%)   Cum. Freq. (%)
0       1               0.00        0.00
1       7               0.00        0.00
2       6,188           0.01        0.01
3       510,515         1.23        1.24
4       29,526,508      70.89       72.13
5       11,314,238      27.16       99.29
6       282,456         0.68        99.97
7       11,536          0.03        100.00
8       673             0.00        100.00
9       68              0.00        100.00
10      19              0.00        100.00
11      10              0.00        100.00
12      5               0.00        100.00
13      2               0.00        100.00
14      2               0.00        100.00
15      2               0.00        100.00
Total   41,652,230      100.00      -
100,000-Times-Fold EBD “Convergent” System Overview
• Tasks 5-1~5-3: EBD application co-design and validation — large-scale graphs and social infrastructure apps, large-scale genomic correlation, data assimilation in large-scale sensors and exascale atmospherics
• Tasks 1-2: EBD distributed object store on 100,000 NVM extreme compute-and-data nodes (EBD Bag, Cartesian-plane KVS, EBD KVS, graph store), over ultra-high-BW, low-latency NVM and network, processor-in-memory, 3D stacking
• Task 3: EBD programming system
• Task 4: EBD “converged” real-time resource scheduling
• Task 6: EBD performance modeling & evaluation
• Platform: ultra-parallel, low-power-I/O EBD “convergent” supercomputer (TSUBAME 2.0/2.5 ⇒ TSUBAME 3.0), ~10TB/s ⇒ ~100TB/s ⇒ ~10PB/s
Summary
• TSUBAME1.0 -> 2.0 -> 2.5 -> 3.0 -> …
  – TSUBAME 2.5: Number 1 in Japan, 17 Petaflops SFP
  – A template for future supercomputers and IDC machines
• TSUBAME3.0 in early 2016
  – New supercomputing leadership
  – Tremendous power efficiency, extreme big data, extremely high reliability
• Lots of background R&D for TSUBAME3.0 and towards exascale
  – Green computing: ULP-HPC & TSUBAME-KFC
  – Extreme Big Data – convergence of HPC and IDC!
  – Exascale resilience
  – Programming with millions of cores
  – …
• Please stay tuned! (乞うご期待。応援をお願いします。— Stay tuned; we appreciate your support.)

  • 1. TSUBAME2.5 to 3.0 and Convergence with Extreme Big Data Satoshi Matsuoka Professor Global Scientific Information and Computing (GSIC) Center Tokyo Institute of Technology Fellow, Association for Computing Machinery (ACM) Rakuten Technology Conference 2013 2013/10/26 Tokyo, Japan
  • 2. Supercomputers from the Past Fast, Big, Special, Inefficient, Evil device to conquer the world…
  • 3. Let us go back to the mid ’70s Birth of “microcomputers” and arrival of commodity computing (start of my career) • Commodity 8-bit CPUs… – Intel 4004/8008/8080/8085, Zilog Z-80, Motorola 6800, MOS Tech. 6502, … • Lead to hobbyist computing… – Evaluation boards: Intel SDK-80, Motorola MEK6800D2, MOS Tech. KIM-1, (in Japan) NEC TK-80, Fujitsu Lkit-8, … – System Kits: MITS Altair 8800/680b, IMSAI 8080, Proc. Tech. SOL-20, SWTPC 6800, … • & Lead to early personal computers – Commodore PET, Tandy TRS-80, Apple II – (in Japan): Hitachi Basic Master, NEC CompoBS / PC8001, Fujitsu FM-8, …
  • 4. Supercomputing vs. Personal Computing in the late 1970s. • Hitachi Basic Master (1978) – “The first PC in Japan” – Motorola 6802--1Mhz, 16KB ROM, 16KB RAM – Linpack in BASIC: Approx. 70-80 FLOPS (1/1,000,000) • We got “simulation” done (in assembly language) – Nintendo NES (1982) • MOS Technology 6502 1Mhz (Same as Apple II) – “Pinball” by Matsuoka & Iwata (now CEO Nintendo) • Realtime dynamics + collision + lots of shortcuts • Average ~a few KFLOPS Cf. Cray-1 Running Linpack 10 (1976) Linpack 80-90MFlops (est.)
  • 5. Then things got accelerated around the mid 80s to mid 90s (rapid commoditization towards what we use now) • PC CPUs: Intel 8086/286/386/486/Pentium (Superscalar&fast FP x86), Motorola 68000/020/030/040, … to Xeons, GPUs, Xeon Phi’s – C.f. RISCs: SPARC, MIPS, PA-RISC, IBM Power, DEC Alpha, … • Storage Evolution: Cassettes, Floppies to HDDs, optical disk to Flash • Network Evolution: RS-232C to Ethernet now to FDR Infinininband • PC (incl. I/O): IBM PC “Clones” and Macintoshes: ISA to VLB to PCIe • Software Evolution: CP/M to MS-DOS to Windows, Linux, • WAN Evolution: RS-232+Modem+BBS to Modem+Internet to ISDN/ADSL/FTTH broadband, DWDM Backbone, LTE, … • Internet Evolution: email + ftp to Web, Java, Ruby, … • Then Clusters, Grid/Clouds, 3-D Gaming, and Top500 all started in the mid 90s(!), and commoditized supercomputing
  • 6. Modern Day Supercomputers  Now supercomputers “look like” IDC servers  High-End COTS dominate Linux based machine with standard + HPC OSS Software Stack NEC Confidential
  • 8. Top Supercomputers vs. Global IDC K Computer (#1 2011-12) Riken-AICS Fujitsu Sparc VIII-fx Venus CPU 88,000 nodes, 800,000CPU cores ~11 Petaflops (1016) 1.4 Petabyte memory, 13 MW Power 864 racks、3000m2 Tianhe2 (#1 2013) China Gwanjou 48,000 KNC Xeon Phi + 36,000 Ivy Bridge Xeon 18,000 nodes, >3 Million CPU cores 54 Petaflops (1016) 0.8 Petabyte memory, 20 MW Power ??? racks、???m2 C.f. Amazon ~= 450,000 Nodes, ~3 million Cores #1 2012 IBM BlueGene/Q “Sequoia” Lawrence Livermore National Lab DARPA study IBM PowerPC System-On-Chip 98,000 nodes, 1.57million Cores 2020 Exaflop (1018) ~20 Petaflops 100 million~ 1.6 Petabytes, 8MW, 96 racks NEC Confidential 1 Billion Cores
  • 9. Scalability and Massive Parallelism  More nodes & core => Massive Increase in parallelism Faster, “Bigger” Simulation Qualitative Difference Performance BAD! GOOD! BAD! Ideal Linear Scaling Difficult to Achieve Limitations in Power, Cost, Reliability Limitations in Scaling CPU Cores ~= Parallelism NEC Confidential
  • 11. 2006: TSUBAME1.0 as No.1 in Japan All University Centers COMBINED 45 TeraFlops > Total 85 TeraFlops, #7 Top500 June 2006 Earth Simulator 40TeraFlops #1 2002~2004
  • 12. TSUBAME2.0 Nov. 1, 2010 “The Greenest Production Supercomputer in the World” TSUBAME 2.0 New Development 32nm 40nm >12TB/s Mem BW >400GB/s Mem BW >1.6TB/s Mem BW 35KW Max 80Gbps NW BW ~1KW max 12 >600TB/s Mem BW 220Tbps NW Bisecion BW 1.4MW Max
  • 13. 1500 1250 1000 750 500 CPU 250 0 GPU GPU Memory Bandwidth [GByte/s] Peak Performance [GFLOPS] 1750 Performance Comparison of CPU vs. 200 GPU 160 120 80 CPU 40 0 x5-6 socket-to-socket advantage in both compute and memory bandwidth, Same power (200W GPU vs. 200W CPU+memory+NW+…)
  • 14. TSUBAME2.0 Compute Node Thin Node Infiniband QDR x2 (80Gbps) 1.6 Tflops 400GB/s Mem BW 80GBps NW ~1KW max Productized as HP ProLiant SL390s HP SL390G7 (Developed for TSUBAME 2.0) GPU: NVIDIA Fermi M2050 x 3 515GFlops, 3GByte memory /GPU CPU: Intel Westmere-EP 2.93GHz x2 (12cores/node) Multi I/O chips, 72 PCI-e (16 x 4 + 4 x 2) lanes --- 3GPUs + 2 IB QDR Memory: 54, 96 GB DDR3-1333 SSD:60GBx2, 120GBx2 NEC Confidential Total Perf 2.4PFlops Mem: ~100TB SSD: ~200TB 4-1
  • 15. TSUBAME2.0 Storage Overview TSUBAME2.0 Storage 11PB (7PB HDD, 4PB Tape) Infiniband QDR Network for LNET and Other Services QDR IB (×4) × 8 QDR IB(×4) × 20 GPFS#1 SFA10k #1 SFA10k #2 /work9 “Global Work Space” #1 GPFS with HSM SFA10k #3 SFA10k #4 SFA10k #5 /work0 /work19 /gscr0 “Global Work Space” #2 “Global Work Space” #3 Lustre “Scratch” 3.6 PB 30~60GB/s GPFS#2 GPFS#3 10GbE × 2 GPFS#4 HOME HOME System application iSCSI SFA10k #6 “cNFS/Clusterd Samba w/ GPFS” “NFS/CIFS/iSCSI by BlueARC” Home Volumes 1.2PB Parallel File System Volumes 2.4 PB HDD + 〜4PB Tape “Thin node SSD” “Fat/Medium node SSD” 250 TB, 300~500GB/s Scratch 130 TB=> 500TB~1PB Grid Storage
  • 16. TSUBAME2.0 Storage Overview TSUBAME2.0 Storage 11PB (7PB HDD, 4PB Tape) Infiniband QDR Network for LNET and Other Services QDR IB (×4) × 8 QDR IB(×4) × 20 GPFS#1 Concurrent Parallel I/O (e.g. MPI-IO) SFA10k #1 SFA10k #2 /work9 SFA10k #3 SFA10k #4 SFA10k #5 /work0 /work19 /gscr0 Read mostly I/O (data-intensive apps, parallel workflow, “Global Work “Global Work parameterSpace” #1 survey) Space” #2 “Global Work Space” #3 GPFS with HSM “Scratch” Lustre 3.6 Fine-grained R/W PB I/O Parallel File System Volumes (checkpoints, temporary files, Big Data processing) GPFS#2 GPFS#3 10GbE × 2 GPFS#4 • Home storage for computing nodes •HOME Cloud-based campus storage HOME services System application iSCSI SFA10k #6 “cNFS/Clusterd Samba w/ GPFS” “NFS/CIFS/iSCSI by BlueARC” Home Volumes 1.2PB Data transfer service between SCs/CCs 2.4Long-Term PB HDD + Backup 〜4PB Tape “Thin node SSD” “Fat/Medium node SSD” 250 TB, 300GB/s Scratch 130 TB=> 500TB~1PB HPCI Storage
  • 17. 3500 Fiber Cables > 100Km w/DFB Silicon Photonics End-to-End 7.5GB/s, > 2us Non-Blocking 200Tbps Bisection NEC Confidential
  • 18. 2010: TSUBAME2.0 as No.1 in Japan — total 2.4 Petaflops, #4 on the Top500 in Nov. 2010, more than all other Japanese centers on the Top500 COMBINED (2.3 PetaFlops).
  • 19. TSUBAME Wins Awards… "Greenest Production Supercomputer in the World" — Green500, Nov. 2010 and June 2011 (#4 Top500 Nov. 2010); 3 times more power efficient than a laptop!
  • 20. TSUBAME Wins Awards… ACM Gordon Bell Prize 2011, Special Achievements in Scalability and Time-to-Solution: the 2.0 Petaflops dendrite simulation, "Peta-Scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer".
  • 21. TSUBAME Wins Awards… Commendation for Science & Technology by the Ministry of Education, 2012 (MEXT Minister's Award): Prize for Science & Technology, Development Category — Development of the Greenest Production Peta-scale Supercomputer; Satoshi Matsuoka, Toshio Endo, Takayuki Aoki.
  • 22. Precise blood-flow simulation of arteries on TSUBAME2.0 (Bernaschi et al., IAC-CNR, Italy) — personal CT scan + simulation => accurate diagnostics of cardiac illness; 5 billion red blood cells + 10 billion degrees of freedom.
  • 23. MUPHY: multiphysics simulation of blood flow (Melchionna, Bernaschi et al.) — combined Lattice-Boltzmann (LB) simulation for the plasma and Molecular Dynamics (MD) for red blood cells, on realistic geometry from a CAT scan. The fluid (blood plasma, Lattice Boltzmann) and the bodies (red blood cells, extended MD) are coupled; RBCs are represented as ellipsoidal particles. The irregular mesh is partitioned with the PT-SCOTCH tool, taking the cutoff distance into account. Two levels of parallelism: CUDA (on GPU) + MPI. 1 billion mesh nodes for the LB component, 100 million RBCs, 4,000 GPUs, 0.6 Petaflops — ACM Gordon Bell Prize 2011 Honorable Mention.
  • 24. Lattice-Boltzmann LES with a coherent-structure SGS model [Onodera & Aoki 2013] — the coherent-structure Smagorinsky model uses the second invariant of the velocity gradient tensor (Q) and the energy dissipation (ε); the model parameter is determined locally from the second invariant. Well suited to turbulent flow around complex objects and to large-scale parallel computation (a commonly cited form of the model is given below). Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
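For reference, a commonly cited form of the coherent-structure Smagorinsky model (following Kobayashi's formulation; the exact constants used by Onodera & Aoki are an assumption here, not taken from the slide):

```latex
\nu_{\mathrm{SGS}} = C\,\Delta^{2}\,|\bar{S}|,\qquad
C = C_{1}\,|F_{CS}|^{3/2},\qquad
F_{CS} = \frac{Q}{E},
```
```latex
Q = \tfrac{1}{2}\left(\bar{W}_{ij}\bar{W}_{ij} - \bar{S}_{ij}\bar{S}_{ij}\right),\qquad
E = \tfrac{1}{2}\left(\bar{W}_{ij}\bar{W}_{ij} + \bar{S}_{ij}\bar{S}_{ij}\right),
```

where S̄_ij and W̄_ij are the resolved strain-rate and vorticity tensors and C₁ is a fixed constant (≈ 1/22 in Kobayashi's original paper), so the model "parameter" C is computed locally from the velocity gradient without any dynamic averaging.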
  • 25. Computational area — entire downtown Tokyo: a 10 km × 10 km area covering the major part of Tokyo, including Shinjuku-ku, Chiyoda-ku, Minato-ku, Meguro-ku, and Chuo-ku. Building data: Pasco Co. Ltd. TDM 3D. Achieved 0.592 Petaflops using over 4,000 GPUs (15% efficiency). (Map: ©2012 Google, ZENRIN.) Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
  • 26. Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
  • 27. Area around the Metropolitan Government Building — flow profile at 25 m height above the ground, wind over a 640 m × 960 m area. (Map data ©2012 Google, ZENRIN.) Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
  • 28. Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
  • 29. Current weather forecasts use 5 km resolution (inaccurate cloud simulation); the ASUCA typhoon simulation on TSUBAME2.0 runs at 500 m resolution on a 4792×4696×48 grid using 437 GPUs (x1000 the resolution).
  • 30.
  • 31. CFD analysis over a car body — calculation conditions: number of grid points 3,623,878,656 (3,072 × 1,536 × 768); grid resolution 4.2 mm (13 m × 6.5 m × 3.25 m domain); number of GPUs 288 (96 nodes); at 60 km/h.
  • 32. LBM, DrivAer body (BMW–Audi), Lehrstuhl für Aerodynamik und Strömungsmechanik, Technische Universität München; 3,000 × 1,500 × 1,500 grid, Re = 1,000,000.
  • 33.
  • 34.
  • 39. Towards TSUBAME3.0 — interim upgrade of TSUBAME2.0 to 2.5 (early fall 2013): replace the TSUBAME2.0 GPUs, NVIDIA Fermi M2050, with Kepler K20X (3 x 1408 = 4,224 GPUs in the TSUBAME2.0 compute-node configuration). SFP/DFP peak goes from 4.8 PF / 2.4 PF => 17 PF / 5.7 PF (c.f. the K Computer: 11.2 / 11.2). Accelerates important apps with considerable improvement; a significant capacity improvement at low cost and without a power increase, in summer 2013. TSUBAME3.0 to follow in 2H2015.
  • 40. TSUBAME2.0 ⇒ 2.5 Thin Node Upgrade — HP SL390G7 (developed for TSUBAME 2.0, modified for 2.5; productized as HP ProLiant SL390s). Per node: peak 4.08 TFlops, ~800 GB/s mem BW, 80 Gbps NW (InfiniBand QDR x2), ~1 kW max. GPU: NVIDIA Kepler K20X x 3 (1310 GFlops, 6 GB memory per GPU), replacing the Fermi M2050 (M2050: 1039/515 GFlops SFP/DFP; K20X: 3950/1310 GFlops). CPU: Intel Westmere-EP 2.93 GHz x 2. Multiple I/O chips, 72 PCIe lanes (16 x 4 + 4 x 2) — 3 GPUs + 2 IB QDR. Memory: 54 or 96 GB DDR3-1333. SSD: 60 GB x 2 or 120 GB x 2.
  • 41. 2013: TSUBAME2.5 is No.1 in Japan in single-precision FP at 17 Petaflops — total 17.1 Petaflops SFP / 5.76 Petaflops DFP, vs. all other university centers COMBINED (~9 Petaflops SFP) and the K Computer (11.4 Petaflops SFP/DFP).
  • 42. TSUBAME2.0 vs. TSUBAME2.5 (Thin Node x 1408 units; a consistency check of the totals follows below):
    Node machine: HP ProLiant SL390s (no change)
    CPU: Intel Xeon X5670 (6-core 2.93 GHz, Westmere) x 2 (no change)
    GPU — TSUBAME2.0: NVIDIA Tesla M2050 x 3 (448 CUDA cores (Fermi), SFP 1.03 TFlops, DFP 0.515 TFlops, 3 GiB GDDR5, 150 GB/s peak / ~90 GB/s STREAM memory BW); TSUBAME2.5: NVIDIA Tesla K20X x 3 (2688 CUDA cores (Kepler), SFP 3.95 TFlops, DFP 1.31 TFlops, 6 GiB GDDR5, 250 GB/s peak / ~180 GB/s STREAM memory BW)
    Node performance (incl. CPU turbo boost) — TSUBAME2.0: SFP 3.40 TFlops, DFP 1.70 TFlops, ~500 GB/s peak / ~300 GB/s STREAM memory BW; TSUBAME2.5: SFP 12.2 TFlops, DFP 4.08 TFlops, ~800 GB/s peak / ~570 GB/s STREAM memory BW
    Total system performance — TSUBAME2.0: SFP 4.80 PFlops, DFP 2.40 PFlops, peak ~0.70 PB/s / STREAM ~0.440 PB/s memory BW; TSUBAME2.5: SFP 17.1 PFlops (x3.6), DFP 5.76 PFlops (x2.4), peak ~1.16 PB/s / STREAM ~0.804 PB/s memory BW (x1.8)
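A quick consistency check of the TSUBAME2.5 totals (the ~0.07 TF DFP per Westmere socket is an approximation of mine, not a figure from the slide):

```latex
1408 \times 3 \times 1.31~\mathrm{TF} \approx 5.53~\mathrm{PF}\;(\text{GPUs, DFP}),\qquad
5.53 + 1408 \times 2 \times 0.07~\mathrm{TF} \approx 5.7~\mathrm{PF}\;(\text{DFP total})
```
```latex
1408 \times 3 \times 3.95~\mathrm{TF} \approx 16.7~\mathrm{PF}\;(\text{GPUs, SFP})
\;\Rightarrow\; \approx 17.1~\mathrm{PF}\;(\text{SFP total, with CPUs})
```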
  • 43. Phase-field simulation of dendritic solidification [Shimokawabe, Aoki et al.] — weak scaling on TSUBAME (single precision), mesh size per (1 GPU + 4 CPU cores): 4096 x 162 x 130. TSUBAME 2.5: 3.444 PFlops (3,968 GPUs + 15,872 CPU cores, 4,096 x 5,022 x 16,640 mesh). TSUBAME 2.0: 2.000 PFlops (4,000 GPUs + 16,000 CPU cores, 4,096 x 6,480 x 13,000 mesh). Motivation: developing lightweight, high-strength materials by controlling microstructure, towards a low-carbon society. Peta-scale phase-field simulations can capture the multiple dendritic growth during solidification required for evaluating new materials. 2011 ACM Gordon Bell Prize, Special Achievements in Scalability and Time-to-Solution.
  • 44. Peta-scale stencil application: a large-scale LES wind simulation using the Lattice Boltzmann Method [Onodera, Aoki et al.] — weak scalability in single precision for a 10 km x 10 km area of metropolitan Tokyo (N = 192 x 256 x 256 per GPU; 10,080 x 10,240 x 512 overall on 4,032 GPUs). (Figure: performance [TFlops] vs. number of GPUs, TSUBAME 2.5 vs. 2.0, both with overlap.) TSUBAME 2.5: 1142 TFlops on 3,968 GPUs (288 GFlops/GPU), x1.93 over TSUBAME 2.0: 149 TFlops on 1,000 GPUs (149 GFlops/GPU). These peta-scale simulations were executed under the TSUBAME Grand Challenge Program, Category A, in fall 2012. An LES wind simulation of a 10 km × 10 km area at 1 m resolution had never been done before in the world; we achieved 1.14 PFLOPS using 3,968 GPUs on the TSUBAME 2.5 supercomputer.
  • 46. Application performance, TSUBAME2.0 vs. TSUBAME2.5 (boost ratio):
    Top500/Linpack (PFlops): 1.192 → 2.843 (x2.39)
    Green500/Linpack (GFlops/W): 0.958 → > 2.400 (> x2.50)
    Semi-definite programming / nonlinear optimization (PFlops): 1.019 → 1.713 (x1.68)
    Gordon Bell dendrite stencil (PFlops): 2.000 → 3.444 (x1.72)
    LBM LES whole-city airflow (PFlops): 0.600 → 1.142 (x1.90)
    Amber 12 pmemd, 4 nodes / 8 GPUs (nsec/day): 3.44 → 11.39 (x3.31)
    GHOSTM genome homology search (sec): 19,361 → 10,785 (x1.80)
    MEGADOC protein docking (vs. 1 CPU core): 37.11 → 83.49 (x2.25)
  • 47. TSUBAME evolution towards exascale and Extreme Big Data — (roadmap figure: TSUBAME2.5 at 5.7 PF with 250 TB of fast I/O at 300 GB/s (~30 PB/day); Graph500 No. 3 (2011) and other awards; TSUBAME3.0 in 2015H2 at 25–30 PF, with 5~10 PB of fast I/O at 1–10 TB/s across Phase 1 / Phase 2, > 100 million IOPS, ~1 ExaB/day.) Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
  • 48. DoE Exascale Parameters — x1000 power efficiency in 10 years (the "2010" column shows Jaguar / TSUBAME2.0):
    System peak: 2 PetaFlops → 100–200 PetaFlops (2015) → 1 ExaFlop (2020)
    Power: 6 MW / 1.3 MW → 15 MW → 20 MW
    System memory: 0.3 PB / 0.1 PB → 5 PB → 32–64 PB
    Node performance: 125 GF / 1.6 TF → 0.5 TF or 7 TF → 1 TF or 10 TF
    Node memory BW: 25 GB/s / 0.5 TB/s → 0.1 TB/s or 1 TB/s → 0.4 TB/s or 4 TB/s
    Node concurrency: 12 / O(1000) → O(100) or O(1000) → O(1000) or O(10000)
    System size (nodes): 18,700 / 1,442 → 5,000–50,000 → 100,000–1 million (towards a billion cores)
    Total node interconnect BW: 1.5 GB/s / 8 GB/s → 20 GB/s → 200 GB/s
    MTTI: O(days) → O(1 day) → O(1 day)
  • 49. Challenges of Exascale (FLOPS, bytes, … at 10^18!) — various physical limitations surface all at once:
    • # CPU cores: ~1 billion, at low power
    • # nodes: 100K~xM (c.f. total smartphones sold globally = 400 million; the K Computer ~100K nodes; Google ~1 million servers)
    • Memory: x00 PB~ExaB (c.f. total memory of the ~300 million PCs shipped globally in 2011 ~ 1 ExaB; BTW 2^64 ≈ 1.8x10^19 = 18 ExaB)
    • Storage: x ExaB (c.f. Google storage ~2 Exabytes: 200 million users x 7 GB+)
    All of this at 20 MW (50 GFlops/W), with reliability (MTTI = days), ease of programming (a billion cores?), and acceptable cost… in 2020?!
  • 50. Focused research towards TSUBAME3.0 and beyond, towards exascale:
    • Green computing: ultra power-efficient HPC
    • High-radix bisection networks — HW, topology, routing algorithms, placement…
    • Fault tolerance — group-based hierarchical checkpointing, fault prediction, hybrid algorithms
    • Scientific "extreme" big data — ultra-fast I/O, Hadoop acceleration, large graphs
    • New memory systems — pushing the envelope of low power vs. capacity vs. BW; exploiting the deep hierarchy with new algorithms to decrease bytes/flop
    • Post-petascale programming — OpenACC and other manycore programming substrates, task parallelism
    • Scalable algorithms for manycore — apps/system/HW co-design
  • 51. JST-CREST "Ultra Low Power (ULP)-HPC" Project, 2007–2012 — combines ultra multi-core (slow & parallel, ultra low power), SIMD-vector (GPGPU, etc.), ULP-HPC networks, and new memory devices (MRAM, PRAM, Flash, etc.) with auto-tuning for performance & power. ABCLibScript directives mark algorithm-selection regions that are auto-tuned before execution: for each target region (e.g. algorithm 1 vs. algorithm 2), a cost-definition function of the input variables — such as (2.0d0*CacheS*NB)/(3.0d0*NPrc) or (4.0d0*ChcheS*dlog(NB))/(2.0d0*NPrc) — gives a model estimate of the run time, which is then fused with measured run times in a Bayesian way. With the prior
  $$y_i \sim N(\mu_i, \sigma_i^2), \qquad \mu_i \mid \beta, \sigma_i^2 \sim N\!\left(x_i^T\beta,\; \sigma_i^2/\kappa_0\right), \qquad \sigma_i^2 \sim \mathrm{Inv}\text{-}\chi^2(\nu_0, \sigma_0^2),$$
  the posterior predictive distribution after n measurements is
  $$y_i \mid (y_{i1}, \ldots, y_{in}) \sim t_{\nu_n}\!\left(\mu_n,\; \sigma_n^2\,(1 + 1/\kappa_n)\right),$$
  $$\kappa_n = \kappa_0 + n, \quad \nu_n = \nu_0 + n, \quad \mu_n = \frac{\kappa_0\, x_i^T\beta + n\,\bar{y}_i}{\kappa_n},$$
  $$\nu_n \sigma_n^2 = \nu_0\sigma_0^2 + \sum_m (y_{im} - \bar{y}_i)^2 + \frac{\kappa_0\, n\,(\bar{y}_i - x_i^T\beta)^2}{\kappa_n}, \qquad \bar{y}_i = \frac{1}{n}\sum_m y_{im}.$$
  The tuner then chooses the operating point that optimizes the power-performance trade-off — x10 power efficiency from power-aware, optimizable applications, towards a x1000 improvement in 10 years (a small code sketch of the update follows below).
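A minimal sketch of the posterior-predictive update used to fuse a cost-model estimate with n measurements. All variable names, the prior parameters (kappa0, nu0, sigma0_2), and the example numbers are mine, chosen for illustration only.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Posterior-predictive parameters of the normal / inverse-chi-squared model:
// the prior mean comes from the cost model (x_i^T beta); measurements refine it.
struct Predictive {
    double mu;      // location of the Student-t predictive distribution
    double scale2;  // squared scale: sigma_n^2 * (1 + 1/kappa_n)
    double nu;      // degrees of freedom
};

Predictive update(double model_est,                       // x_i^T beta from the cost model
                  double kappa0, double nu0, double sigma0_2,
                  const std::vector<double>& y) {         // measured run times
    const double n = static_cast<double>(y.size());
    double ybar = 0.0;
    for (double v : y) ybar += v;
    ybar /= n;

    double ss = 0.0;                                      // sum of squared deviations
    for (double v : y) ss += (v - ybar) * (v - ybar);

    const double kappa_n = kappa0 + n;
    const double nu_n    = nu0 + n;
    const double mu_n    = (kappa0 * model_est + n * ybar) / kappa_n;
    const double nu_sig  = nu0 * sigma0_2 + ss
                         + kappa0 * n * (ybar - model_est) * (ybar - model_est) / kappa_n;
    const double sigma_n2 = nu_sig / nu_n;

    return {mu_n, sigma_n2 * (1.0 + 1.0 / kappa_n), nu_n};
}

int main() {
    // Hypothetical example: the cost model predicts 1.20 s; three measured runs.
    std::vector<double> runs = {1.05, 1.10, 1.08};
    Predictive p = update(1.20, /*kappa0=*/1.0, /*nu0=*/1.0, /*sigma0_2=*/0.04, runs);
    std::printf("predictive: mu=%.3f scale^2=%.4f nu=%.1f\n", p.mu, p.scale2, p.nu);
    return 0;
}
```

The region's algorithm (or clock/voltage setting) with the better predictive run time or energy would then be selected before launch.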
  • 52. Aggressive power saving in HPC — methodologies compared for enterprise/business clouds vs. HPC:
    Server consolidation: Good for clouds, NG for HPC
    DVFS (dynamic voltage/frequency scaling): Good for clouds, Poor for HPC
    New devices: Poor for clouds (cost & continuity), Good for HPC
    New HW & SW architectures: Poor for clouds (cost & continuity), Good for HPC
    Novel cooling: Limited for clouds (cost & continuity), Good for HPC (high thermal density)
  • 53. How do we achieve x1000? Process shrink (x100) × many-core GPU usage (x5) × DVFS & other low-power SW (x1.5) × efficient cooling (x1.4) ≈ x1000 !!! (ULP-HPC Project 2007–12; Ultra Green Supercomputing Project 2011–15.) The factors multiply out as shown below.
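The individual factors do indeed compound to roughly three orders of magnitude:

```latex
100 \times 5 \times 1.5 \times 1.4 = 1050 \approx 10^{3}
```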
  • 54. Statistical power modeling of GPUs [IEEE IGCC'10] — estimates GPU power consumption statistically with a linear regression model, $p = \sum_{i=1}^{n} \alpha_i c_i + \varepsilon$, using GPU performance counters $c_i$ as explanatory variables, validated against average power measured with a high-resolution power meter. High accuracy (avg. error 4.7%), and accurate even under DVFS; overtraining is prevented by ridge regression, with the optimal parameters determined by cross-fitting. A linear model shows sufficient accuracy. Future: model-based power optimization, with the possibility of optimizing exascale systems with O(10^8) processors. (A regression sketch follows below.)
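A minimal sketch of such a counter-based, ridge-regularized power model. The counter names, sample values, regularization strength, and the gradient-descent fitting procedure are all illustrative assumptions of mine; the paper's actual counters and solver may differ.

```cpp
#include <cstdio>
#include <vector>

// Fit power ~= sum_i alpha_i * counter_i with an L2 (ridge) penalty,
// using plain batch gradient descent on the regularized squared error.
std::vector<double> ridge_fit(const std::vector<std::vector<double>>& C,  // counters per sample
                              const std::vector<double>& power,           // measured watts
                              double lambda, double lr, int iters) {
    const std::size_t n = C.size(), d = C[0].size();
    std::vector<double> alpha(d, 0.0);
    for (int it = 0; it < iters; ++it) {
        std::vector<double> grad(d, 0.0);
        for (std::size_t s = 0; s < n; ++s) {
            double pred = 0.0;
            for (std::size_t i = 0; i < d; ++i) pred += alpha[i] * C[s][i];
            const double err = pred - power[s];
            for (std::size_t i = 0; i < d; ++i) grad[i] += 2.0 * err * C[s][i] / n;
        }
        for (std::size_t i = 0; i < d; ++i)
            alpha[i] -= lr * (grad[i] + 2.0 * lambda * alpha[i]);  // ridge term shrinks coefficients
    }
    return alpha;
}

int main() {
    // Hypothetical normalized counters per sample: {constant, sm_activity, dram_accesses}.
    std::vector<std::vector<double>> counters = {
        {1.0, 0.20, 0.10}, {1.0, 0.55, 0.30}, {1.0, 0.90, 0.70}, {1.0, 0.70, 0.20}};
    std::vector<double> watts = {95.0, 140.0, 200.0, 160.0};   // measured average power
    std::vector<double> a = ridge_fit(counters, watts, /*lambda=*/0.01, /*lr=*/0.1, /*iters=*/5000);
    std::printf("alpha = [%.1f, %.1f, %.1f] W\n", a[0], a[1], a[2]);
    return 0;
}
```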
  • 56. TSUBAME-KFC: ultra-green supercomputer testbed [2011–2015] — fluid submersion cooling + outdoor air cooling + high-density GPU supercomputing in a 20-foot container (16 m²). Compute nodes: NEC/SMC 1U server x 40, each with Intel Ivy Bridge 2.1 GHz 6-core x 2, NVIDIA Tesla K20X GPU x 4, 64 GB DDR3 memory, 120 GB SSD, and 4x FDR InfiniBand (56 Gbps). Total peak: 210 TFlops (DP), 630 TFlops (SP). Cooling path: processors at 80~90 °C ⇒ coolant oil (Spectrasyn8) at 35~45 °C in the GRC submersion rack ⇒ water at 25~35 °C via heat exchanger ⇒ outdoor air via the cooling tower. Targets: world's top power efficiency (> 3 GFlops/W), average PUE 1.05, lower component power, and field-testing the ULP-HPC results.
  • 57. TSUBAME-KFC, towards TSUBAME3.0 and beyond — shooting for #1 on the Nov. 2013 Green500!
  • 58. Machine — power; Linpack perf (PF); Linpack MFLOPS/W; total mem BW TB/s (STREAM); mem BW MB/s per W; factor (incl. cooling):
    Earth Simulator 1 — 10 MW; 0.036; 3.6; 160; 16; 13,400
    Tsubame1.0 (2006Q1) — 1.8 MW; 0.038; 21; 13; 7.2; 2,368
    ORNL Jaguar (XT5, 2009Q4) — ~9 MW; 1.76; 196; 432; 48; 256
    Tsubame2.0 (2010Q4) — 1.8 MW; 1.2; 667; 440; 244; 75
    K Computer (2011Q2) — ~16 MW; 10; 625; 3,300; 206; 80
    BlueGene/Q (2012Q1) — ~12 MW?; 17; ~1,400; 3,000; 250; ~35
    TSUBAME2.5 (2013Q3) — 1.4 MW; ~3; ~2,100; 802; 572; ~24
    Tsubame3.0 (2015Q4~2016Q1) — 1.5 MW; ~20; ~13,000; 6,000; 4,000; ~4
    EXA (2019~20) — 20 MW; 1,000; 50,000; 100K; 5,000; 1
    (Additional improvement factors noted on the slide: x31.6, x34, ~x20, ~x13.7.)
  • 59. Extreme Big Data (EBD): Next-Generation Big Data Infrastructure Technologies Towards Yottabytes/Year — Principal Investigator: Satoshi Matsuoka, Global Scientific Information and Computing Center, Tokyo Institute of Technology.
  • 60. The current "Big Data" is not really that big…
    • Typical "real" definition: "mining people's private data to make money"
    • Corporate data usually sit in data-warehouse silos of limited volume — gigabytes to terabytes, seldom petabytes
    • Processing involves simple O(n) algorithms, or ones that can be accelerated with DB-inherited indexing algorithms
    • Executed on re-purposed commodity "web" servers linked with 1 Gbps networks running Hadoop/HDFS
    • A vicious cycle of stagnation in innovation…
    • NEW: breaking down of silos ⇒ convergence of supercomputing with Extreme Big Data
  • 61. But "Extreme Big Data" will change everything
    • "Breaking down of silos" (Rajeeb Hazra, Intel VP of Technical Computing)
    • Already happening in science & engineering thanks to the Open Data movement
    • More complex analysis algorithms: O(n log n), O(m x n), …
    • Will become the NORM, for competitiveness reasons.
  • 62. We will have tons of unknown genes [slide courtesy Yutaka Akiyama @ Tokyo Tech]. Metagenome analysis: directly sequencing uncultured microbiomes obtained from a target environment and analyzing the sequence data — finding novel genes from unculturable microorganisms, and elucidating the composition of species/genes in an environment. Examples of microbiomes: gut microbiome, human body, soil, sea.
  • 63. Results from the Akiyama group @ Tokyo Tech — ultra-high-sensitivity "big data" metagenome sequence analysis of the human oral microbiome. Required > 1 million node-hours on the K computer; the world's most sensitive sequence analysis (based on an amino-acid similarity matrix); discovered at least three microbiome clusters with functional differences (integrating 422 experimental samples taken from 9 different oral sites). Throughput: 572.8 M reads/hour on 82,944 nodes (663,552 cores) of the K computer (2012). (Figure: metabolic pathway map; clusters for the inner side of the dental arch, the outer side of the dental arch, and dental plaque.)
  • 65. Extremely "big" graphs — large-scale graphs arise in many fields: the US road network (24 million vertices, 58 million edges); the Twitter follow-ship network (61.6 million vertices, 1.47 billion edges); the neuronal network of the Human Brain Project (89 billion vertices, 100 trillion edges); cyber-security (15 billion log entries/day). Goal: fast and scalable graph processing using HPC.
  • 66. Graph scale landscape — (scatter plot: number of vertices, log2(n), vs. number of edges, log2(m); markers for the Graph500 problem classes Toy, Mini, Small, Medium, Large, and Huge, the symbolic network of the Human Brain Project (~1 trillion edges), Twitter (tweets/day), and the USA road networks USA-road-d.NY.gr, USA-road-d.LKS.gr, USA-road-d.USA.gr; gridlines at 1 billion / 1 trillion nodes and edges.) Reference points: the K computer with 65,536 nodes reaches 5,524 GTEPS on Graph500, while an Android tablet (Tegra 3, 1.7 GHz, 1 GB RAM) reaches 0.15 GTEPS at 64.12 MTEPS/W.
  • 67. Towards continuous billion-scale social simulation with real-time streaming data (Toyotaro Suzumura, IBM / Tokyo Tech)
    Applications — target area: the planet (OpenStreetMap), 7 billion people
    Input data — road network (OpenStreetMap) for the planet: 300 GB (XML); trip data for 7 billion people: 10 KB (1 trip) x 7 billion = 70 TB; real-time streaming data (e.g. social sensors, physical data)
    Simulated output for 1 iteration — 700 TB
  • 68. Graph500 "Big Data" benchmark — BFS on a Kronecker graph with quadrant probabilities A: 0.57, B: 0.19, C: 0.19, D: 0.05 (a generator sketch follows below). "Graph 500 Takes Aim at a New Kind of HPC," November 15, 2010 — Richard Murphy (Sandia NL => Micron): "I expect that this ranking may at times look very different from the TOP500 list. Cloud architectures will almost certainly dominate a major chunk of the list." Reality: Top500 supercomputers dominate, with no cloud IDCs on the list at all; TSUBAME2.0 was #3 (Nov. 2011) and #4 (Jun. 2012).
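For illustration, a minimal sketch of a recursive Kronecker (R-MAT-style) edge generator using the A/B/C/D probabilities from the slide. The real Graph500 reference generator also permutes vertex labels, adds noise to the probabilities, and takes scale/edge-factor parameters; this sketch omits all of that.

```cpp
#include <cstdint>
#include <cstdio>
#include <random>
#include <utility>

// Draw one directed edge of a 2^scale-vertex Kronecker graph by recursively
// picking one of four adjacency-matrix quadrants with probabilities A, B, C, D.
std::pair<std::uint64_t, std::uint64_t> kronecker_edge(int scale, std::mt19937_64& rng) {
    const double A = 0.57, B = 0.19, C = 0.19;      // D = 1 - A - B - C = 0.05
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::uint64_t src = 0, dst = 0;
    for (int level = 0; level < scale; ++level) {
        const double r = uni(rng);
        std::uint64_t bit_src, bit_dst;
        if (r < A)              { bit_src = 0; bit_dst = 0; }   // quadrant A
        else if (r < A + B)     { bit_src = 0; bit_dst = 1; }   // quadrant B
        else if (r < A + B + C) { bit_src = 1; bit_dst = 0; }   // quadrant C
        else                    { bit_src = 1; bit_dst = 1; }   // quadrant D
        src = (src << 1) | bit_src;
        dst = (dst << 1) | bit_dst;
    }
    return {src, dst};
}

int main() {
    std::mt19937_64 rng(42);
    const int scale = 20;   // 2^20 vertices; Graph500 would generate 16 * 2^scale edges (edge factor 16)
    for (int e = 0; e < 5; ++e) {                    // print a few sample edges
        auto [u, v] = kronecker_edge(scale, rng);
        std::printf("%llu -> %llu\n", (unsigned long long)u, (unsigned long long)v);
    }
    return 0;
}
```

The skew of A over B/C/D is what gives the generated graph its heavy-tailed, "social-network-like" degree distribution.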
  • 69. Supercomputer vs. cloud datacenter networks — Tokyo Tech TSUBAME2.0 (#4 Top500, 2010): ~1,500 compute & storage nodes on a full-bisection, multi-rail optical network with advanced silicon photonics (40G on a single CMOS die, 1490 nm DFB, 100 km fiber); injection 80 Gbps/node, bisection 220 Tbps. A major northern-Japanese cloud datacenter (2013): 8 zones of ~700 nodes each (5,600 nodes total) behind Juniper EX4200 zone switches (Virtual Chassis, 2 per zone), Juniper EX8208 aggregation and Juniper MX480 routers to the Internet over 10 GbE / LACP; injection 1 Gbps/node, bisection 160 Gbps — roughly a x1000 difference in bisection!
  • 70. But what does "220 Tbps" mean? Global IP traffic, 2011–2016 (source: Cisco), in PB per month with the average bitrate in Tbps in parentheses:
    Fixed Internet: 23,288 (71.9) | 32,990 (101.8) | 40,587 (125.3) | 50,888 (157.1) | 64,349 (198.6) | 81,347 (251.1) — CAGR 2011–2016: 28%
    Managed IP: 6,849 (21.1) | 9,199 (28.4) | 11,846 (36.6) | 13,925 (43.0) | 16,085 (49.6) | 18,131 (56.0) — CAGR: 21%
    Mobile data: 597 (1.8) | 1,252 (3.9) | 2,379 (7.3) | 4,215 (13.0) | 6,896 (21.3) | 10,804 (33.3) — CAGR: 78%
    Total IP traffic: 30,734 (94.9) | 43,441 (134.1) | 54,812 (169.2) | 69,028 (213.0) | 87,331 (269.5) | 110,282 (340.4) — CAGR: 29%
  The TSUBAME2.0 network has TWICE the capacity of the global Internet, which is used by 2.1 billion people.
  • 71. "Convergence" at future extreme scale for computing and data (in clouds?) — HPC grows ~x1000 in 10 years (CAGR ~= 100%), while IDC grows ~x30 in 10 years (CAGR ~= 30–40%, with server unit sales flat, i.e. replacement demand). (Source: "Assessing trends over time in performance, costs, and energy use for servers," Intel, 2009.) The 10-year multipliers follow directly from the CAGRs, as shown below.
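Compounding the stated annual growth rates over a decade gives the quoted multipliers (the x30 corresponds to the upper end of the 30–40% range):

```latex
2^{10} = 1024 \approx \times 1000 \quad (\mathrm{CAGR} \approx 100\%),
\qquad
1.4^{10} \approx 29 \approx \times 30 \quad (\mathrm{CAGR} \approx 40\%)
```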
  • 72. What does this all mean?
    • "Leveraging of mainframe technologies in HPC has been dead for some time."
    • But will leveraging cloud/mobile be sufficient?
    • NO! They are already falling behind, and will stay perpetually behind — cloud CAGR ~30% vs. HPC ~100% (all the data support it); stagnation in network, storage, scaling, …
    • Rather, HPC will be the technology driver for future Big Data, for cloud/mobile to leverage — rather than repurposed standard servers.
  • 73. Future "Extreme Big Data" — NOT mining terabytes of silo data, but peta- to zettabytes of data in ultra-high-bandwidth streams, highly unstructured and irregular, with complex correlations between data from multiple sources. Extreme capacity, bandwidth, and compute are all required.
  • 74. Extreme Big Data is not just traditional HPC!!! [slide courtesy Alok Choudhary, Northwestern U] — analysis of required system properties. (Radar chart comparing Extreme-Scale Computing, Big Data Analytics, and a BDEC knowledge-discovery engine along the axes: processor speed, OPS, memory/ops, algorithmic variety, power-optimization opportunities, communication-pattern variability, communication latency tolerance, approximate computations, read performance, write performance, local persistent storage.)
  • 75. EBD research scheme — future non-silo Extreme Big Data apps (ultra-large-scale graphs and social infrastructures, large-scale metagenomics, massive sensors and data assimilation in weather prediction) are co-designed with the EBD system software, including the EBD object system (EBD Bag, Cartesian plane, KVS / EBD KVS, graph store). (Architecture figure: an exascale big data HPC node with a high-powered main CPU and low-power CPUs on a TSV interposer and PCB, 4~6 HBM channels at 2 Tbps, 1.5 TB/s DRAM & NVM bandwidth, NVM/Flash plus DRAM, 30 PB/s I/O bandwidth possible, towards 1 Yottabyte/year.) The convergent architecture (phases 1~4) combines large-capacity NVM with high-bisection networks, bridging cloud IDCs (very low BW & efficiency) and compute/batch-oriented supercomputers.
  • 77. Preliminary I/O performance evaluation on GPU and NVRAM — how should local storage be designed for next-generation supercomputers? We designed a local I/O prototype using 16 mSATA SSDs on a RAID card attached to the motherboard: capacity 4 TB, read bandwidth 8 GB/s. (Charts: I/O performance of multiple mSATA SSDs — bandwidth [MB/s] vs. number of SSDs for raw devices and RAID0 with 4 KB / 64 KB / 1 MB stripes, reaching ~7.39 GB/s from 16 mSATA SSDs with RAID0 enabled; and I/O performance from the GPU to multiple mSATA SSDs — throughput [GB/s] vs. matrix size [GB], reaching ~3.06 GB/s from 8 mSATA SSDs to the GPU.) A quick consistency check of the aggregate bandwidth follows below.
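The measured aggregate is consistent with the per-device bandwidth implied by the prototype (the ~0.5 GB/s per-SSD figure is derived from the stated 8 GB/s for 16 devices, not given directly on the slide):

```latex
\frac{8~\mathrm{GB/s}}{16} \approx 0.5~\mathrm{GB/s}\ \text{per SSD},
\qquad
16 \times 0.5~\mathrm{GB/s} = 8~\mathrm{GB/s} \;\gtrsim\; 7.39~\mathrm{GB/s}\ \text{measured with RAID0}
```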
  • 78. Algorithm kernels on EBD: large-scale BFS using NVRAM.
    1. Introduction — large-scale graph processing appears in many domains while DRAM demands keep increasing; flash devices are spreading (pros: price per bit, energy consumption; cons: latency, throughput). Using NVRAM for large-scale graph processing may be possible with minimal performance degradation.
    2. Hybrid BFS — switch between the top-down and bottom-up approaches based on the number of frontier vertices n_frontier, the total number of vertices n_all, and parameters α and β (see the sketch below).
    3. Proposal — (1) offload the data touched by small accesses to NVRAM; (2) run BFS reading that data from NVRAM. (C.f. Pearce et al.: 13x larger datasets at 52 MTEPS with 1 TB DRAM and 12 TB NVRAM.)
    4. Evaluation — (chart: GTEPS vs. switching parameter α for a 130M-vertex, 2.1G-edge graph) DRAM only (β = 10α): 5.2 GTEPS; DRAM + SSD (β = 0.1α): 2.8 GTEPS (47.1% down). We could halve the DRAM size at the cost of 47.1% performance degradation, and we are working on multiplexed I/O to improve NVRAM I/O performance.
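A minimal sketch of the direction-switching idea in hybrid (direction-optimizing) BFS: run top-down while the frontier is small and switch to bottom-up once it grows past a threshold controlled by a parameter like α. The simple vertex-count threshold, the adjacency-list graph, and the parameter defaults here are illustrative assumptions, not the actual data structures or heuristic of the work on the slide.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hybrid BFS: top-down expands the frontier outward; bottom-up scans the
// unvisited vertices and checks whether any neighbor lies in the frontier.
std::vector<std::int64_t> hybrid_bfs(const std::vector<std::vector<int>>& adj, int root,
                                     double alpha = 16.0) {
    const int n = static_cast<int>(adj.size());
    std::vector<std::int64_t> parent(n, -1);
    std::vector<char> in_frontier(n, 0), in_next(n, 0);
    std::vector<int> frontier = {root};
    parent[root] = root;
    in_frontier[root] = 1;

    while (!frontier.empty()) {
        // Simplified heuristic: bottom-up pays off once the frontier covers
        // a sizable fraction of the graph (roughly n / alpha vertices here).
        const bool bottom_up = frontier.size() > static_cast<std::size_t>(n / alpha);
        std::vector<int> next;
        if (!bottom_up) {                              // top-down step
            for (int u : frontier)
                for (int v : adj[u])
                    if (parent[v] < 0) { parent[v] = u; next.push_back(v); in_next[v] = 1; }
        } else {                                       // bottom-up step
            for (int v = 0; v < n; ++v) {
                if (parent[v] >= 0) continue;
                for (int u : adj[v])
                    if (in_frontier[u]) { parent[v] = u; next.push_back(v); in_next[v] = 1; break; }
            }
        }
        for (int u : frontier) in_frontier[u] = 0;     // clear old frontier marks
        frontier.swap(next);
        in_frontier.swap(in_next);
        // (A full implementation would switch back to top-down once the
        //  frontier shrinks again, which is where the beta parameter comes in.)
    }
    return parent;
}

int main() {
    // Tiny illustrative undirected graph (edges stored in both directions).
    std::vector<std::vector<int>> adj = {{1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3, 5}, {4}};
    std::vector<std::int64_t> parent = hybrid_bfs(adj, 0, /*alpha=*/4.0);
    for (std::size_t v = 0; v < parent.size(); ++v)
        std::printf("parent[%zu] = %lld\n", v, (long long)parent[v]);
    return 0;
}
```

In the NVRAM setting, the rarely touched portions of the adjacency data (the "small accesses") are the natural candidates to offload to flash while the hot frontier structures stay in DRAM.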
  • 79. High-performance sorting — fast algorithms: distribution-based vs. comparison-based. Comparison of keys: classical N log(N) sorts (quick, merge, etc.) and bitonic sort; good for variable-length / short / long keys (e.g. alphabetic keys such as apple, apricot, banana, kiwi). Distribution (integer) sorts: MSD radix sort — doesn't have to examine all characters, relevant to computational genomics (A, C, G, T); LSD radix sort (as in Thrust) — high efficiency on small fixed-length keys, and GPUs are good at counting. Map-Reduce / Hadoop is easy to use but not that efficient. An efficient implementation must balance I/O and computation, scale well, and suit GPU nodes; the best hybrid approach is yet to be found. (An LSD radix sort sketch follows below.)
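For reference, a minimal sketch of an LSD (least-significant-digit) radix sort on fixed-width integer keys — the counting-based distribution pass is exactly the kind of operation GPUs and libraries like Thrust do well, though this sketch is plain sequential CPU code.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// LSD radix sort on 32-bit unsigned keys, 8 bits per pass.
// Each pass is a stable counting sort on one digit, least significant first.
void lsd_radix_sort(std::vector<std::uint32_t>& keys) {
    const int RADIX = 256;
    std::vector<std::uint32_t> buf(keys.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::vector<std::size_t> count(RADIX + 1, 0);
        for (std::uint32_t k : keys) ++count[((k >> shift) & 0xFF) + 1];       // histogram
        for (int d = 0; d < RADIX; ++d) count[d + 1] += count[d];              // prefix sums -> offsets
        for (std::uint32_t k : keys) buf[count[(k >> shift) & 0xFF]++] = k;    // stable scatter
        keys.swap(buf);
    }
}

int main() {
    std::vector<std::uint32_t> keys = {170, 45, 75, 90, 802, 24, 2, 66};
    lsd_radix_sort(keys);
    for (std::uint32_t k : keys) std::printf("%u ", k);
    std::printf("\n");
    return 0;
}
```

The histogram/prefix-sum/scatter structure maps naturally onto GPU primitives, which is why fixed-length integer keys sort so efficiently there.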
  • 80. Twitter network (application of the Graph500 benchmark) — the 2009 follow-ship network, 41 million vertices and 2.47 billion edges (directed (i, j) edges between user i and user j). Our NUMA-optimized BFS on a 4-way Xeon system: 69 ms per BFS ⇒ 21.28 GTEPS. BFS frontier sizes from source user 21,804,357 illustrate the six degrees of separation:
    Level 0: 1 (0.00%, cum. 0.00%)
    Level 1: 7 (0.00%, cum. 0.00%)
    Level 2: 6,188 (0.01%, cum. 0.01%)
    Level 3: 510,515 (1.23%, cum. 1.24%)
    Level 4: 29,526,508 (70.89%, cum. 72.13%)
    Level 5: 11,314,238 (27.16%, cum. 99.29%)
    Level 6: 282,456 (0.68%, cum. 99.97%)
    Level 7: 11,536 (0.03%, cum. 100.00%)
    Level 8: 673 (0.00%, cum. 100.00%)
    Level 9: 68 (0.00%, cum. 100.00%)
    Level 10: 19 (0.00%, cum. 100.00%)
    Level 11: 10 (0.00%, cum. 100.00%)
    Level 12: 5 (0.00%, cum. 100.00%)
    Level 13: 2 (0.00%, cum. 100.00%)
    Level 14: 2 (0.00%, cum. 100.00%)
    Level 15: 2 (0.00%, cum. 100.00%)
    Total: 41,652,230 (100.00%)
  • 81. 100,000-fold: EBD "convergent" system overview — (project-structure figure) Task 1 and Task 2: the EBD distributed object store (EBD Bag, Cartesian plane, KVS / EBD KVS) and graph store on 100,000 NVM extreme compute-and-data nodes with ultra-parallel & low-power I/O; Task 3: the EBD programming system; Task 4: EBD "converged" real-time resource scheduling; Tasks 5-1~5-3: EBD application co-design and validation (large-scale graphs and social-infrastructure apps, large-scale genomic correlation, and data assimilation for large-scale sensors and exascale atmospherics); Task 6: EBD performance modeling & evaluation. The EBD "convergent" supercomputer evolves from TSUBAME 2.0/2.5 to TSUBAME 3.0 — ~10 TB/s ⇒ ~100 TB/s ⇒ ~10 PB/s — with ultra-high-bandwidth, low-latency NVM and networks, processor-in-memory, and 3D stacking.
  • 82. Summary
    • TSUBAME1.0 -> 2.0 -> 2.5 -> 3.0 -> … — TSUBAME2.5 is No.1 in Japan at 17 Petaflops SFP, and a template for future supercomputers and IDC machines
    • TSUBAME3.0 in early 2016 — new supercomputing leadership: tremendous power efficiency, extreme big data, extremely high reliability
    • Lots of background R&D for TSUBAME3.0 and towards exascale — green computing (ULP-HPC & TSUBAME-KFC); Extreme Big Data, the convergence of HPC and IDC!; exascale resilience; programming with millions of cores; …
    • Please stay tuned! (Your continued support is appreciated.)