TSUBAME2.5 to 3.0 and
Convergence with Extreme
Big Data
Satoshi Matsuoka
Professor
Global Scientific Information and Computing (GSIC) Center
Tokyo Institute of Technology
Fellow, Association for Computing Machinery (ACM)
Rakuten Technology Conference 2013
2013/10/26
Tokyo, Japan
Supercomputers from the Past

Fast, Big, Special, Inefficient,
Evil device to conquer the world…
Let us go back to the mid ’70s
Birth of “microcomputers” and arrival of commodity computing (start of my career)
• Commodity 8-bit CPUs…
  – Intel 4004/8008/8080/8085, Zilog Z-80, Motorola 6800, MOS Tech. 6502, …
• Lead to hobbyist computing…
  – Evaluation boards: Intel SDK-80, Motorola MEK6800D2, MOS Tech. KIM-1, (in Japan) NEC TK-80, Fujitsu Lkit-8, …
  – System kits: MITS Altair 8800/680b, IMSAI 8080, Proc. Tech. SOL-20, SWTPC 6800, …
• & Lead to early personal computers
  – Commodore PET, Tandy TRS-80, Apple II
  – (in Japan): Hitachi Basic Master, NEC CompoBS / PC8001, Fujitsu FM-8, …
Supercomputing vs. Personal Computing in the late 1970s
• Hitachi Basic Master (1978)
  – “The first PC in Japan”
  – Motorola 6802, 1MHz, 16KB ROM, 16KB RAM
  – Linpack in BASIC: approx. 70-80 FLOPS (1/1,000,000)
• We got “simulation” done (in assembly language)
  – Nintendo NES (1982): MOS Technology 6502, 1MHz (same as the Apple II)
  – “Pinball” by Matsuoka & Iwata (now CEO of Nintendo)
    • Realtime dynamics + collision + lots of shortcuts
    • Average ~a few KFLOPS
Cf. Cray-1 (1976) running Linpack 10: 80-90 MFlops (est.)
Then things got accelerated
around the mid 80s to mid 90s
(rapid commoditization towards what we use now)
• PC CPUs: Intel 8086/286/386/486/Pentium (superscalar & fast FP x86), Motorola 68000/020/030/040, … to Xeons, GPUs, Xeon Phis
  – C.f. RISCs: SPARC, MIPS, PA-RISC, IBM Power, DEC Alpha, …
• Storage evolution: cassettes and floppies to HDDs, optical disks to Flash
• Network evolution: RS-232C to Ethernet, now to FDR InfiniBand
• PC (incl. I/O): IBM PC “clones” and Macintoshes: ISA to VLB to PCIe
• Software evolution: CP/M to MS-DOS to Windows, Linux, …
• WAN evolution: RS-232+modem+BBS to modem+Internet to ISDN/ADSL/FTTH broadband, DWDM backbone, LTE, …
• Internet evolution: email + ftp to Web, Java, Ruby, …
• Then clusters, Grid/Clouds, 3-D gaming, and the Top500 all started in the mid 90s(!), and commoditized supercomputing
Modern Day Supercomputers
• Now supercomputers “look like” IDC servers
• High-end COTS dominate
• Linux-based machines with a standard + HPC OSS software stack

[Timeline figure: 1957 … 2010 “Reclaimed No.1 Supercomputer Rank in the World” … 2011 … 2012]
Top Supercomputers vs. Global IDC
• K Computer (#1 2011-12), Riken AICS
  – Fujitsu SPARC64 VIIIfx (Venus) CPU
  – 88,000 nodes, 800,000 CPU cores, ~11 Petaflops (10^16)
  – 1.4 Petabytes memory, 13 MW power, 864 racks, 3000 m2
• Tianhe-2 (#1 2013), Guangzhou, China
  – 48,000 KNC Xeon Phi + 36,000 Ivy Bridge Xeon
  – 18,000 nodes, >3 million CPU cores, 54 Petaflops (10^16)
  – 0.8 Petabytes memory, 20 MW power, ??? racks, ??? m2
• #1 2012: IBM BlueGene/Q “Sequoia”, Lawrence Livermore National Lab
  – IBM PowerPC System-on-Chip
  – 98,000 nodes, 1.57 million cores, ~20 Petaflops
  – 1.6 Petabytes, 8 MW, 96 racks
• C.f. Amazon ~= 450,000 nodes, ~3 million cores
• DARPA study: a 2020 Exaflop (10^18) machine would need on the order of 100 million to 1 billion cores
Scalability and Massive Parallelism
• More nodes & cores => massive increase in parallelism => faster, “bigger” simulation; a qualitative difference
• Ideal linear scaling of performance with CPU cores (~= parallelism) is difficult to achieve
• Limitations in power, cost, and reliability; limitations in scaling
[Chart: performance vs. CPU cores, contrasting ideal linear scaling (GOOD!) with power/cost/reliability limits and scaling limits (BAD!)]
2006: TSUBAME1.0 as No.1 in Japan
• Total 85 TeraFlops, #7 on the Top500, June 2006
• > all university centers COMBINED (45 TeraFlops)
• > the Earth Simulator (40 TeraFlops, #1 2002~2004)
TSUBAME2.0, Nov. 1, 2010
“The Greenest Production Supercomputer in the World”
• TSUBAME 2.0: new development (32nm / 40nm silicon)
• Node level: >400GB/s Mem BW, 80Gbps NW BW, ~1KW max
• Intermediate (chassis/rack) level: >1.6TB/s Mem BW; >12TB/s Mem BW, 35KW max
• System level: >600TB/s Mem BW, 220Tbps NW bisection BW, 1.4MW max
Performance Comparison of CPU vs. GPU
[Charts: peak performance (GFLOPS, 0-1750) and memory bandwidth (GByte/s, 0-200), CPU vs. GPU]
x5-6 socket-to-socket advantage in both compute and memory bandwidth, at the same power
(200W GPU vs. 200W CPU+memory+NW+…)
TSUBAME2.0 Compute Node (Thin Node)
HP SL390G7 (developed for TSUBAME 2.0), productized as HP ProLiant SL390s
• GPU: NVIDIA Fermi M2050 x 3, 515 GFlops and 3 GByte memory per GPU
• CPU: Intel Westmere-EP 2.93GHz x 2 (12 cores/node)
• Multi I/O chips, 72 PCI-e lanes (16 x 4 + 4 x 2) --- 3 GPUs + 2 IB QDR
• Memory: 54 or 96 GB DDR3-1333; SSD: 60GB x 2 or 120GB x 2
• Network: InfiniBand QDR x 2 (80Gbps)
Per node: 1.6 Tflops, 400GB/s Mem BW, 80Gbps NW, ~1KW max
Total system: 2.4 PFlops, memory ~100TB, SSD ~200TB
TSUBAME2.0 Storage Overview
TSUBAME2.0 storage: 11PB total (7PB HDD, 4PB tape), on an InfiniBand QDR network for LNET and other services (QDR IB (x4) x 8 and x 20 links)
• Parallel file system volumes — “Global Work Space” #1-#3 (/work0, /work9, /work19) and “Scratch” (/gscr0): Lustre and GPFS with HSM (GPFS #1-#4) on SFA10k #1-#5; 3.6 PB at 30~60GB/s plus 2.4 PB HDD + ~4PB tape
• Home volumes: 1.2PB on SFA10k #6 (10GbE x 2), served as cNFS/Clustered Samba w/ GPFS and NFS/CIFS/iSCSI by BlueARC, for home directories, system applications, and iSCSI
• Node-local SSDs: “thin node SSD” and “fat/medium node SSD”, 250 TB, 300~500GB/s scratch
• Grid storage: 130 TB => 500TB~1PB
TSUBAME2.0 Storage Overview (usage view)
TSUBAME2.0 storage: 11PB (7PB HDD, 4PB tape), InfiniBand QDR network for LNET and other services
• Concurrent parallel I/O (e.g. MPI-IO) and read-mostly I/O (data-intensive apps, parallel workflows, parameter surveys): the “Global Work Space” #1-#3 parallel file system volumes (/work0, /work9, /work19)
• Fine-grained R/W I/O (checkpoints, temporary files, Big Data processing): the “Scratch” volume (/gscr0, Lustre, 3.6 PB) plus node-local SSDs (“thin node SSD”, “fat/medium node SSD”; 250 TB, 300GB/s scratch)
• Home storage for compute nodes and cloud-based campus storage services: home volumes, 1.2PB (cNFS/Clustered Samba w/ GPFS, NFS/CIFS/iSCSI by BlueARC, iSCSI)
• HPCI storage (130 TB => 500TB~1PB): data transfer service between SCs/CCs; long-term backup on 2.4 PB HDD + ~4PB tape (GPFS with HSM)
3500 Fiber Cables > 100Km
w/DFB Silicon Photonics
End-to-End 7.5GB/s, > 2us
Non-Blocking 200Tbps Bisection

2010: TSUBAME2.0 as No.1 in Japan
• Total 2.4 Petaflops, #4 on the Top500, Nov. 2010
• > all other Japanese centers on the Top500 COMBINED (2.3 PetaFlops)
TSUBAME Wins Awards…
“Greenest Production Supercomputer in the World”, the Green 500, Nov. 2010 and June 2011 (#4 Top500 Nov. 2010)
3 times more power efficient than a laptop!
TSUBAME Wins Awards…
ACM Gordon Bell Prize 2011: 2.0 Petaflops dendrite simulation
Special Achievements in Scalability and Time-to-Solution
“Peta-Scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer”

TSUBAME Wins Awards…
Commendation for Science & Technology by the Minister of Education 2012 (文部科学大臣表彰)
Prize for Science & Technology, Development Category
Development of the Greenest Production Peta-scale Supercomputer
Satoshi Matsuoka, Toshio Endo, Takayuki Aoki
Precise blood-flow simulation of an artery on TSUBAME2.0 (Bernaschi et al., IAC-CNR, Italy)
Personal CT scan + simulation => accurate diagnostics of cardiac illness
5 billion red blood cells + 10 billion degrees of freedom
MUPHY: multiphysics simulation of blood flow (Melchionna, Bernaschi et al.)
• Combined Lattice-Boltzmann (LB) simulation for plasma and Molecular Dynamics (MD) for red blood cells
• Realistic geometry (from a CAT scan); multiphysics simulation with the MUPHY software
  – Fluid: blood plasma, Lattice Boltzmann
  – Body: red blood cells, coupled via extended MD; RBCs are represented as ellipsoidal particles
  – The irregular mesh is partitioned with the PT-SCOTCH tool, considering the cutoff distance
• Two levels of parallelism: CUDA (on GPU) + MPI
  – 1 billion mesh nodes for the LB component
  – 100 million RBCs on 4000 GPUs
• 0.6 Petaflops; ACM Gordon Bell Prize 2011 Honorable Mention
Lattice-Boltzmann LES with a coherent-structure SGS model [Onodera & Aoki 2013]
• Coherent-structure Smagorinsky model: the model parameter is locally determined from the second invariant of the velocity gradient tensor (Q) and the energy dissipation (ε)
• Suited to turbulent flow around complex objects and to large-scale parallel computation
Computational Area – Entire Downtown Tokyo
• Major part of Tokyo, 10km x 10km, including Shinjuku-ku, Chiyoda-ku, Minato-ku, Meguro-ku, Chuo-ku (Shinjuku, Tokyo, Shibuya, Shinagawa)
• Building data: Pasco Co. Ltd. TDM 3D
• Achieved 0.592 Petaflops using over 4000 GPUs (15% efficiency)
(Map ©2012 Google, ZENRIN)
Area around the Metropolitan Government Building
• Flow profile at 25m height above the ground; wind over a 640 m x 960 m area
(Map data ©2012 Google, ZENRIN)
Current weather forecast: 5km resolution (inaccurate cloud simulation)
ASUCA typhoon simulation on TSUBAME2.0: 500m resolution, 4792 x 4696 x 48 grid on 437 GPUs (x1000 the resolution)
CFD analysis over a car body
Calculation conditions:
• Number of grid points: 3,623,878,656 (3,072 x 1,536 x 768)
• Grid resolution: 4.2mm (13m x 6.5m x 3.25m domain)
• Number of GPUs: 288 (96 nodes)
• Inflow velocity: 60 km/h

LBM, DrivAer model (BMW-Audi), Lehrstuhl für Aerodynamik und Strömungsmechanik, Technische Universität München
3,000 x 1,500 x 1,500 grid, Re = 1,000,000

Industry program: TOTO Inc. — TSUBAME with 150 GPUs vs. an in-house cluster
Drug discovery with Astellas Pharma for drugs against tropical diseases such as dengue fever
• Accelerate in-silico screening and data mining
• 100-million-atom MD simulation — M. Sekijima (Tokyo Tech), Jim Phillips (UIUC)
• Mixed-precision Amber on TSUBAME2.0 for industrial drug discovery: x10 faster and 75% energy efficient (nucleosome, 25,095 particles)
• Development cost is $500mil~$1bil per drug; even a 5-10% improvement of the process will more than pay for TSUBAME
Towards TSUBAME 3.0
Interim upgrade of TSUBAME2.0 to 2.5 (early fall 2013)
• Upgrade the TSUBAME2.0 GPUs from NVIDIA Fermi M2050 to Kepler K20X (3 x 1408 = 4224 GPUs in the TSUBAME2.0 compute nodes)
• SFP/DFP peak rises from 4.8PF/2.4PF => 17PF/5.7PF (c.f. the K Computer: 11.2/11.2)
• Acceleration of important apps — considerable improvement, summer 2013
• Significant capacity improvement at low cost and without a power increase
• TSUBAME3.0 to follow in 2H2015
TSUBAME2.0 ⇒ 2.5 Thin Node Upgrade
HP SL390G7 (developed for TSUBAME 2.0, modified for 2.5); productized as HP ProLiant SL390s, modified for TSUBAME2.5
• GPU: NVIDIA Kepler K20X x 3 (3950/1310 GFlops SFP/DFP, 6 GByte memory per GPU), replacing NVIDIA Fermi M2050 (1039/515 GFlops)
• CPU: Intel Westmere-EP 2.93GHz x 2
• Multi I/O chips, 72 PCI-e lanes (16 x 4 + 4 x 2) --- 3 GPUs + 2 IB QDR
• Memory: 54 or 96 GB DDR3-1333; SSD: 60GB x 2 or 120GB x 2
• Network: InfiniBand QDR x 2 (80Gbps)
Peak performance per node: 4.08 Tflops, ~800GB/s Mem BW, ~1KW max
2013: TSUBAME2.5 No.1 in Japan in single-precision FP, 17 Petaflops
• Total: 17.1 Petaflops SFP, 5.76 Petaflops DFP
• ~= all university centers COMBINED (9 Petaflops SFP)
• c.f. the K Computer: 11.4 Petaflops SFP/DFP
TSUBAME2.0 vs. TSUBAME2.5 — Thin Node x 1408 units
• Node machine: HP ProLiant SL390s (no change)
• CPU: Intel Xeon X5670 (6-core 2.93GHz, Westmere) x 2 (no change)
• GPU:
  – TSUBAME2.0: NVIDIA Tesla M2050 x 3 — 448 CUDA cores (Fermi), SFP 1.03 TFlops, DFP 0.515 TFlops, 3GiB GDDR5 memory, 150GB/s peak / ~90GB/s STREAM memory BW
  – TSUBAME2.5: NVIDIA Tesla K20X x 3 — 2688 CUDA cores (Kepler), SFP 3.95 TFlops, DFP 1.31 TFlops, 6GiB GDDR5 memory, 250GB/s peak / ~180GB/s STREAM memory BW
• Node performance (incl. CPU turbo boost):
  – TSUBAME2.0: SFP 3.40 TFlops, DFP 1.70 TFlops, ~500GB/s peak / ~300GB/s STREAM memory BW
  – TSUBAME2.5: SFP 12.2 TFlops, DFP 4.08 TFlops, ~800GB/s peak / ~570GB/s STREAM memory BW
• Total system performance:
  – TSUBAME2.0: SFP 4.80 PFlops, DFP 2.40 PFlops, peak ~0.70PB/s / STREAM ~0.440PB/s memory BW
  – TSUBAME2.5: SFP 17.1 PFlops (x3.6), DFP 5.76 PFlops (x2.4), peak ~1.16PB/s / STREAM ~0.804PB/s memory BW (x1.8)
Phase-field simulation for dendritic solidification [Shimokawabe, Aoki et al.]
• Developing lightweight strengthening materials by controlling the microstructure (towards a low-carbon society)
• Weak scaling on TSUBAME (single precision), mesh size per GPU (+4 CPU cores): 4096 x 162 x 130
  – TSUBAME 2.5: 3.444 PFlops (3,968 GPUs + 15,872 CPU cores), 4,096 x 5,022 x 16,640 mesh
  – TSUBAME 2.0: 2.000 PFlops (4,000 GPUs + 16,000 CPU cores), 4,096 x 6,480 x 13,000 mesh
• Peta-scale phase-field simulations can simulate the multiple dendritic growth during solidification required for the evaluation of new materials
• 2011 ACM Gordon Bell Prize, Special Achievements in Scalability and Time-to-Solution
Peta-scale stencil application: a large-scale LES wind simulation using the Lattice Boltzmann Method [Onodera, Aoki et al.]
• Large-scale wind simulation for a 10km x 10km area of metropolitan Tokyo: 10,080 x 10,240 x 512 mesh (4,032 GPUs)
• Weak scalability in single precision (N = 192 x 256 x 256 per GPU, with overlap):
  – TSUBAME 2.5: 1142 TFlops (3968 GPUs), 288 GFlops/GPU
  – TSUBAME 2.0: 149 TFlops (1000 GPUs), 149 GFlops/GPU — x1.93 per GPU
• The LES wind simulation of a 10km x 10km area at 1-m resolution has never been done before in the world
• We achieved 1.14 PFLOPS using 3968 GPUs on the TSUBAME 2.5 supercomputer
• The above peta-scale simulations were executed under the TSUBAME Grand Challenge Program, Category A, in fall 2012
AMBER pmemd benchmark (Dr. Sekijima @ Tokyo Tech)
Nucleosome = 25,095 atoms; throughput in ns/day (TSUBAME2.5 K20X vs. TSUBAME2.0 M2050):
• K20X x 8: 11.39    M2050 x 8: 3.44
• K20X x 4: 6.66     M2050 x 4: 2.22
• K20X x 2: 4.04     M2050 x 2: 1.85
• K20X x 1: 3.11     M2050 x 1: 0.99
• CPU-only MPI: 4 nodes 0.31, 2 nodes 0.15, 1 node (12 cores) 0.11
Application performance: TSUBAME2.0 vs. TSUBAME2.5 (boost ratio)
• Top500/Linpack (PFlops): 1.192 → 2.843 (x2.39)
• Green500/Linpack (GFlops/W): 0.958 → >2.400 (>x2.50)
• Semi-definite programming, nonlinear optimization (PFlops): 1.019 → 1.713 (x1.68)
• Gordon Bell dendrite stencil (PFlops): 2.000 → 3.444 (x1.72)
• LBM LES whole-city airflow (PFlops): 0.600 → 1.142 (x1.90)
• Amber 12 pmemd, 4 nodes / 8 GPUs (nsec/day): 3.44 → 11.39 (x3.31)
• GHOSTM genome homology search (sec): 19361 → 10785 (x1.80)
• MEGADOC protein docking (vs. 1 CPU core): 37.11 → 83.49 (x2.25)
TSUBAME Evolution: Towards Exascale and Extreme Big Data
[Roadmap figure: TSUBAME2.5 (5.7PF; fast I/O 250TB at 300GB/s; awards incl. Graph 500 No. 3 in 2011) ⇒ TSUBAME3.0 in 2015H2 (25-30PF; Phase 1 fast I/O 5~10PB at 1TB/s, >100 million IOPS; Phase 2 fast I/O 10TB/s; data growing from 30PB/day towards 1 ExaB/day)]


DoE Exascale Parameters — x1000 power efficiency in 10 years

System attributes            "2010" Jaguar   "2010" TSUBAME2.0   "2015"               "2020"
System peak                  2 PetaFlops     2.4 PetaFlops       100-200 PetaFlops    1 ExaFlop
Power                        6 MW            1.3 MW              15 MW                20 MW
System memory                0.3 PB          0.1 PB              5 PB                 32-64 PB
Node performance             125 GF          1.6 TF              0.5 TF or 7 TF       1 TF or 10 TF
Node memory BW               25 GB/s         0.5 TB/s            0.1 TB/s or 1 TB/s   0.4 TB/s or 4 TB/s
Node concurrency             12              O(1000)             O(100) or O(1000)    O(1000) or O(10000)
# Nodes                      18,700          1,442               50,000 or 5,000      1 million or 100,000
Total node interconnect BW   1.5 GB/s        8 GB/s              20 GB/s              200 GB/s
MTTI                         O(days)         —                   O(1 day)             O(1 day)

Exascale means on the order of a billion cores in total.
Challenges of Exascale (FLOPS, Bytes, …) (10^18)!
Various physical limitations surface all at once:
• # CPU cores: ~1 billion, at low power
  – c.f. total # of smartphones sold globally = 400 million; the K Computer ~100K nodes; Google ~1 million servers
• # Nodes: 100K ~ a few million
• Memory: x00 PB ~ ExaB
  – c.f. total memory of all PCs (300 million) shipped globally in 2011 ~ 1 ExaB; BTW 2^64 ~= 1.8x10^19 = 18 ExaB
• Storage: x ExaB — c.f. Google storage ~2 Exabytes (200 million users x 7GB+)
• All of this at 20MW (50 GFlops/W), with reliability (MTTI = days), ease of programming (a billion cores?), and acceptable cost… in 2020?!
Focused Research Towards TSUBAME 3.0 and Beyond, Towards Exa
• Green computing: ultra power-efficient HPC
• High-radix bisection networks – HW, topology, routing algorithms, placement…
• Fault tolerance – group-based hierarchical checkpointing, fault prediction, hybrid algorithms
• Scientific “extreme” Big Data – ultra-fast I/O, Hadoop acceleration, large graphs
• New memory systems – pushing the envelope of low power vs. capacity vs. BW, exploiting the deep hierarchy with new algorithms to decrease Bytes/Flop
• Post-petascale programming – OpenACC and other manycore programming substrates, task parallelism
• Scalable algorithms for many-core – apps/system/HW co-design
JST-CREST “Ultra Low Power (ULP)-HPC” Project, 2007-2012
• Low-power, high-performance model: ultra multi-core (slow & parallel, ULP), ULP-HPC SIMD/vector (GPGPU, etc.), ULP-HPC networks, and novel memory devices (MRAM, PRAM, Flash, etc.)
• Power optimization using novel components in HPC; power-aware and optimizable applications
• Auto-tuning for performance & power: x10 power efficiency at the optimization point between power and performance, towards x1000 improvement in 10 years

ABCLibScript: algorithm selection (auto-tuning specified before execution; the algorithm-selection regions and the input variables used in the cost-definition functions are annotated):

!ABCLib$ static select region start
!ABCLib$ parameter (in CacheS, in NB, in NPrc)
!ABCLib$ select sub region start
!ABCLib$ according estimated
!ABCLib$ (2.0d0*CacheS*NB)/(3.0d0*NPrc)
      [target 1 (algorithm 1)]
!ABCLib$ select sub region end
!ABCLib$ select sub region start
!ABCLib$ according estimated
!ABCLib$ (4.0d0*CacheS*dlog(NB))/(2.0d0*NPrc)
      [target 2 (algorithm 2)]
!ABCLib$ select sub region end
!ABCLib$ static select region end

Bayesian fusion of the cost model and measurements:
• Bayes model and prior distribution — the execution time estimated by the cost-definition function is fused with measured execution times:
  $y_i \sim N(\mu_i, \sigma_i^2)$, $\mu_i \mid \beta, \sigma_i^2 \sim N(x_i^T \beta, \sigma_i^2 / \kappa_0)$, $\sigma_i^2 \sim \mathrm{Inv}\text{-}\chi^2(\nu_0, \sigma_0^2)$
• Posterior predictive distribution after n measurements:
  $y_i \mid (y_{i1}, y_{i2}, \dots, y_{in}) \sim t_{\nu_n}\!\left(\mu_{in}, \sigma_{in}^2 (\kappa_n + 1)/\kappa_n\right)$, with
  $\kappa_n = \kappa_0 + n$, $\nu_n = \nu_0 + n$, $\mu_n = (\kappa_0 x_i^T \beta + n \bar{y}_i)/\kappa_n$,
  $\nu_n \sigma_n^2 = \nu_0 \sigma_0^2 + \sum_m (y_{im} - \bar{y}_i)^2 + \kappa_0 n (\bar{y}_i - x_i^T \beta)^2 / \kappa_n$, and $\bar{y}_i = \frac{1}{n}\sum_m y_{im}$
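
Below is a minimal Python sketch (an illustration, not the project's ABCLibScript/auto-tuner code) of the conjugate Normal-Inverse-χ² update sketched above: the cost-model estimate acts as the prior mean for an algorithm's runtime, measurements update it, and the algorithm with the lower posterior-predictive mean is selected. All names, priors, and numbers here are hypothetical.

def posterior_predictive(prior_mean, kappa0, nu0, sigma0_sq, measurements):
    # Conjugate update for a Normal likelihood with unknown mean and variance:
    # returns (mean, scale^2, dof) of the Student-t posterior predictive
    # for one algorithm's execution time.
    n = len(measurements)
    ybar = sum(measurements) / n
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * prior_mean + n * ybar) / kappa_n
    ss = sum((y - ybar) ** 2 for y in measurements)
    nu_sigma_sq = (nu0 * sigma0_sq + ss
                   + kappa0 * n * (ybar - prior_mean) ** 2 / kappa_n)
    sigma_n_sq = nu_sigma_sq / nu_n
    return mu_n, sigma_n_sq * (kappa_n + 1) / kappa_n, nu_n

# Hypothetical example: two algorithms, prior means taken from cost-definition
# functions such as (2.0*CacheS*NB)/(3.0*NPrc); runtimes in seconds.
candidates = {
    "algorithm1": posterior_predictive(0.82, 1.0, 1.0, 0.04, [0.90, 0.88, 0.95]),
    "algorithm2": posterior_predictive(1.10, 1.0, 1.0, 0.04, [0.70, 0.75]),
}
best = min(candidates, key=lambda name: candidates[name][0])
print("selected:", best, candidates[best])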
Aggressive Power Saving in HPC: Methodologies (Enterprise/Business Clouds vs. HPC)
• Server consolidation: Good for clouds, not applicable (NG!) for HPC
• DVFS (dynamic voltage/frequency scaling): Good for clouds, Poor for HPC
• New devices: Poor for clouds (cost & continuity), Good for HPC
• New HW & SW architecture: Poor for clouds (cost & continuity), Good for HPC
• Novel cooling: Limited for clouds (cost & continuity), Good for HPC (high thermal density)
How do we achieve x1000?
Process shrink x100
  x Many-core GPU usage x5
  x DVFS & other low-power SW x1.5
  x Efficient cooling x1.4
  = x1000 !!!
(ULP-HPC Project 2007-12; Ultra Green Supercomputing Project 2011-15)
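
Multiplying the factors above confirms the arithmetic behind the x1000 goal:

$$100 \times 5 \times 1.5 \times 1.4 = 1050 \approx 1000$$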
Statistical Power Modeling of GPUs [IEEE IGCC10]
• Estimates GPU power consumption statistically from GPU performance counters
• Linear regression model using the performance counters $c_i$ as explanatory variables:
  $p = \sum_{i=1}^{n} \alpha_i c_i + \beta$
• High accuracy (average error 4.7%) against average power consumption measured with a high-resolution power meter; accurate even with DVFS
• Prevents overfitting by ridge regression; determines optimal parameters by cross validation
• Future: model-based power optimization — a linear model shows sufficient accuracy, opening the possibility of optimizing exascale systems with O(10^8) processors
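
As a hedged illustration of this kind of counter-based power model (not the authors' actual tooling; the counters and wattages below are synthetic), a ridge-regularized least-squares fit in Python looks like this:

import numpy as np

# Rows: kernel runs; columns: GPU performance counters (e.g. instructions,
# global loads/stores, ...). All data here are synthetic placeholders.
rng = np.random.default_rng(0)
C = rng.random((50, 4))
true_alpha = np.array([30.0, 55.0, 20.0, 5.0])                 # per-counter weights (W)
p_measured = C @ true_alpha + 40.0 + rng.normal(0.0, 2.0, 50)  # 40 W static power + noise

# Ridge regression: minimize ||X w - p||^2 + lam ||w||^2;
# the trailing column of ones models the intercept (static power).
X = np.hstack([C, np.ones((50, 1))])
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ p_measured)
alpha_hat, beta_hat = w[:-1], w[-1]

pred = X @ w
avg_err = np.mean(np.abs(pred - p_measured) / p_measured) * 100
print("counter weights:", alpha_hat, "static power:", beta_hat)
print(f"average relative error: {avg_err:.1f}%")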
Power efficiency in dendrite applications was measured on TSUBAME1.0 through the JST-CREST ULP-HPC prototype running the Gordon Bell dendrite app.
TSUBAME-KFC: Ultra-Green Supercomputer Testbed [2011-2015]
Fluid submersion cooling + outdoor air cooling + high-density GPU supercomputing in a 20-foot container
• Compute nodes: NEC/SMC 1U server x 40; per node: Intel IvyBridge 2.1GHz 6-core x 2, NVIDIA Tesla K20X GPU x 4, DDR3 memory 64GB, SSD 120GB, 4x FDR InfiniBand 56Gbps
• Total peak: 210 TFlops (DP), 630 TFlops (SP)
• Heat dissipation: GRC submersion rack with heat exchanger — processors 80~90°C ⇒ coolant oil (Spectrasyn8) 35~45°C ⇒ water 25~35°C ⇒ cooling tower to outdoor air
• Facility: 20-foot container (16 m²)
• Targets: world’s top power efficiency (>3 GFlops/Watt), average PUE 1.05, lower component power, field-testing of ULP-HPC results
TSUBAME-KFC: Towards TSUBAME3.0 and Beyond
Shooting for #1 on the Nov. 2013 Green 500!

Machine                       Power    Linpack    Linpack    Factor           Total Mem BW    Mem BW
                                       Perf (PF)  MFLOPS/W   (incl. cooling)  TB/s (STREAM)   MByte/s/W
Earth Simulator 1             10MW     0.036      3.6        13,400           160             16
Tsubame1.0 (2006Q1)           1.8MW    0.038      21         2,368            13              7.2
ORNL Jaguar (XT5, 2009Q4)     ~9MW     1.76       196        256              432             48
Tsubame2.0 (2010Q4)           1.8MW    1.2        667        75               440             244
K Computer (2011Q2)           ~16MW    10         625        80               3300            206
BlueGene/Q (2012Q1)           ~12MW?   17         ~1400      ~35              3000            250
TSUBAME2.5 (2013Q3)           1.4MW    ~3         ~2100      ~24              802             572
Tsubame3.0 (2015Q4~2016Q1)    1.5MW    ~20        ~13,000    ~4               6000            4000
EXA (2019~20)                 20MW     1000       50,000     1                100K            5000

(Generational gains noted on the slide: roughly x31.6 in Linpack MFLOPS/W and x34 in memory-BW per watt from Tsubame1.0 to 2.0, and ~x20 in MFLOPS/W and ~x13.7 in total memory BW from Tsubame2.0 to 3.0.)
Extreme Big Data (EBD): Next-Generation Big Data Infrastructure Technologies Towards Yottabytes/Year
Principal Investigator: Satoshi Matsuoka
Global Scientific Information and Computing Center, Tokyo Institute of Technology
The current “Big Data” is not really that big…
• Typical “real” definition: “mining people’s privacy data to make money”
• Corporate data usually sit in data-warehoused silos -> limited volume, gigabytes~terabytes, seldom petabytes
• Processing involves simple O(n) algorithms, or those that can be accelerated with DB-inherited indexing algorithms
• Executed on re-purposed commodity “web” servers linked with 1Gbps networks running Hadoop/HDFS
• Vicious cycle of stagnation in innovation…
• NEW: breaking down of silos ⇒ convergence with supercomputing with Extreme Big Data
But “Extreme Big Data” will change everything
• “Breaking down of silos” (Rajeeb Hazra, Intel VP of Technical Computing)
• Already happening in science & engineering due to the Open Data movement
• More complex analysis algorithms: O(n log n), O(m x n), …
• Will become the NORM for competitiveness reasons
We will have tons of unknown genes [slide courtesy Yutaka Akiyama @ Tokyo Tech]
Metagenome analysis
• Directly sequencing uncultured microbiomes obtained from the target environment and analyzing the sequence data
  – Finding novel genes from unculturable microorganisms
  – Elucidating the composition of species/genes of environments
• Examples of microbiomes: gut microbiome, human body, soil, sea
Results from the Akiyama group @ Tokyo Tech
Ultra-high-sensitivity “big data” metagenome sequence analysis of the human oral microbiome
• Required > 1 million node*hours on the K computer (572.8 M reads/hour on 82,944 nodes, 663,552 cores, 2012)
• World’s most sensitive sequence analysis (based on an amino-acid similarity matrix)
• Discovered at least three microbiome clusters with functional differences (integrated 422 experiment samples taken from 9 different oral parts)
• Mapped onto the metabolic pathway map: inside of the dental arch, outside of the dental arch, dental plaque
Extreme Big Data in Genomics: impact of new-generation sequencers [slide courtesy Yutaka Akiyama @ Tokyo Tech]
• Sequencing data (bp) per $ grows x4000 every 5 years — c.f. HPC improves x33 in 5 years
(Lincoln Stein, Genome Biology, vol. 11(5), 2010)
Extremely “Big” Graphs
• Large-scale graphs in various fields:
  – US road network: 24 million vertices, 58 million edges
  – Twitter follow-ship (2009): 61.6 million vertices, 1.47 billion edges
  – Neuronal network @ Human Brain Project: 89 billion vertices, 100 trillion edges
  – Cyber-security: 15 billion log entries / day
• Fast and scalable graph processing by using HPC
  – Graph500 on the K computer: 65,536 nodes, 5524 GTEPS
  – c.f. an Android tablet (Tegra3 1.7GHz, 1GB RAM): 0.15 GTEPS, 64.12 MTEPS/W
[Chart: problem scale in log2(n) vertices vs. log2(m) edges, from the USA road networks (NY, LKS, USA) and Twitter tweets/day up through the Graph500 classes Toy/Mini/Small/Medium/Large/Huge (~1 billion to ~1 trillion vertices and edges) and the Human Brain Project]
Towards Continuous Billion-Scale Social Simulation with Real-Time Streaming Data (Toyotaro Suzumura, IBM / Tokyo Tech)
• Application: target area = the planet (OpenStreetMap), 7 billion people
• Input data:
  – Road network (OpenStreetMap) for the planet: 300 GB (XML)
  – Trip data for 7 billion people: 10 KB (1 trip) x 7 billion = 70 TB
  – Real-time streaming data (e.g. social sensors, physical data)
• Simulated output for 1 iteration: 700 TB
Graph500 “Big Data” Benchmark
• Kronecker graph, BSP problem (BFS kernel)
• Edge-quadrant probabilities: A: 0.57, B: 0.19, C: 0.19, D: 0.05
• November 15, 2010 — “Graph 500 Takes Aim at a New Kind of HPC”, Richard Murphy (Sandia NL => Micron): “I expect that this ranking may at times look very different from the TOP500 list. Cloud architectures will almost certainly dominate a major chunk of part of the list.”
• Reality: Top500 supercomputers dominate, and no cloud IDCs appear at all; TSUBAME2.0 was #3 (Nov. 2011) and #4 (Jun. 2012)
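
For reference, a minimal Python sketch (not the Graph500 reference code) of generating a Kronecker/R-MAT edge list with the A/B/C/D probabilities quoted above; the scale and edge factor are illustrative:

import random

def kronecker_edges(scale, edgefactor=16, A=0.57, B=0.19, C=0.19, D=0.05, seed=1):
    # Each edge picks one of four quadrants per bit with probabilities A, B, C, D,
    # building the endpoints of a 2^scale-vertex R-MAT/Kronecker graph.
    rng = random.Random(seed)
    num_edges = edgefactor * (1 << scale)
    edges = []
    for _ in range(num_edges):
        u = v = 0
        for _bit in range(scale):
            r = rng.random()
            if r < A:
                quad = (0, 0)
            elif r < A + B:
                quad = (0, 1)
            elif r < A + B + C:
                quad = (1, 0)
            else:
                quad = (1, 1)
            u = (u << 1) | quad[0]
            v = (v << 1) | quad[1]
        edges.append((u, v))
    return edges

# Tiny example: scale 10 -> 1024 vertices, ~16K edges
print(len(kronecker_edges(10)), "edges generated")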
Tokyo Tech supercomputer TSUBAME 2.0 (#4 Top500, 2010) vs. a major northern-Japanese cloud datacenter (2013):
• TSUBAME 2.0: ~1500 nodes (compute & storage) on a full-bisection, multi-rail optical network — injection 80Gbps/node, bisection 220 Terabps; advanced silicon photonics (40G on a single CMOS die, 1490nm DFB, 100km fiber)
• The cloud datacenter: 8 zones of ~700 nodes each (2 Juniper EX4200 zone switches per zone as a virtual chassis, EX8208 and MX480 aggregation, 10GbE links with LACP out to the Internet) — total 5600 nodes, injection 1Gbps/node, bisection 160 Gigabps
• Roughly x1000 difference in bisection bandwidth (220 Tbps vs. 160 Gbps)
But what does “220Tbps” mean?
Global IP Traffic, 2011-2016 (source: Cisco), by type — PB per month / average bitrate in Tbps:

Type              2011            2012            2013            2014            2015            2016            CAGR 2011-2016
Fixed Internet    23,288 / 71.9   32,990 / 101.8  40,587 / 125.3  50,888 / 157.1  64,349 / 198.6  81,347 / 251.1  28%
Managed IP        6,849 / 21.1    9,199 / 28.4    11,846 / 36.6   13,925 / 43.0   16,085 / 49.6   18,131 / 56.0   21%
Mobile data       597 / 1.8       1,252 / 3.9     2,379 / 7.3     4,215 / 13.0    6,896 / 21.3    10,804 / 33.3   78%
Total IP traffic  30,734 / 94.9   43,441 / 134.1  54,812 / 169.2  69,028 / 213.0  87,331 / 269.5  110,282 / 340.4 29%

The TSUBAME2.0 network has TWICE the capacity of the global Internet, which is used by 2.1 billion users.
“Convergence” at future extreme scale for computing and data (in clouds?)
• HPC: x1000 in 10 years (CAGR ~= 100%)
• IDC: x30 in 10 years; server unit sales are flat (replacement demand) (CAGR ~= 30-40%)
(Source: Assessing trends over time in performance, costs, and energy use for servers, Intel, 2009)
What does this all mean?
• “Leveraging of mainframe technologies in HPC has
been dead for some time.”
• But will leveraging Cloud/Mobile be sufficient?
• NO! They are already falling behind, and will be
perpetually behind
– CAGR of Clouds 30%, HPC 100%: all data supports it
– Stagnation in network, storage, scaling, …

• Rather, HPC will be the technology driver for
future Big Data, for Cloud/Mobile to leverage!
– Rather than repurposed standard servers
Future “Extreme Big Data”
• NOT mining terabytes of silo data
• Peta~zettabytes of data, ultra-high-BW data streams
• Highly unstructured, irregular; complex correlations between data from multiple sources
• Extreme capacity, bandwidth, and compute all required

Extreme Big Data is not just traditional HPC — analysis of required system properties [slide courtesy Alok Choudhary, Northwestern U.]

[Radar chart: “Extreme-Scale Computing” vs. “Big Data Analytics” vs. a “BDEC Knowledge Discovery Engine”, rated 0-1 on the axes processor speed, OPS, memory/ops, algorithmic variety, power-optimization opportunities, communication-pattern variability, approximate computations, communication latency tolerance, write performance, read performance, and local persistent storage]
EBD Research Scheme
• Future non-silo Extreme Big Data apps: ultra-large-scale graphs and social infrastructures, large-scale metagenomics, massive sensors and data assimilation in weather prediction
• Co-design between these apps and the EBD system software (incl. the EBD object system): EBD Bag, Cartesian-plane KVS, EBD KVS, graph store
• Exascale Big Data HPC: a convergent architecture (Phases 1~4) with large-capacity NVM and a high-bisection network — in contrast to cloud IDCs (very low BW & efficiency) and today’s supercomputers (compute- & batch-oriented)
• Target node: NVM/Flash and DRAM stacked over a TSV interposer and PCB, with low-power CPUs beside a high-powered main CPU; 2Tbps HBM with 4~6 HBM channels, 1.5TB/s DRAM & NVM BW; 30PB/s I/O BW possible, 1 Yottabyte/year
Phase 4: 2019-20, DRAM + NVM + CPU with 3D/2.5D die stacking
— the ultimate convergence of Big Data and Extreme Computing —
• NVM/Flash and DRAM stacked with low-power CPUs and a high-powered main CPU on a TSV interposer and PCB
• 2Tbps HBM, 4~6 HBM channels, 1.5TB/s DRAM & NVM BW
• 30PB/s I/O BW possible — 1 Yottabyte/year
Preliminary I/O Performance Evaluation on GPU and NVRAM
How to design local storage for next-generation supercomputers?
• Designed a local I/O prototype using 16 mSATA SSDs on a RAID card on the motherboard
  – Capacity: 4TB; read bandwidth: 8 GB/s
• I/O performance of multiple mSATA SSDs (raw mSATA 4KB, RAID0 1MB, RAID0 64KB): ~7.39 GB/s from 16 mSATA SSDs with RAID0 enabled
• I/O performance from GPU to multiple mSATA SSDs: ~3.06 GB/s from 8 mSATA SSDs to the GPU (matrix sizes 0.274 GB to 140 GB)
Algorithm Kernels on EBD: Large-Scale BFS Using NVRAM

1. Introduction
• Large-scale graph processing appears in various domains, and the required DRAM resources have increased
• Spread of flash devices — pros: price per bit, energy consumption; cons: latency, throughput
• Using NVRAM for large-scale graph processing has the potential for minimal performance degradation

2. Hybrid BFS: switch between two approaches, top-down and bottom-up, based on the number of frontier vertices n_frontier relative to the number of all vertices n_all, with switching parameters α, β

3. Proposal
• (1) Offload small-access data to NVRAM
• (2) BFS with reading data from NVRAM

4. Evaluation
• DRAM only (β=10α): 5.2 GTEPS; DRAM+SSD (β=0.1α): 2.8 GTEPS (47.1% down), with the switching parameter α swept from 1e4 to 1e7
• We could reduce the DRAM size by half with 47.1% performance degradation (130M vertices, 2.1G edges)
• c.f. Pearce et al.: 13 times larger datasets at 52 MTEPS (DRAM 1TB, 12TB NVRAM)
• We are working on multiplexed I/O → multiplexed I/O improves NVRAM I/O performance
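
A minimal Python sketch (illustrative only, not the authors' implementation) of the direction-switching idea in hybrid BFS: expand top-down while the frontier is small, and switch to bottom-up once it grows past a threshold controlled by a parameter in the spirit of α above:

def hybrid_bfs(adj, source, alpha=14.0):
    # adj: dict vertex -> list of neighbours (undirected graph).
    # Returns BFS levels; switches between top-down and bottom-up sweeps
    # in the spirit of direction-optimizing BFS.
    n = len(adj)
    level = {v: -1 for v in adj}
    level[source] = 0
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        next_frontier = set()
        if len(frontier) * alpha < n:
            # top-down: frontier vertices push to unvisited neighbours
            for u in frontier:
                for w in adj[u]:
                    if level[w] == -1:
                        level[w] = depth
                        next_frontier.add(w)
        else:
            # bottom-up: unvisited vertices look for a parent in the frontier
            for w in adj:
                if level[w] == -1 and any(level[u] == depth - 1 for u in adj[w]):
                    level[w] = depth
                    next_frontier.add(w)
        frontier = next_frontier
    return level

# Toy graph; both directions give the same levels here.
g = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(hybrid_bfs(g, 0))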
High-Performance Sorting
Fast algorithms: distribution-based vs. comparison-based
• Comparison of keys: N log(N) classical sorts (quick, merge, etc.) and bitonic sort; handle variable-length / short / long keys such as alphabetic strings (apple, apricot, banana, kiwi)
• Integer (distribution) sorts:
  – MSD radix sort: doesn’t have to examine all characters — useful in computational genomics (A,C,G,T)
  – LSD radix sort (e.g. THRUST): high efficiency on small fixed-length keys
• GPUs are good at counting numbers, so efficient implementations are good for GPU nodes; hybrid approaches — the best mix is still to be found; scalability requires balancing I/O and computation
• Map-Reduce (Hadoop): easy to use but not that efficient
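
As a small illustration of the LSD radix sort mentioned above (a sketch; the byte-wise passes and key width are arbitrary choices):

def lsd_radix_sort(keys, key_bytes=4):
    # Stable counting-sort passes from the least significant byte upward,
    # for non-negative integers that fit in key_bytes bytes.
    for shift in range(0, 8 * key_bytes, 8):
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[(k >> shift) & 0xFF].append(k)            # distribute by current byte
        keys = [k for bucket in buckets for k in bucket]      # stable gather
    return keys

data = [0x1A2B, 7, 0xFFFF, 300, 42, 300]
print(lsd_radix_sort(data))   # -> [7, 42, 300, 300, 6699, 65535]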
Twitter network (application of the Graph500 benchmark)
• Follow-ship network 2009: 41 million vertices and 2.47 billion edges (user i → user j (i,j)-edges)
• Our NUMA-optimized BFS on a 4-way Xeon system: 69 ms/BFS ⇒ 21.28 GTEPS
• Six degrees of separation: frontier size per BFS level, with source user 21,804,357:

Level   Frontier size   Freq. (%)   Cum. Freq. (%)
0       1               0.00        0.00
1       7               0.00        0.00
2       6,188           0.01        0.01
3       510,515         1.23        1.24
4       29,526,508      70.89       72.13
5       11,314,238      27.16       99.29
6       282,456         0.68        99.97
7       11,536          0.03        100.00
8       673             0.00        100.00
9       68              0.00        100.00
10      19              0.00        100.00
11      10              0.00        100.00
12      5               0.00        100.00
13      2               0.00        100.00
14      2               0.00        100.00
15      2               0.00        100.00
Total   41,652,230      100.00      -
100,000-Times-Fold EBD “Convergent” System Overview
• Tasks 5-1~5-3: EBD application co-design and validation — large-scale graphs and social infrastructure apps, large-scale genomic correlation, data assimilation in large-scale sensors and exascale atmospherics
• Tasks 1-2: EBD distributed object store on 100,000 NVM extreme compute-and-data nodes (EBD Bag, Cartesian-plane KVS, EBD KVS, graph store), over ultra-high-BW, low-latency NVM and network, processor-in-memory, 3D stacking
• Task 3: EBD programming system
• Task 4: EBD “converged” real-time resource scheduling
• Task 6: EBD performance modeling & evaluation
• Platform: ultra-parallel, low-power-I/O EBD “convergent” supercomputer (TSUBAME 2.0/2.5 ⇒ TSUBAME 3.0), ~10TB/s ⇒ ~100TB/s ⇒ ~10PB/s
Summary
• TSUBAME1.0 -> 2.0 -> 2.5 -> 3.0 -> …
  – TSUBAME 2.5: Number 1 in Japan, 17 Petaflops SFP
  – A template for future supercomputers and IDC machines
• TSUBAME3.0 in early 2016
  – New supercomputing leadership
  – Tremendous power efficiency, extreme big data, extremely high reliability
• Lots of background R&D for TSUBAME3.0 and towards exascale
  – Green computing: ULP-HPC & TSUBAME-KFC
  – Extreme Big Data – convergence of HPC and IDC!
  – Exascale resilience
  – Programming with millions of cores
  – …
• Please stay tuned! (乞うご期待。応援をお願いします。— Stay tuned; we appreciate your support.)

  • 1. TSUBAME2.5 to 3.0 and Convergence with Extreme Big Data Satoshi Matsuoka Professor Global Scientific Information and Computing (GSIC) Center Tokyo Institute of Technology Fellow, Association for Computing Machinery (ACM) Rakuten Technology Conference 2013 2013/10/26 Tokyo, Japan
  • 2. Supercomputers from the Past Fast, Big, Special, Inefficient, Evil device to conquer the world…
  • 3. Let us go back to the mid ’70s Birth of “microcomputers” and arrival of commodity computing (start of my career) • Commodity 8-bit CPUs… – Intel 4004/8008/8080/8085, Zilog Z-80, Motorola 6800, MOS Tech. 6502, … • Lead to hobbyist computing… – Evaluation boards: Intel SDK-80, Motorola MEK6800D2, MOS Tech. KIM-1, (in Japan) NEC TK-80, Fujitsu Lkit-8, … – System Kits: MITS Altair 8800/680b, IMSAI 8080, Proc. Tech. SOL-20, SWTPC 6800, … • & Lead to early personal computers – Commodore PET, Tandy TRS-80, Apple II – (in Japan): Hitachi Basic Master, NEC CompoBS / PC8001, Fujitsu FM-8, …
  • 4. Supercomputing vs. Personal Computing in the late 1970s. • Hitachi Basic Master (1978) – “The first PC in Japan” – Motorola 6802--1Mhz, 16KB ROM, 16KB RAM – Linpack in BASIC: Approx. 70-80 FLOPS (1/1,000,000) • We got “simulation” done (in assembly language) – Nintendo NES (1982) • MOS Technology 6502 1Mhz (Same as Apple II) – “Pinball” by Matsuoka & Iwata (now CEO Nintendo) • Realtime dynamics + collision + lots of shortcuts • Average ~a few KFLOPS Cf. Cray-1 Running Linpack 10 (1976) Linpack 80-90MFlops (est.)
  • 5. Then things got accelerated around the mid 80s to mid 90s (rapid commoditization towards what we use now) • PC CPUs: Intel 8086/286/386/486/Pentium (Superscalar&fast FP x86), Motorola 68000/020/030/040, … to Xeons, GPUs, Xeon Phi’s – C.f. RISCs: SPARC, MIPS, PA-RISC, IBM Power, DEC Alpha, … • Storage Evolution: Cassettes, Floppies to HDDs, optical disk to Flash • Network Evolution: RS-232C to Ethernet now to FDR Infinininband • PC (incl. I/O): IBM PC “Clones” and Macintoshes: ISA to VLB to PCIe • Software Evolution: CP/M to MS-DOS to Windows, Linux, • WAN Evolution: RS-232+Modem+BBS to Modem+Internet to ISDN/ADSL/FTTH broadband, DWDM Backbone, LTE, … • Internet Evolution: email + ftp to Web, Java, Ruby, … • Then Clusters, Grid/Clouds, 3-D Gaming, and Top500 all started in the mid 90s(!), and commoditized supercomputing
  • 6. Modern Day Supercomputers  Now supercomputers “look like” IDC servers  High-End COTS dominate Linux based machine with standard + HPC OSS Software Stack NEC Confidential
  • 8. Top Supercomputers vs. Global IDC K Computer (#1 2011-12) Riken-AICS Fujitsu Sparc VIII-fx Venus CPU 88,000 nodes, 800,000CPU cores ~11 Petaflops (1016) 1.4 Petabyte memory, 13 MW Power 864 racks、3000m2 Tianhe2 (#1 2013) China Gwanjou 48,000 KNC Xeon Phi + 36,000 Ivy Bridge Xeon 18,000 nodes, >3 Million CPU cores 54 Petaflops (1016) 0.8 Petabyte memory, 20 MW Power ??? racks、???m2 C.f. Amazon ~= 450,000 Nodes, ~3 million Cores #1 2012 IBM BlueGene/Q “Sequoia” Lawrence Livermore National Lab DARPA study IBM PowerPC System-On-Chip 98,000 nodes, 1.57million Cores 2020 Exaflop (1018) ~20 Petaflops 100 million~ 1.6 Petabytes, 8MW, 96 racks NEC Confidential 1 Billion Cores
  • 9. Scalability and Massive Parallelism  More nodes & core => Massive Increase in parallelism Faster, “Bigger” Simulation Qualitative Difference Performance BAD! GOOD! BAD! Ideal Linear Scaling Difficult to Achieve Limitations in Power, Cost, Reliability Limitations in Scaling CPU Cores ~= Parallelism NEC Confidential
  • 11. 2006: TSUBAME1.0 as No.1 in Japan All University Centers COMBINED 45 TeraFlops > Total 85 TeraFlops, #7 Top500 June 2006 Earth Simulator 40TeraFlops #1 2002~2004
  • 12. TSUBAME2.0 Nov. 1, 2010 “The Greenest Production Supercomputer in the World” TSUBAME 2.0 New Development 32nm 40nm >12TB/s Mem BW >400GB/s Mem BW >1.6TB/s Mem BW 35KW Max 80Gbps NW BW ~1KW max 12 >600TB/s Mem BW 220Tbps NW Bisecion BW 1.4MW Max
  • 13. 1500 1250 1000 750 500 CPU 250 0 GPU GPU Memory Bandwidth [GByte/s] Peak Performance [GFLOPS] 1750 Performance Comparison of CPU vs. 200 GPU 160 120 80 CPU 40 0 x5-6 socket-to-socket advantage in both compute and memory bandwidth, Same power (200W GPU vs. 200W CPU+memory+NW+…)
  • 14. TSUBAME2.0 Compute Node Thin Node Infiniband QDR x2 (80Gbps) 1.6 Tflops 400GB/s Mem BW 80GBps NW ~1KW max Productized as HP ProLiant SL390s HP SL390G7 (Developed for TSUBAME 2.0) GPU: NVIDIA Fermi M2050 x 3 515GFlops, 3GByte memory /GPU CPU: Intel Westmere-EP 2.93GHz x2 (12cores/node) Multi I/O chips, 72 PCI-e (16 x 4 + 4 x 2) lanes --- 3GPUs + 2 IB QDR Memory: 54, 96 GB DDR3-1333 SSD:60GBx2, 120GBx2 NEC Confidential Total Perf 2.4PFlops Mem: ~100TB SSD: ~200TB 4-1
  • 15. TSUBAME2.0 Storage Overview TSUBAME2.0 Storage 11PB (7PB HDD, 4PB Tape) Infiniband QDR Network for LNET and Other Services QDR IB (×4) × 8 QDR IB(×4) × 20 GPFS#1 SFA10k #1 SFA10k #2 /work9 “Global Work Space” #1 GPFS with HSM SFA10k #3 SFA10k #4 SFA10k #5 /work0 /work19 /gscr0 “Global Work Space” #2 “Global Work Space” #3 Lustre “Scratch” 3.6 PB 30~60GB/s GPFS#2 GPFS#3 10GbE × 2 GPFS#4 HOME HOME System application iSCSI SFA10k #6 “cNFS/Clusterd Samba w/ GPFS” “NFS/CIFS/iSCSI by BlueARC” Home Volumes 1.2PB Parallel File System Volumes 2.4 PB HDD + 〜4PB Tape “Thin node SSD” “Fat/Medium node SSD” 250 TB, 300~500GB/s Scratch 130 TB=> 500TB~1PB Grid Storage
  • 16. TSUBAME2.0 Storage Overview TSUBAME2.0 Storage 11PB (7PB HDD, 4PB Tape) Infiniband QDR Network for LNET and Other Services QDR IB (×4) × 8 QDR IB(×4) × 20 GPFS#1 Concurrent Parallel I/O (e.g. MPI-IO) SFA10k #1 SFA10k #2 /work9 SFA10k #3 SFA10k #4 SFA10k #5 /work0 /work19 /gscr0 Read mostly I/O (data-intensive apps, parallel workflow, “Global Work “Global Work parameterSpace” #1 survey) Space” #2 “Global Work Space” #3 GPFS with HSM “Scratch” Lustre 3.6 Fine-grained R/W PB I/O Parallel File System Volumes (checkpoints, temporary files, Big Data processing) GPFS#2 GPFS#3 10GbE × 2 GPFS#4 • Home storage for computing nodes •HOME Cloud-based campus storage HOME services System application iSCSI SFA10k #6 “cNFS/Clusterd Samba w/ GPFS” “NFS/CIFS/iSCSI by BlueARC” Home Volumes 1.2PB Data transfer service between SCs/CCs 2.4Long-Term PB HDD + Backup 〜4PB Tape “Thin node SSD” “Fat/Medium node SSD” 250 TB, 300GB/s Scratch 130 TB=> 500TB~1PB HPCI Storage
  • 17. 3500 Fiber Cables > 100Km w/DFB Silicon Photonics End-to-End 7.5GB/s, > 2us Non-Blocking 200Tbps Bisection NEC Confidential
  • 18. 2010: TSUBAME2.0 as No.1 in Japan — total 2.4 Petaflops, #4 on the Top500 in Nov. 2010, more than all other Japanese centers on the Top500 COMBINED (2.3 PetaFlops).
  • 19. TSUBAME Wins Awards… "Greenest Production Supercomputer in the World" — Green500, Nov. 2010 and June 2011 (#4 Top500 Nov. 2010); 3 times more power efficient than a laptop!
  • 20. TSUBAME Wins Awards… ACM Gordon Bell Prize 2011, Special Achievements in Scalability and Time-to-Solution: the 2.0 Petaflops dendrite simulation, "Peta-Scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer".
  • 21. TSUBAME Wins Awards… Commendation for Science & Technology by the Ministry of Education, 2012 (MEXT Minister's Award): Prize for Science & Technology, Development Category — Development of the Greenest Production Peta-scale Supercomputer; Satoshi Matsuoka, Toshio Endo, Takayuki Aoki.
  • 22. Precise blood-flow simulation of arteries on TSUBAME2.0 (Bernaschi et al., IAC-CNR, Italy) — personal CT scan + simulation => accurate diagnostics of cardiac illness; 5 billion red blood cells + 10 billion degrees of freedom.
  • 23. MUPHY: multiphysics simulation of blood flow (Melchionna, Bernaschi et al.) — combined Lattice-Boltzmann (LB) simulation for the plasma and Molecular Dynamics (MD) for red blood cells, on realistic geometry from a CAT scan. The fluid (blood plasma, Lattice Boltzmann) and the bodies (red blood cells, extended MD) are coupled; RBCs are represented as ellipsoidal particles. The irregular mesh is partitioned with the PT-SCOTCH tool, taking the cutoff distance into account. Two levels of parallelism: CUDA (on GPU) + MPI. 1 billion mesh nodes for the LB component, 100 million RBCs, 4,000 GPUs, 0.6 Petaflops — ACM Gordon Bell Prize 2011 Honorable Mention.
  • 24. Lattice-Boltzmann LES with a coherent-structure SGS model [Onodera & Aoki 2013] — the coherent-structure Smagorinsky model uses the second invariant of the velocity gradient tensor (Q) and the energy dissipation (ε); the model parameter is determined locally from the second invariant. Well suited to turbulent flow around complex objects and to large-scale parallel computation (a commonly cited form of the model is given below). Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
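For reference, a commonly cited form of the coherent-structure Smagorinsky model (following Kobayashi's formulation; the exact constants used by Onodera & Aoki are an assumption here, not taken from the slide):

```latex
\nu_{\mathrm{SGS}} = C\,\Delta^{2}\,|\bar{S}|,\qquad
C = C_{1}\,|F_{CS}|^{3/2},\qquad
F_{CS} = \frac{Q}{E},
```
```latex
Q = \tfrac{1}{2}\left(\bar{W}_{ij}\bar{W}_{ij} - \bar{S}_{ij}\bar{S}_{ij}\right),\qquad
E = \tfrac{1}{2}\left(\bar{W}_{ij}\bar{W}_{ij} + \bar{S}_{ij}\bar{S}_{ij}\right),
```

where S̄_ij and W̄_ij are the resolved strain-rate and vorticity tensors and C₁ is a fixed constant (≈ 1/22 in Kobayashi's original paper), so the model "parameter" C is computed locally from the velocity gradient without any dynamic averaging.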
  • 25. Computational area — entire downtown Tokyo: a 10 km × 10 km area covering the major part of Tokyo, including Shinjuku-ku, Chiyoda-ku, Minato-ku, Meguro-ku, and Chuo-ku. Building data: Pasco Co. Ltd. TDM 3D. Achieved 0.592 Petaflops using over 4,000 GPUs (15% efficiency). (Map: ©2012 Google, ZENRIN.) Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
  • 26. Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
  • 27. Area around the Metropolitan Government Building — flow profile at 25 m height above the ground, wind over a 640 m × 960 m area. (Map data ©2012 Google, ZENRIN.) Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
  • 28. Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
  • 29. Current weather forecasts use 5 km resolution (inaccurate cloud simulation); the ASUCA typhoon simulation on TSUBAME2.0 runs at 500 m resolution on a 4792×4696×48 grid using 437 GPUs (x1000 the resolution).
  • 30.
  • 31. CFD analysis over a car body — calculation conditions: number of grid points 3,623,878,656 (3,072 × 1,536 × 768); grid resolution 4.2 mm (13 m × 6.5 m × 3.25 m domain); number of GPUs 288 (96 nodes); at 60 km/h.
  • 32. LBM, DrivAer body (BMW–Audi), Lehrstuhl für Aerodynamik und Strömungsmechanik, Technische Universität München; 3,000 × 1,500 × 1,500 grid, Re = 1,000,000.
  • 33.
  • 34.
  • 39. Towards TSUBAME3.0 — interim upgrade of TSUBAME2.0 to 2.5 (early fall 2013): replace the TSUBAME2.0 GPUs, NVIDIA Fermi M2050, with Kepler K20X (3 x 1408 = 4,224 GPUs in the TSUBAME2.0 compute-node configuration). SFP/DFP peak goes from 4.8 PF / 2.4 PF => 17 PF / 5.7 PF (c.f. the K Computer: 11.2 / 11.2). Accelerates important apps with considerable improvement; a significant capacity improvement at low cost and without a power increase, in summer 2013. TSUBAME3.0 to follow in 2H2015.
  • 40. TSUBAME2.0 ⇒ 2.5 Thin Node Upgrade — HP SL390G7 (developed for TSUBAME 2.0, modified for 2.5; productized as HP ProLiant SL390s). Per node: peak 4.08 TFlops, ~800 GB/s mem BW, 80 Gbps NW (InfiniBand QDR x2), ~1 kW max. GPU: NVIDIA Kepler K20X x 3 (1310 GFlops, 6 GB memory per GPU), replacing the Fermi M2050 (M2050: 1039/515 GFlops SFP/DFP; K20X: 3950/1310 GFlops). CPU: Intel Westmere-EP 2.93 GHz x 2. Multiple I/O chips, 72 PCIe lanes (16 x 4 + 4 x 2) — 3 GPUs + 2 IB QDR. Memory: 54 or 96 GB DDR3-1333. SSD: 60 GB x 2 or 120 GB x 2.
  • 41. 2013: TSUBAME2.5 is No.1 in Japan in single-precision FP at 17 Petaflops — total 17.1 Petaflops SFP / 5.76 Petaflops DFP, vs. all other university centers COMBINED (~9 Petaflops SFP) and the K Computer (11.4 Petaflops SFP/DFP).
  • 42. TSUBAME2.0 vs. TSUBAME2.5 (Thin Node x 1408 units; a consistency check of the totals follows below):
    Node machine: HP ProLiant SL390s (no change)
    CPU: Intel Xeon X5670 (6-core 2.93 GHz, Westmere) x 2 (no change)
    GPU — TSUBAME2.0: NVIDIA Tesla M2050 x 3 (448 CUDA cores (Fermi), SFP 1.03 TFlops, DFP 0.515 TFlops, 3 GiB GDDR5, 150 GB/s peak / ~90 GB/s STREAM memory BW); TSUBAME2.5: NVIDIA Tesla K20X x 3 (2688 CUDA cores (Kepler), SFP 3.95 TFlops, DFP 1.31 TFlops, 6 GiB GDDR5, 250 GB/s peak / ~180 GB/s STREAM memory BW)
    Node performance (incl. CPU turbo boost) — TSUBAME2.0: SFP 3.40 TFlops, DFP 1.70 TFlops, ~500 GB/s peak / ~300 GB/s STREAM memory BW; TSUBAME2.5: SFP 12.2 TFlops, DFP 4.08 TFlops, ~800 GB/s peak / ~570 GB/s STREAM memory BW
    Total system performance — TSUBAME2.0: SFP 4.80 PFlops, DFP 2.40 PFlops, peak ~0.70 PB/s / STREAM ~0.440 PB/s memory BW; TSUBAME2.5: SFP 17.1 PFlops (x3.6), DFP 5.76 PFlops (x2.4), peak ~1.16 PB/s / STREAM ~0.804 PB/s memory BW (x1.8)
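A quick consistency check of the TSUBAME2.5 totals (the ~0.07 TF DFP per Westmere socket is an approximation of mine, not a figure from the slide):

```latex
1408 \times 3 \times 1.31~\mathrm{TF} \approx 5.53~\mathrm{PF}\;(\text{GPUs, DFP}),\qquad
5.53 + 1408 \times 2 \times 0.07~\mathrm{TF} \approx 5.7~\mathrm{PF}\;(\text{DFP total})
```
```latex
1408 \times 3 \times 3.95~\mathrm{TF} \approx 16.7~\mathrm{PF}\;(\text{GPUs, SFP})
\;\Rightarrow\; \approx 17.1~\mathrm{PF}\;(\text{SFP total, with CPUs})
```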
  • 43. Phase-field simulation of dendritic solidification [Shimokawabe, Aoki et al.] — weak scaling on TSUBAME (single precision), mesh size per (1 GPU + 4 CPU cores): 4096 x 162 x 130. TSUBAME 2.5: 3.444 PFlops (3,968 GPUs + 15,872 CPU cores, 4,096 x 5,022 x 16,640 mesh). TSUBAME 2.0: 2.000 PFlops (4,000 GPUs + 16,000 CPU cores, 4,096 x 6,480 x 13,000 mesh). Motivation: developing lightweight, high-strength materials by controlling microstructure, towards a low-carbon society. Peta-scale phase-field simulations can capture the multiple dendritic growth during solidification required for evaluating new materials. 2011 ACM Gordon Bell Prize, Special Achievements in Scalability and Time-to-Solution.
  • 44. Peta-scale stencil application: a large-scale LES wind simulation using the Lattice Boltzmann Method [Onodera, Aoki et al.] — weak scalability in single precision for a 10 km x 10 km area of metropolitan Tokyo (N = 192 x 256 x 256 per GPU; 10,080 x 10,240 x 512 overall on 4,032 GPUs). (Figure: performance [TFlops] vs. number of GPUs, TSUBAME 2.5 vs. 2.0, both with overlap.) TSUBAME 2.5: 1142 TFlops on 3,968 GPUs (288 GFlops/GPU), x1.93 over TSUBAME 2.0: 149 TFlops on 1,000 GPUs (149 GFlops/GPU). These peta-scale simulations were executed under the TSUBAME Grand Challenge Program, Category A, in fall 2012. An LES wind simulation of a 10 km × 10 km area at 1 m resolution had never been done before in the world; we achieved 1.14 PFLOPS using 3,968 GPUs on the TSUBAME 2.5 supercomputer.
  • 46. Application performance, TSUBAME2.0 vs. TSUBAME2.5 (boost ratio):
    Top500/Linpack (PFlops): 1.192 → 2.843 (x2.39)
    Green500/Linpack (GFlops/W): 0.958 → > 2.400 (> x2.50)
    Semi-definite programming / nonlinear optimization (PFlops): 1.019 → 1.713 (x1.68)
    Gordon Bell dendrite stencil (PFlops): 2.000 → 3.444 (x1.72)
    LBM LES whole-city airflow (PFlops): 0.600 → 1.142 (x1.90)
    Amber 12 pmemd, 4 nodes / 8 GPUs (nsec/day): 3.44 → 11.39 (x3.31)
    GHOSTM genome homology search (sec): 19,361 → 10,785 (x1.80)
    MEGADOC protein docking (vs. 1 CPU core): 37.11 → 83.49 (x2.25)
  • 47. TSUBAME evolution towards exascale and Extreme Big Data — (roadmap figure: TSUBAME2.5 at 5.7 PF with 250 TB of fast I/O at 300 GB/s (~30 PB/day); Graph500 No. 3 (2011) and other awards; TSUBAME3.0 in 2015H2 at 25–30 PF, with 5~10 PB of fast I/O at 1–10 TB/s across Phase 1 / Phase 2, > 100 million IOPS, ~1 ExaB/day.) Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
  • 48. DoE Exascale Parameters — x1000 power efficiency in 10 years (the "2010" column shows Jaguar / TSUBAME2.0):
    System peak: 2 PetaFlops → 100–200 PetaFlops (2015) → 1 ExaFlop (2020)
    Power: 6 MW / 1.3 MW → 15 MW → 20 MW
    System memory: 0.3 PB / 0.1 PB → 5 PB → 32–64 PB
    Node performance: 125 GF / 1.6 TF → 0.5 TF or 7 TF → 1 TF or 10 TF
    Node memory BW: 25 GB/s / 0.5 TB/s → 0.1 TB/s or 1 TB/s → 0.4 TB/s or 4 TB/s
    Node concurrency: 12 / O(1000) → O(100) or O(1000) → O(1000) or O(10000)
    System size (nodes): 18,700 / 1,442 → 5,000–50,000 → 100,000–1 million (towards a billion cores)
    Total node interconnect BW: 1.5 GB/s / 8 GB/s → 20 GB/s → 200 GB/s
    MTTI: O(days) → O(1 day) → O(1 day)
  • 49. Challenges of Exascale (FLOPS, bytes, … at 10^18!) — various physical limitations surface all at once:
    • # CPU cores: ~1 billion, at low power
    • # nodes: 100K~xM (c.f. total smartphones sold globally = 400 million; the K Computer ~100K nodes; Google ~1 million servers)
    • Memory: x00 PB~ExaB (c.f. total memory of the ~300 million PCs shipped globally in 2011 ~ 1 ExaB; BTW 2^64 ≈ 1.8x10^19 = 18 ExaB)
    • Storage: x ExaB (c.f. Google storage ~2 Exabytes: 200 million users x 7 GB+)
    All of this at 20 MW (50 GFlops/W), with reliability (MTTI = days), ease of programming (a billion cores?), and acceptable cost… in 2020?!
  • 50. Focused research towards TSUBAME3.0 and beyond, towards exascale:
    • Green computing: ultra power-efficient HPC
    • High-radix bisection networks — HW, topology, routing algorithms, placement…
    • Fault tolerance — group-based hierarchical checkpointing, fault prediction, hybrid algorithms
    • Scientific "extreme" big data — ultra-fast I/O, Hadoop acceleration, large graphs
    • New memory systems — pushing the envelope of low power vs. capacity vs. BW; exploiting the deep hierarchy with new algorithms to decrease bytes/flop
    • Post-petascale programming — OpenACC and other manycore programming substrates, task parallelism
    • Scalable algorithms for manycore — apps/system/HW co-design
  • 51. JST-CREST "Ultra Low Power (ULP)-HPC" Project, 2007–2012 — combines ultra multi-core (slow & parallel, ultra low power), SIMD-vector (GPGPU, etc.), ULP-HPC networks, and new memory devices (MRAM, PRAM, Flash, etc.) with auto-tuning for performance & power. ABCLibScript directives mark algorithm-selection regions that are auto-tuned before execution: for each target region (e.g. algorithm 1 vs. algorithm 2), a cost-definition function of the input variables — such as (2.0d0*CacheS*NB)/(3.0d0*NPrc) or (4.0d0*ChcheS*dlog(NB))/(2.0d0*NPrc) — gives a model estimate of the run time, which is then fused with measured run times in a Bayesian way. With the prior
  $$y_i \sim N(\mu_i, \sigma_i^2), \qquad \mu_i \mid \beta, \sigma_i^2 \sim N\!\left(x_i^T\beta,\; \sigma_i^2/\kappa_0\right), \qquad \sigma_i^2 \sim \mathrm{Inv}\text{-}\chi^2(\nu_0, \sigma_0^2),$$
  the posterior predictive distribution after n measurements is
  $$y_i \mid (y_{i1}, \ldots, y_{in}) \sim t_{\nu_n}\!\left(\mu_n,\; \sigma_n^2\,(1 + 1/\kappa_n)\right),$$
  $$\kappa_n = \kappa_0 + n, \quad \nu_n = \nu_0 + n, \quad \mu_n = \frac{\kappa_0\, x_i^T\beta + n\,\bar{y}_i}{\kappa_n},$$
  $$\nu_n \sigma_n^2 = \nu_0\sigma_0^2 + \sum_m (y_{im} - \bar{y}_i)^2 + \frac{\kappa_0\, n\,(\bar{y}_i - x_i^T\beta)^2}{\kappa_n}, \qquad \bar{y}_i = \frac{1}{n}\sum_m y_{im}.$$
  The tuner then chooses the operating point that optimizes the power-performance trade-off — x10 power efficiency from power-aware, optimizable applications, towards a x1000 improvement in 10 years (a small code sketch of the update follows below).
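A minimal sketch of the posterior-predictive update used to fuse a cost-model estimate with n measurements. All variable names, the prior parameters (kappa0, nu0, sigma0_2), and the example numbers are mine, chosen for illustration only.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Posterior-predictive parameters of the normal / inverse-chi-squared model:
// the prior mean comes from the cost model (x_i^T beta); measurements refine it.
struct Predictive {
    double mu;      // location of the Student-t predictive distribution
    double scale2;  // squared scale: sigma_n^2 * (1 + 1/kappa_n)
    double nu;      // degrees of freedom
};

Predictive update(double model_est,                       // x_i^T beta from the cost model
                  double kappa0, double nu0, double sigma0_2,
                  const std::vector<double>& y) {         // measured run times
    const double n = static_cast<double>(y.size());
    double ybar = 0.0;
    for (double v : y) ybar += v;
    ybar /= n;

    double ss = 0.0;                                      // sum of squared deviations
    for (double v : y) ss += (v - ybar) * (v - ybar);

    const double kappa_n = kappa0 + n;
    const double nu_n    = nu0 + n;
    const double mu_n    = (kappa0 * model_est + n * ybar) / kappa_n;
    const double nu_sig  = nu0 * sigma0_2 + ss
                         + kappa0 * n * (ybar - model_est) * (ybar - model_est) / kappa_n;
    const double sigma_n2 = nu_sig / nu_n;

    return {mu_n, sigma_n2 * (1.0 + 1.0 / kappa_n), nu_n};
}

int main() {
    // Hypothetical example: the cost model predicts 1.20 s; three measured runs.
    std::vector<double> runs = {1.05, 1.10, 1.08};
    Predictive p = update(1.20, /*kappa0=*/1.0, /*nu0=*/1.0, /*sigma0_2=*/0.04, runs);
    std::printf("predictive: mu=%.3f scale^2=%.4f nu=%.1f\n", p.mu, p.scale2, p.nu);
    return 0;
}
```

The region's algorithm (or clock/voltage setting) with the better predictive run time or energy would then be selected before launch.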
  • 52. Aggressive power saving in HPC — methodologies compared for enterprise/business clouds vs. HPC:
    Server consolidation: Good for clouds, NG for HPC
    DVFS (dynamic voltage/frequency scaling): Good for clouds, Poor for HPC
    New devices: Poor for clouds (cost & continuity), Good for HPC
    New HW & SW architectures: Poor for clouds (cost & continuity), Good for HPC
    Novel cooling: Limited for clouds (cost & continuity), Good for HPC (high thermal density)
  • 53. How do we achieve x1000? Process shrink (x100) × many-core GPU usage (x5) × DVFS & other low-power SW (x1.5) × efficient cooling (x1.4) ≈ x1000 !!! (ULP-HPC Project 2007–12; Ultra Green Supercomputing Project 2011–15.) The factors multiply out as shown below.
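The individual factors do indeed compound to roughly three orders of magnitude:

```latex
100 \times 5 \times 1.5 \times 1.4 = 1050 \approx 10^{3}
```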
  • 54. Statistical power modeling of GPUs [IEEE IGCC'10] — estimates GPU power consumption statistically with a linear regression model, $p = \sum_{i=1}^{n} \alpha_i c_i + \varepsilon$, using GPU performance counters $c_i$ as explanatory variables, validated against average power measured with a high-resolution power meter. High accuracy (avg. error 4.7%), and accurate even under DVFS; overtraining is prevented by ridge regression, with the optimal parameters determined by cross-fitting. A linear model shows sufficient accuracy. Future: model-based power optimization, with the possibility of optimizing exascale systems with O(10^8) processors. (A regression sketch follows below.)
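A minimal sketch of such a counter-based, ridge-regularized power model. The counter names, sample values, regularization strength, and the gradient-descent fitting procedure are all illustrative assumptions of mine; the paper's actual counters and solver may differ.

```cpp
#include <cstdio>
#include <vector>

// Fit power ~= sum_i alpha_i * counter_i with an L2 (ridge) penalty,
// using plain batch gradient descent on the regularized squared error.
std::vector<double> ridge_fit(const std::vector<std::vector<double>>& C,  // counters per sample
                              const std::vector<double>& power,           // measured watts
                              double lambda, double lr, int iters) {
    const std::size_t n = C.size(), d = C[0].size();
    std::vector<double> alpha(d, 0.0);
    for (int it = 0; it < iters; ++it) {
        std::vector<double> grad(d, 0.0);
        for (std::size_t s = 0; s < n; ++s) {
            double pred = 0.0;
            for (std::size_t i = 0; i < d; ++i) pred += alpha[i] * C[s][i];
            const double err = pred - power[s];
            for (std::size_t i = 0; i < d; ++i) grad[i] += 2.0 * err * C[s][i] / n;
        }
        for (std::size_t i = 0; i < d; ++i)
            alpha[i] -= lr * (grad[i] + 2.0 * lambda * alpha[i]);  // ridge term shrinks coefficients
    }
    return alpha;
}

int main() {
    // Hypothetical normalized counters per sample: {constant, sm_activity, dram_accesses}.
    std::vector<std::vector<double>> counters = {
        {1.0, 0.20, 0.10}, {1.0, 0.55, 0.30}, {1.0, 0.90, 0.70}, {1.0, 0.70, 0.20}};
    std::vector<double> watts = {95.0, 140.0, 200.0, 160.0};   // measured average power
    std::vector<double> a = ridge_fit(counters, watts, /*lambda=*/0.01, /*lr=*/0.1, /*iters=*/5000);
    std::printf("alpha = [%.1f, %.1f, %.1f] W\n", a[0], a[1], a[2]);
    return 0;
}
```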
  • 56. TSUBAME-KFC: ultra-green supercomputer testbed [2011–2015] — fluid submersion cooling + outdoor air cooling + high-density GPU supercomputing in a 20-foot container (16 m²). Compute nodes: NEC/SMC 1U server x 40, each with Intel Ivy Bridge 2.1 GHz 6-core x 2, NVIDIA Tesla K20X GPU x 4, 64 GB DDR3 memory, 120 GB SSD, and 4x FDR InfiniBand (56 Gbps). Total peak: 210 TFlops (DP), 630 TFlops (SP). Cooling path: processors at 80~90 °C ⇒ coolant oil (Spectrasyn8) at 35~45 °C in the GRC submersion rack ⇒ water at 25~35 °C via heat exchanger ⇒ outdoor air via the cooling tower. Targets: world's top power efficiency (> 3 GFlops/W), average PUE 1.05, lower component power, and field-testing the ULP-HPC results.
  • 57. TSUBAME-KFC, towards TSUBAME3.0 and beyond — shooting for #1 on the Nov. 2013 Green500!
  • 58. Machine — power; Linpack perf (PF); Linpack MFLOPS/W; total mem BW TB/s (STREAM); mem BW MB/s per W; factor (incl. cooling):
    Earth Simulator 1 — 10 MW; 0.036; 3.6; 160; 16; 13,400
    Tsubame1.0 (2006Q1) — 1.8 MW; 0.038; 21; 13; 7.2; 2,368
    ORNL Jaguar (XT5, 2009Q4) — ~9 MW; 1.76; 196; 432; 48; 256
    Tsubame2.0 (2010Q4) — 1.8 MW; 1.2; 667; 440; 244; 75
    K Computer (2011Q2) — ~16 MW; 10; 625; 3,300; 206; 80
    BlueGene/Q (2012Q1) — ~12 MW?; 17; ~1,400; 3,000; 250; ~35
    TSUBAME2.5 (2013Q3) — 1.4 MW; ~3; ~2,100; 802; 572; ~24
    Tsubame3.0 (2015Q4~2016Q1) — 1.5 MW; ~20; ~13,000; 6,000; 4,000; ~4
    EXA (2019~20) — 20 MW; 1,000; 50,000; 100K; 5,000; 1
    (Additional improvement factors noted on the slide: x31.6, x34, ~x20, ~x13.7.)
  • 59. Extreme Big Data (EBD): Next-Generation Big Data Infrastructure Technologies Towards Yottabytes/Year — Principal Investigator: Satoshi Matsuoka, Global Scientific Information and Computing Center, Tokyo Institute of Technology.
  • 60. The current "Big Data" is not really that big…
    • Typical "real" definition: "mining people's private data to make money"
    • Corporate data usually sit in data-warehouse silos of limited volume — gigabytes to terabytes, seldom petabytes
    • Processing involves simple O(n) algorithms, or ones that can be accelerated with DB-inherited indexing algorithms
    • Executed on re-purposed commodity "web" servers linked with 1 Gbps networks running Hadoop/HDFS
    • A vicious cycle of stagnation in innovation…
    • NEW: breaking down of silos ⇒ convergence of supercomputing with Extreme Big Data
  • 61. But "Extreme Big Data" will change everything
    • "Breaking down of silos" (Rajeeb Hazra, Intel VP of Technical Computing)
    • Already happening in science & engineering thanks to the Open Data movement
    • More complex analysis algorithms: O(n log n), O(m x n), …
    • Will become the NORM, for competitiveness reasons.
  • 62. We will have tons of unknown genes [slide courtesy Yutaka Akiyama @ Tokyo Tech]. Metagenome analysis: directly sequencing uncultured microbiomes obtained from a target environment and analyzing the sequence data — finding novel genes from unculturable microorganisms, and elucidating the composition of species/genes in an environment. Examples of microbiomes: gut microbiome, human body, soil, sea.
  • 63. Results from the Akiyama group @ Tokyo Tech — ultra-high-sensitivity "big data" metagenome sequence analysis of the human oral microbiome. Required > 1 million node-hours on the K computer; the world's most sensitive sequence analysis (based on an amino-acid similarity matrix); discovered at least three microbiome clusters with functional differences (integrating 422 experimental samples taken from 9 different oral sites). Throughput: 572.8 M reads/hour on 82,944 nodes (663,552 cores) of the K computer (2012). (Figure: metabolic pathway map; clusters for the inner side of the dental arch, the outer side of the dental arch, and dental plaque.)
  • 65. Extremely "big" graphs — large-scale graphs arise in many fields: the US road network (24 million vertices, 58 million edges); the Twitter follow-ship network (61.6 million vertices, 1.47 billion edges); the neuronal network of the Human Brain Project (89 billion vertices, 100 trillion edges); cyber-security (15 billion log entries/day). Goal: fast and scalable graph processing using HPC.
  • 66. Graph scale landscape — (scatter plot: number of vertices, log2(n), vs. number of edges, log2(m); markers for the Graph500 problem classes Toy, Mini, Small, Medium, Large, and Huge, the symbolic network of the Human Brain Project (~1 trillion edges), Twitter (tweets/day), and the USA road networks USA-road-d.NY.gr, USA-road-d.LKS.gr, USA-road-d.USA.gr; gridlines at 1 billion / 1 trillion nodes and edges.) Reference points: the K computer with 65,536 nodes reaches 5,524 GTEPS on Graph500, while an Android tablet (Tegra 3, 1.7 GHz, 1 GB RAM) reaches 0.15 GTEPS at 64.12 MTEPS/W.
  • 67. Towards continuous billion-scale social simulation with real-time streaming data (Toyotaro Suzumura, IBM / Tokyo Tech)
    Applications — target area: the planet (OpenStreetMap), 7 billion people
    Input data — road network (OpenStreetMap) for the planet: 300 GB (XML); trip data for 7 billion people: 10 KB (1 trip) x 7 billion = 70 TB; real-time streaming data (e.g. social sensors, physical data)
    Simulated output for 1 iteration — 700 TB
  • 68. Graph500 "Big Data" benchmark — BFS on a Kronecker graph with quadrant probabilities A: 0.57, B: 0.19, C: 0.19, D: 0.05 (a generator sketch follows below). "Graph 500 Takes Aim at a New Kind of HPC," November 15, 2010 — Richard Murphy (Sandia NL => Micron): "I expect that this ranking may at times look very different from the TOP500 list. Cloud architectures will almost certainly dominate a major chunk of the list." Reality: Top500 supercomputers dominate, with no cloud IDCs on the list at all; TSUBAME2.0 was #3 (Nov. 2011) and #4 (Jun. 2012).
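For illustration, a minimal sketch of a recursive Kronecker (R-MAT-style) edge generator using the A/B/C/D probabilities from the slide. The real Graph500 reference generator also permutes vertex labels, adds noise to the probabilities, and takes scale/edge-factor parameters; this sketch omits all of that.

```cpp
#include <cstdint>
#include <cstdio>
#include <random>
#include <utility>

// Draw one directed edge of a 2^scale-vertex Kronecker graph by recursively
// picking one of four adjacency-matrix quadrants with probabilities A, B, C, D.
std::pair<std::uint64_t, std::uint64_t> kronecker_edge(int scale, std::mt19937_64& rng) {
    const double A = 0.57, B = 0.19, C = 0.19;      // D = 1 - A - B - C = 0.05
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::uint64_t src = 0, dst = 0;
    for (int level = 0; level < scale; ++level) {
        const double r = uni(rng);
        std::uint64_t bit_src, bit_dst;
        if (r < A)              { bit_src = 0; bit_dst = 0; }   // quadrant A
        else if (r < A + B)     { bit_src = 0; bit_dst = 1; }   // quadrant B
        else if (r < A + B + C) { bit_src = 1; bit_dst = 0; }   // quadrant C
        else                    { bit_src = 1; bit_dst = 1; }   // quadrant D
        src = (src << 1) | bit_src;
        dst = (dst << 1) | bit_dst;
    }
    return {src, dst};
}

int main() {
    std::mt19937_64 rng(42);
    const int scale = 20;   // 2^20 vertices; Graph500 would generate 16 * 2^scale edges (edge factor 16)
    for (int e = 0; e < 5; ++e) {                    // print a few sample edges
        auto [u, v] = kronecker_edge(scale, rng);
        std::printf("%llu -> %llu\n", (unsigned long long)u, (unsigned long long)v);
    }
    return 0;
}
```

The skew of A over B/C/D is what gives the generated graph its heavy-tailed, "social-network-like" degree distribution.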
  • 69. Supercomputer vs. cloud datacenter networks — Tokyo Tech TSUBAME2.0 (#4 Top500, 2010): ~1,500 compute & storage nodes on a full-bisection, multi-rail optical network with advanced silicon photonics (40G on a single CMOS die, 1490 nm DFB, 100 km fiber); injection 80 Gbps/node, bisection 220 Tbps. A major northern-Japanese cloud datacenter (2013): 8 zones of ~700 nodes each (5,600 nodes total) behind Juniper EX4200 zone switches (Virtual Chassis, 2 per zone), Juniper EX8208 aggregation and Juniper MX480 routers to the Internet over 10 GbE / LACP; injection 1 Gbps/node, bisection 160 Gbps — roughly a x1000 difference in bisection!
  • 70. But what does "220 Tbps" mean? Global IP traffic, 2011–2016 (source: Cisco), in PB per month with the average bitrate in Tbps in parentheses:
    Fixed Internet: 23,288 (71.9) | 32,990 (101.8) | 40,587 (125.3) | 50,888 (157.1) | 64,349 (198.6) | 81,347 (251.1) — CAGR 2011–2016: 28%
    Managed IP: 6,849 (21.1) | 9,199 (28.4) | 11,846 (36.6) | 13,925 (43.0) | 16,085 (49.6) | 18,131 (56.0) — CAGR: 21%
    Mobile data: 597 (1.8) | 1,252 (3.9) | 2,379 (7.3) | 4,215 (13.0) | 6,896 (21.3) | 10,804 (33.3) — CAGR: 78%
    Total IP traffic: 30,734 (94.9) | 43,441 (134.1) | 54,812 (169.2) | 69,028 (213.0) | 87,331 (269.5) | 110,282 (340.4) — CAGR: 29%
  The TSUBAME2.0 network has TWICE the capacity of the global Internet, which is used by 2.1 billion people.
  • 71. "Convergence" at future extreme scale for computing and data (in clouds?) — HPC grows ~x1000 in 10 years (CAGR ~= 100%), while IDC grows ~x30 in 10 years (CAGR ~= 30–40%, with server unit sales flat, i.e. replacement demand). (Source: "Assessing trends over time in performance, costs, and energy use for servers," Intel, 2009.) The 10-year multipliers follow directly from the CAGRs, as shown below.
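Compounding the stated annual growth rates over a decade gives the quoted multipliers (the x30 corresponds to the upper end of the 30–40% range):

```latex
2^{10} = 1024 \approx \times 1000 \quad (\mathrm{CAGR} \approx 100\%),
\qquad
1.4^{10} \approx 29 \approx \times 30 \quad (\mathrm{CAGR} \approx 40\%)
```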
  • 72. What does this all mean?
    • "Leveraging of mainframe technologies in HPC has been dead for some time."
    • But will leveraging cloud/mobile be sufficient?
    • NO! They are already falling behind, and will stay perpetually behind — cloud CAGR ~30% vs. HPC ~100% (all the data support it); stagnation in network, storage, scaling, …
    • Rather, HPC will be the technology driver for future Big Data, for cloud/mobile to leverage — rather than repurposed standard servers.
  • 73. Future "Extreme Big Data" — NOT mining terabytes of silo data, but peta- to zettabytes of data in ultra-high-bandwidth streams, highly unstructured and irregular, with complex correlations between data from multiple sources. Extreme capacity, bandwidth, and compute are all required.
  • 74. Extreme Big Data is not just traditional HPC!!! [slide courtesy Alok Choudhary, Northwestern U] — analysis of required system properties. (Radar chart comparing Extreme-Scale Computing, Big Data Analytics, and a BDEC knowledge-discovery engine along the axes: processor speed, OPS, memory/ops, algorithmic variety, power-optimization opportunities, communication-pattern variability, communication latency tolerance, approximate computations, read performance, write performance, local persistent storage.)
  • 75. EBD research scheme — future non-silo Extreme Big Data apps (ultra-large-scale graphs and social infrastructures, large-scale metagenomics, massive sensors and data assimilation in weather prediction) are co-designed with the EBD system software, including the EBD object system (EBD Bag, Cartesian plane, KVS / EBD KVS, graph store). (Architecture figure: an exascale big data HPC node with a high-powered main CPU and low-power CPUs on a TSV interposer and PCB, 4~6 HBM channels at 2 Tbps, 1.5 TB/s DRAM & NVM bandwidth, NVM/Flash plus DRAM, 30 PB/s I/O bandwidth possible, towards 1 Yottabyte/year.) The convergent architecture (phases 1~4) combines large-capacity NVM with high-bisection networks, bridging cloud IDCs (very low BW & efficiency) and compute/batch-oriented supercomputers.
  • 77. Preliminary I/O performance evaluation on GPU and NVRAM — how should local storage be designed for next-generation supercomputers? We designed a local I/O prototype using 16 mSATA SSDs on a RAID card attached to the motherboard: capacity 4 TB, read bandwidth 8 GB/s. (Charts: I/O performance of multiple mSATA SSDs — bandwidth [MB/s] vs. number of SSDs for raw devices and RAID0 with 4 KB / 64 KB / 1 MB stripes, reaching ~7.39 GB/s from 16 mSATA SSDs with RAID0 enabled; and I/O performance from the GPU to multiple mSATA SSDs — throughput [GB/s] vs. matrix size [GB], reaching ~3.06 GB/s from 8 mSATA SSDs to the GPU.) A quick consistency check of the aggregate bandwidth follows below.
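The measured aggregate is consistent with the per-device bandwidth implied by the prototype (the ~0.5 GB/s per-SSD figure is derived from the stated 8 GB/s for 16 devices, not given directly on the slide):

```latex
\frac{8~\mathrm{GB/s}}{16} \approx 0.5~\mathrm{GB/s}\ \text{per SSD},
\qquad
16 \times 0.5~\mathrm{GB/s} = 8~\mathrm{GB/s} \;\gtrsim\; 7.39~\mathrm{GB/s}\ \text{measured with RAID0}
```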
  • 78. Algorithm kernels on EBD: large-scale BFS using NVRAM.
    1. Introduction — large-scale graph processing appears in many domains while DRAM demands keep increasing; flash devices are spreading (pros: price per bit, energy consumption; cons: latency, throughput). Using NVRAM for large-scale graph processing may be possible with minimal performance degradation.
    2. Hybrid BFS — switch between the top-down and bottom-up approaches based on the number of frontier vertices n_frontier, the total number of vertices n_all, and parameters α and β (see the sketch below).
    3. Proposal — (1) offload the data touched by small accesses to NVRAM; (2) run BFS reading that data from NVRAM. (C.f. Pearce et al.: 13x larger datasets at 52 MTEPS with 1 TB DRAM and 12 TB NVRAM.)
    4. Evaluation — (chart: GTEPS vs. switching parameter α for a 130M-vertex, 2.1G-edge graph) DRAM only (β = 10α): 5.2 GTEPS; DRAM + SSD (β = 0.1α): 2.8 GTEPS (47.1% down). We could halve the DRAM size at the cost of 47.1% performance degradation, and we are working on multiplexed I/O to improve NVRAM I/O performance.
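A minimal sketch of the direction-switching idea in hybrid (direction-optimizing) BFS: run top-down while the frontier is small and switch to bottom-up once it grows past a threshold controlled by a parameter like α. The simple vertex-count threshold, the adjacency-list graph, and the parameter defaults here are illustrative assumptions, not the actual data structures or heuristic of the work on the slide.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hybrid BFS: top-down expands the frontier outward; bottom-up scans the
// unvisited vertices and checks whether any neighbor lies in the frontier.
std::vector<std::int64_t> hybrid_bfs(const std::vector<std::vector<int>>& adj, int root,
                                     double alpha = 16.0) {
    const int n = static_cast<int>(adj.size());
    std::vector<std::int64_t> parent(n, -1);
    std::vector<char> in_frontier(n, 0), in_next(n, 0);
    std::vector<int> frontier = {root};
    parent[root] = root;
    in_frontier[root] = 1;

    while (!frontier.empty()) {
        // Simplified heuristic: bottom-up pays off once the frontier covers
        // a sizable fraction of the graph (roughly n / alpha vertices here).
        const bool bottom_up = frontier.size() > static_cast<std::size_t>(n / alpha);
        std::vector<int> next;
        if (!bottom_up) {                              // top-down step
            for (int u : frontier)
                for (int v : adj[u])
                    if (parent[v] < 0) { parent[v] = u; next.push_back(v); in_next[v] = 1; }
        } else {                                       // bottom-up step
            for (int v = 0; v < n; ++v) {
                if (parent[v] >= 0) continue;
                for (int u : adj[v])
                    if (in_frontier[u]) { parent[v] = u; next.push_back(v); in_next[v] = 1; break; }
            }
        }
        for (int u : frontier) in_frontier[u] = 0;     // clear old frontier marks
        frontier.swap(next);
        in_frontier.swap(in_next);
        // (A full implementation would switch back to top-down once the
        //  frontier shrinks again, which is where the beta parameter comes in.)
    }
    return parent;
}

int main() {
    // Tiny illustrative undirected graph (edges stored in both directions).
    std::vector<std::vector<int>> adj = {{1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3, 5}, {4}};
    std::vector<std::int64_t> parent = hybrid_bfs(adj, 0, /*alpha=*/4.0);
    for (std::size_t v = 0; v < parent.size(); ++v)
        std::printf("parent[%zu] = %lld\n", v, (long long)parent[v]);
    return 0;
}
```

In the NVRAM setting, the rarely touched portions of the adjacency data (the "small accesses") are the natural candidates to offload to flash while the hot frontier structures stay in DRAM.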
  • 79. High-performance sorting — fast algorithms: distribution-based vs. comparison-based. Comparison of keys: classical N log(N) sorts (quick, merge, etc.) and bitonic sort; good for variable-length / short / long keys (e.g. alphabetic keys such as apple, apricot, banana, kiwi). Distribution (integer) sorts: MSD radix sort — doesn't have to examine all characters, relevant to computational genomics (A, C, G, T); LSD radix sort (as in Thrust) — high efficiency on small fixed-length keys, and GPUs are good at counting. Map-Reduce / Hadoop is easy to use but not that efficient. An efficient implementation must balance I/O and computation, scale well, and suit GPU nodes; the best hybrid approach is yet to be found. (An LSD radix sort sketch follows below.)
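For reference, a minimal sketch of an LSD (least-significant-digit) radix sort on fixed-width integer keys — the counting-based distribution pass is exactly the kind of operation GPUs and libraries like Thrust do well, though this sketch is plain sequential CPU code.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// LSD radix sort on 32-bit unsigned keys, 8 bits per pass.
// Each pass is a stable counting sort on one digit, least significant first.
void lsd_radix_sort(std::vector<std::uint32_t>& keys) {
    const int RADIX = 256;
    std::vector<std::uint32_t> buf(keys.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::vector<std::size_t> count(RADIX + 1, 0);
        for (std::uint32_t k : keys) ++count[((k >> shift) & 0xFF) + 1];       // histogram
        for (int d = 0; d < RADIX; ++d) count[d + 1] += count[d];              // prefix sums -> offsets
        for (std::uint32_t k : keys) buf[count[(k >> shift) & 0xFF]++] = k;    // stable scatter
        keys.swap(buf);
    }
}

int main() {
    std::vector<std::uint32_t> keys = {170, 45, 75, 90, 802, 24, 2, 66};
    lsd_radix_sort(keys);
    for (std::uint32_t k : keys) std::printf("%u ", k);
    std::printf("\n");
    return 0;
}
```

The histogram/prefix-sum/scatter structure maps naturally onto GPU primitives, which is why fixed-length integer keys sort so efficiently there.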
  • 80. Twitter network (application of the Graph500 benchmark) — the 2009 follow-ship network, 41 million vertices and 2.47 billion edges (directed (i, j) edges between user i and user j). Our NUMA-optimized BFS on a 4-way Xeon system: 69 ms per BFS ⇒ 21.28 GTEPS. BFS frontier sizes from source user 21,804,357 illustrate the six degrees of separation:
    Level 0: 1 (0.00%, cum. 0.00%)
    Level 1: 7 (0.00%, cum. 0.00%)
    Level 2: 6,188 (0.01%, cum. 0.01%)
    Level 3: 510,515 (1.23%, cum. 1.24%)
    Level 4: 29,526,508 (70.89%, cum. 72.13%)
    Level 5: 11,314,238 (27.16%, cum. 99.29%)
    Level 6: 282,456 (0.68%, cum. 99.97%)
    Level 7: 11,536 (0.03%, cum. 100.00%)
    Level 8: 673 (0.00%, cum. 100.00%)
    Level 9: 68 (0.00%, cum. 100.00%)
    Level 10: 19 (0.00%, cum. 100.00%)
    Level 11: 10 (0.00%, cum. 100.00%)
    Level 12: 5 (0.00%, cum. 100.00%)
    Level 13: 2 (0.00%, cum. 100.00%)
    Level 14: 2 (0.00%, cum. 100.00%)
    Level 15: 2 (0.00%, cum. 100.00%)
    Total: 41,652,230 (100.00%)
  • 81. 100,000-fold: EBD "convergent" system overview — (project-structure figure) Task 1 and Task 2: the EBD distributed object store (EBD Bag, Cartesian plane, KVS / EBD KVS) and graph store on 100,000 NVM extreme compute-and-data nodes with ultra-parallel & low-power I/O; Task 3: the EBD programming system; Task 4: EBD "converged" real-time resource scheduling; Tasks 5-1~5-3: EBD application co-design and validation (large-scale graphs and social-infrastructure apps, large-scale genomic correlation, and data assimilation for large-scale sensors and exascale atmospherics); Task 6: EBD performance modeling & evaluation. The EBD "convergent" supercomputer evolves from TSUBAME 2.0/2.5 to TSUBAME 3.0 — ~10 TB/s ⇒ ~100 TB/s ⇒ ~10 PB/s — with ultra-high-bandwidth, low-latency NVM and networks, processor-in-memory, and 3D stacking.
  • 82. Summary
    • TSUBAME1.0 -> 2.0 -> 2.5 -> 3.0 -> … — TSUBAME2.5 is No.1 in Japan at 17 Petaflops SFP, and a template for future supercomputers and IDC machines
    • TSUBAME3.0 in early 2016 — new supercomputing leadership: tremendous power efficiency, extreme big data, extremely high reliability
    • Lots of background R&D for TSUBAME3.0 and towards exascale — green computing (ULP-HPC & TSUBAME-KFC); Extreme Big Data, the convergence of HPC and IDC!; exascale resilience; programming with millions of cores; …
    • Please stay tuned! (Your continued support is appreciated.)