Rakuten Technology Conference 2013
"TSUBAME2.5 to 3.0 and Convergence with Extreme Big Data"
Satoshi Matsuoka
Professor
Global Scientific Information and Computing (GSIC) Center
Tokyo Institute of Technology
Fellow, Association for Computing Machinery (ACM)
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Big Data
1. TSUBAME2.5 to 3.0 and Convergence with Extreme Big Data
Satoshi Matsuoka
Professor
Global Scientific Information and Computing (GSIC) Center
Tokyo Institute of Technology
Fellow, Association for Computing Machinery (ACM)
Rakuten Technology Conference 2013
2013/10/26
Tokyo, Japan
2. Supercomputers from the Past
Fast, Big, Special, Inefficient, Evil device to conquer the world…
3. Let us go back to the mid '70s
Birth of "microcomputers" and arrival of commodity computing (start of my career)
• Commodity 8-bit CPUs…
  - Intel 4004/8008/8080/8085, Zilog Z-80, Motorola 6800, MOS Tech. 6502, …
• Led to hobbyist computing…
  - Evaluation boards: Intel SDK-80, Motorola MEK6800D2, MOS Tech. KIM-1, (in Japan) NEC TK-80, Fujitsu Lkit-8, …
  - System kits: MITS Altair 8800/680b, IMSAI 8080, Proc. Tech. SOL-20, SWTPC 6800, …
• And led to early personal computers
  - Commodore PET, Tandy TRS-80, Apple II
  - (in Japan): Hitachi Basic Master, NEC CompoBS / PC8001, Fujitsu FM-8, …
4. Supercomputing vs. Personal Computing in the late 1970s
• Hitachi Basic Master (1978)
  - "The first PC in Japan"
  - Motorola 6802, 1 MHz, 16KB ROM, 16KB RAM
  - Linpack in BASIC: approx. 70-80 FLOPS (1/1,000,000 of a Cray-1)
• We got "simulation" done (in assembly language)
  - Nintendo NES (1982): MOS Technology 6502, 1 MHz (same as Apple II)
  - "Pinball" by Matsuoka & Iwata (now CEO of Nintendo)
    • Realtime dynamics + collision + lots of shortcuts
    • Average ~a few KFLOPS
• Cf. Cray-1 (1976) running Linpack: 80-90 MFlops (est.)
5. Then things got accelerated around the mid '80s to mid '90s
(rapid commoditization towards what we use now)
• PC CPUs: Intel 8086/286/386/486/Pentium (superscalar & fast-FP x86), Motorola 68000/020/030/040, … to Xeons, GPUs, Xeon Phis
  - Cf. RISCs: SPARC, MIPS, PA-RISC, IBM Power, DEC Alpha, …
• Storage evolution: cassettes and floppies to HDDs, optical disks to Flash
• Network evolution: RS-232C to Ethernet, now to FDR InfiniBand
• PC (incl. I/O): IBM PC "clones" and Macintoshes: ISA to VLB to PCIe
• Software evolution: CP/M to MS-DOS to Windows, Linux, …
• WAN evolution: RS-232 + modem + BBS to modem + Internet to ISDN/ADSL/FTTH broadband, DWDM backbone, LTE, …
• Internet evolution: email + ftp to Web, Java, Ruby, …
• Then clusters, Grid/Clouds, 3-D gaming, and the Top500 all started in the mid '90s(!), and commoditized supercomputing
6. Modern Day Supercomputers
• Now supercomputers "look like" IDC servers
• High-end COTS dominates
• Linux-based machines with a standard + HPC OSS software stack
8. Top Supercomputers vs. Global IDC
K Computer (#1 2011-12), Riken AICS
- Fujitsu SPARC64 VIIIfx "Venus" CPU
- 88,000 nodes, 800,000 CPU cores
- ~11 Petaflops (10^16)
- 1.4 Petabytes memory, 13 MW power
- 864 racks, 3,000 m²
Tianhe-2 (#1 2013), Guangzhou, China
- 48,000 KNC Xeon Phi + 36,000 Ivy Bridge Xeon
- 18,000 nodes, >3 million CPU cores
- 54 Petaflops (10^16)
- 0.8 Petabytes memory, 20 MW power
- ??? racks, ??? m²
IBM BlueGene/Q "Sequoia" (#1 2012), Lawrence Livermore National Lab
- IBM PowerPC System-on-Chip
- 98,000 nodes, 1.57 million cores
- ~20 Petaflops
- 1.6 Petabytes, 8 MW, 96 racks
Cf. Amazon ~= 450,000 nodes, ~3 million cores
DARPA study: a 2020 Exaflop machine (10^18) needs 100 million ~ 1 billion cores
9. Scalability and Massive Parallelism
• More nodes & cores => massive increase in parallelism
  - Faster, "bigger" simulation
  - Qualitative difference
[Figure: performance vs. CPU cores (~= parallelism). Ideal linear scaling is difficult to achieve; deviations come from limitations in scaling on one side, and limitations in power, cost, and reliability on the other.]
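The scaling limits sketched in the figure can be illustrated with Amdahl's law: even a tiny serial fraction caps the achievable speedup no matter how many cores are added. A minimal sketch; the 1% serial fraction is an illustrative assumption, not a TSUBAME measurement:

```python
# Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n),
# where s is the serial (non-parallelizable) fraction of the work.
def amdahl_speedup(serial_fraction, cores):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

s = 0.01  # assumed 1% serial fraction -- illustrative value only
for n in (1, 100, 10_000, 1_000_000):
    print(f"{n:>9} cores -> speedup {amdahl_speedup(s, n):7.1f}")
# The speedup saturates near 1/s = 100: adding cores alone cannot
# overcome the serial fraction ("ideal linear scaling difficult to achieve").
```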
11. 2006: TSUBAME1.0 as No.1 in Japan
Total 85 TeraFlops, #7 on the Top500, June 2006
> all university centers COMBINED (45 TeraFlops)
Cf. Earth Simulator: 40 TeraFlops, #1 2002~2004
12. TSUBAME2.0, Nov. 1, 2010
"The Greenest Production Supercomputer in the World"
[New-development system diagram; recoverable figures: 32 nm / 40 nm processes; memory BW of >400 GB/s, >1.6 TB/s, >12 TB/s, and >600 TB/s at successive levels of the hierarchy; 80 Gbps network BW and ~1 kW max per node; 35 kW max per rack; 220 Tbps network bisection BW and 1.4 MW max for the full system]
18. 2010: TSUBAME2.0 as No.1 in Japan
Total 2.4 Petaflops, #4 on the Top500, Nov. 2010
> all other Japanese centers on the Top500 COMBINED (2.3 PetaFlops)
19. TSUBAME Wins Awards…
"Greenest Production Supercomputer in the World", the Green 500, Nov. 2010 and June 2011 (#4 Top500 Nov. 2010)
3 times more power efficient than a laptop!
20. TSUBAME Wins Awards…
ACM Gordon Bell Prize 2011, Special Achievements in Scalability and Time-to-Solution
2.0 Petaflops dendrite simulation: "Peta-Scale Phase-Field Simulation for Dendritic Solidification on the TSUBAME 2.0 Supercomputer"
21. TSUBAME Wins Awards…
Commendation for Science & Technology by the Minister of Education (文部科学大臣表彰), 2012
Prize for Science & Technology, Development Category:
"Development of Greenest Production Peta-scale Supercomputer"
Satoshi Matsuoka, Toshio Endo, Takayuki Aoki
22. Precise Blood-Flow Simulation of an Artery on TSUBAME2.0
(Bernaschi et al., IAC-CNR, Italy)
Personal CT scan + simulation => accurate diagnostics of cardiac illness
5 billion red blood cells + 10 billion degrees of freedom
23. MUPHY: Multiphysics Simulation of Blood Flow
(Melchionna, Bernaschi et al.)
- Combined Lattice-Boltzmann (LB) simulation for the plasma and Molecular Dynamics (MD) for the red blood cells
- Realistic geometry (from CAT scan); multiphysics simulation with the MUPHY software
- Fluid (blood plasma): Lattice Boltzmann, coupled with body (red blood cells): extended MD, the RBCs represented as ellipsoidal particles
- The irregular mesh is partitioned using the PT-SCOTCH tool, considering the cutoff distance
- Two levels of parallelism: CUDA (on GPU) + MPI
- 1 billion mesh nodes for the LB component; 100 million RBCs on 4000 GPUs: 0.6 Petaflops
- ACM Gordon Bell Prize 2011 Honorable Mention
24. Lattice-Boltzmann LES with a Coherent-Structure SGS Model
[Onodera & Aoki 2013]
- Coherent-structure Smagorinsky model: the model parameter is determined locally from the second invariant Q of the velocity-gradient tensor and the energy dissipation ε
- Turbulent flow around complex objects
- Large-scale parallel computation
Copyright © Global Scientific Information and Computing Center, Tokyo Institute of Technology
25. Computational Area: Entire Downtown Tokyo
- Major part of Tokyo, 10 km × 10 km, including Shinjuku-ku, Chiyoda-ku, Minato-ku, Meguro-ku, Chuo-ku; landmarks: Shinjuku, Shibuya, Shinagawa
- Building data: Pasco Co. Ltd. TDM 3D; map ©2012 Google, ZENRIN
- Achieved 0.592 Petaflops using over 4000 GPUs (15% efficiency)
26. Copyright © Takayuki Aoki / Global Scientific Information and Computing Center, Tokyo Institute of Technology
27. Area Around the Metropolitan Government Building
Flow profile at 25 m height above the ground (wind over a 640 m × 960 m area)
Map data ©2012 Google, ZENRIN
31. CFD Analysis over a Car Body
Calculation conditions:
- Number of grid points: 3,623,878,656 (3,072 × 1,536 × 768)
- Grid resolution: 4.2 mm (13 m × 6.5 m × 3.25 m domain)
- Number of GPUs: 288 (96 nodes)
- Vehicle speed: 60 km/h
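As a sanity check on the conditions above, the grid-point count and cell resolution follow directly from the stated grid and domain dimensions (a quick arithmetic sketch, not part of the original simulation code):

```python
# Verify the grid arithmetic for the car-body CFD case.
nx, ny, nz = 3072, 1536, 768     # grid dimensions
domain_x_m = 13.0                # domain length along x, in metres

total_points = nx * ny * nz
resolution_mm = domain_x_m / nx * 1000.0

print(total_points)              # 3,623,878,656 grid points
print(round(resolution_mm, 1))   # ~4.2 mm per cell
```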
32. LBM: DrivAer: BMW-Audi
Lehrstuhl für Aerodynamik und Strömungsmechanik (Chair of Aerodynamics and Fluid Mechanics), Technische Universität München
3,000 × 1,500 × 1,500 grid, Re = 1,000,000
39. Towards TSUBAME 3.0
Interim upgrade TSUBAME2.0 to 2.5 (early fall 2013)
• Upgrade TSUBAME2.0's GPUs: NVIDIA Fermi M2050 to Kepler K20X (3 per compute node × 1408 nodes = 4224 GPUs)
• SFP/DFP peak from 4.8 PF / 2.4 PF => 17 PF / 5.7 PF (cf. the K Computer: 11.2 / 11.2)
• Acceleration of important apps: considerable improvement
• Significant capacity improvement at low cost and without power increase (summer 2013)
• TSUBAME3.0: 2H2015
40. TSUBAME2.0 => 2.5 Thin Node Upgrade
Thin node: HP SL390G7 (developed for TSUBAME2.0, productized as the HP ProLiant SL390s, modified for TSUBAME2.5)
• GPU: NVIDIA Kepler K20X × 3 (3950/1310 GFlops SFP/DFP, 6 GB memory per GPU), replacing the NVIDIA Fermi M2050 (1039/515 GFlops)
• CPU: Intel Westmere-EP 2.93 GHz × 2
• Multi I/O chips, 72 PCIe lanes (16 × 4 + 4 × 2): 3 GPUs + 2 IB QDR
• Memory: 54 or 96 GB DDR3-1333; SSD: 60 GB × 2 or 120 GB × 2
• Network: InfiniBand QDR × 2 (80 Gbps)
• Peak perf. 4.08 TFlops; ~800 GB/s memory BW; ~1 kW max
41. 2013: TSUBAME2.5 No.1 in Japan in Single-Precision FP, 17 Petaflops
TSUBAME2.5: total 17.1 Petaflops SFP, 5.76 Petaflops DFP
vs. all university centers COMBINED: 9 Petaflops SFP
Cf. K Computer: 11.4 Petaflops SFP/DFP
42. TSUBAME2.0 vs. TSUBAME2.5: Thin Node × 1408 Units

Node machine: HP ProLiant SL390s (no change)
CPU: Intel Xeon X5670 (6-core 2.93 GHz, Westmere) × 2 (no change)

GPU:
- 2.0: NVIDIA Tesla M2050 × 3: 448 CUDA cores (Fermi); SFP 1.03 TFlops, DFP 0.515 TFlops; 3 GiB GDDR5 memory; 150 GB/s peak, ~90 GB/s STREAM memory BW
- 2.5: NVIDIA Tesla K20X × 3: 2688 CUDA cores (Kepler); SFP 3.95 TFlops, DFP 1.31 TFlops; 6 GiB GDDR5 memory; 250 GB/s peak, ~180 GB/s STREAM memory BW

Node performance (incl. CPU turbo boost):
- 2.0: SFP 3.40 TFlops, DFP 1.70 TFlops; ~500 GB/s peak, ~300 GB/s STREAM memory BW
- 2.5: SFP 12.2 TFlops, DFP 4.08 TFlops; ~800 GB/s peak, ~570 GB/s STREAM memory BW

Total system performance:
- 2.0: SFP 4.80 PFlops, DFP 2.40 PFlops; peak ~0.70 PB/s, STREAM ~0.440 PB/s memory BW
- 2.5: SFP 17.1 PFlops (×3.6), DFP 5.76 PFlops (×2.4); peak ~1.16 PB/s, STREAM ~0.804 PB/s memory BW (×1.8)
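The system totals are simply the per-node figures scaled by the 1408 thin nodes; a quick check using the quoted per-node peaks:

```python
# Aggregate node-level peaks to system-level peaks (values from the slide).
nodes = 1408
sfp_node = {"2.0": 3.40, "2.5": 12.2}   # TFlops SFP per node
dfp_node = {"2.0": 1.70, "2.5": 4.08}   # TFlops DFP per node

for gen in ("2.0", "2.5"):
    sfp_pf = nodes * sfp_node[gen] / 1000   # TFlops -> PFlops
    dfp_pf = nodes * dfp_node[gen] / 1000
    print(f"TSUBAME{gen}: {sfp_pf:.2f} PF SFP, {dfp_pf:.2f} PF DFP")
# 1408 x 12.2 TF ~= 17.2 PF SFP and 1408 x 4.08 TF ~= 5.74 PF DFP,
# matching the quoted 17.1 / 5.76 PFlops to within rounding.
```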
43. Phase-Field Simulation for Dendritic Solidification
[Shimokawabe, Aoki et al.]
Weak scaling on TSUBAME (single precision); mesh size per 1 GPU + 4 CPU cores: 4096 × 162 × 130
- TSUBAME 2.0: 2.000 PFlops (4,000 GPUs + 16,000 CPU cores), 4,096 × 6,480 × 13,000 mesh
- TSUBAME 2.5: 3.444 PFlops (3,968 GPUs + 15,872 CPU cores), 4,096 × 5,022 × 16,640 mesh
• Peta-scale phase-field simulations can simulate the multiple dendritic growth during solidification required for the evaluation of new materials: developing lightweight strengthening materials by controlling microstructure, towards a low-carbon society
• 2011 ACM Gordon Bell Prize, Special Achievements in Scalability and Time-to-Solution
44. Peta-Scale Stencil Application: A Large-Scale LES Wind Simulation Using the Lattice Boltzmann Method
[Onodera, Aoki et al.]
Weak scalability in single precision: large-scale wind simulation for a 10 km × 10 km area of metropolitan Tokyo (N = 192 × 256 × 256; full domain 10,080 × 10,240 × 512 on 4,032 GPUs)
- TSUBAME 2.0 (overlap): 149 TFlops on 1000 GPUs (149 GFlops/GPU)
- TSUBAME 2.5 (overlap): 1142 TFlops on 3968 GPUs (288 GFlops/GPU), x1.93
• An LES wind simulation of a 10 km × 10 km area at 1 m resolution had never been done before in the world; we achieved 1.14 PFLOPS using 3968 GPUs on the TSUBAME 2.5 supercomputer
• The above peta-scale simulations were executed under the TSUBAME Grand Challenge Program, Category A, in fall 2012
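The x1.93 figure is just the ratio of per-GPU throughput between the two runs; a small check using the quoted numbers:

```python
# Per-GPU throughput on the LES wind code, from the quoted figures.
tflops_20, gpus_20 = 149.0, 1000    # TSUBAME 2.0 run
tflops_25, gpus_25 = 1142.0, 3968   # TSUBAME 2.5 run

per_gpu_20 = tflops_20 / gpus_20 * 1000   # -> GFlops per GPU
per_gpu_25 = tflops_25 / gpus_25 * 1000

print(round(per_gpu_20), round(per_gpu_25), round(per_gpu_25 / per_gpu_20, 2))
# 149 vs ~288 GFlops/GPU: the Kepler upgrade nearly doubles per-GPU
# application throughput, which is the x1.93 on the slide.
```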
47. TSUBAME Evolution: Towards Exascale and Extreme Big Data
[Roadmap figure; recoverable figures: TSUBAME2.5 at 5.7 PF, with 250 TB of fast I/O at 300 GB/s and ~30 PB/day; awards incl. Graph 500 No. 3 (2011); TSUBAME3.0 (2015H2) at 25~30 PF, with fast I/O of 1 TB/s (Phase 1) to 10 TB/s (Phase 2), 5~10 PB, >100 million IOPS, ~1 ExaB/day]
48. DoE Exascale Parameters: x1000 power efficiency in 10 years
System attributes, "2010" (Jaguar / TSUBAME2.0) -> "2015" -> "2020":
- System peak: 2 PetaFlops -> 100-200 PetaFlops -> 1 ExaFlop
- Power: 6 MW (Jaguar) / 1.3 MW (TSUBAME) -> 15 MW -> 20 MW
- System memory: 0.3 PB / 0.1 PB -> 5 PB -> 32-64 PB
- Node performance: 125 GF / 1.6 TF -> 0.5 TF or 7 TF -> 1 TF or 10 TF
- Node memory BW: 25 GB/s / 0.5 TB/s -> 0.1 TB/s or 1 TB/s -> 0.4 TB/s or 4 TB/s
- Node concurrency: 12 / O(1000) -> O(100) or O(1000) -> O(1000) or O(10000)
- Number of nodes: 18,700 / 1,442 -> 50,000 or 5,000 -> 1 million or 100,000 (a billion cores in total)
- Total node interconnect BW: 1.5 GB/s / 8 GB/s -> 20 GB/s -> 200 GB/s
- MTTI: O(1 day) -> O(1 day) -> O(days)
50. Focused Research Towards TSUBAME3.0 and Beyond, Towards Exa
• Green computing: ultra power-efficient HPC
• High-radix bisection networks: HW, topology, routing algorithms, placement…
• Fault tolerance: group-based hierarchical checkpointing, fault prediction, hybrid algorithms
• Scientific "extreme" big data: ultra-fast I/O, Hadoop acceleration, large graphs
• New memory systems: pushing the envelope of low power vs. capacity vs. BW; exploiting the deep hierarchy with new algorithms to decrease Bytes/Flop
• Post-petascale programming: OpenACC and other many-core programming substrates, task parallelism
• Scalable algorithms for many-core: apps/system/HW co-design
51. JST-CREST "Ultra Low Power (ULP)-HPC" Project, 2007-2012
• Power optimization using novel components in HPC: ultra multi-core (slow & parallel, & ULP); ULP-HPC SIMD-vector (GPGPU, etc.); ULP-HPC networks; new memory devices (MRAM, PRAM, Flash, etc.)
• Auto-tuning for performance & power: low-power, high-performance models; power-aware and optimizable applications; x10 power efficiency towards a x1000 improvement in 10 years
• ABCLibScript: algorithm selection via run/install-time auto-tuning directives, with Bayesian fusion of performance models and measured data, e.g.:

  !ABCLib$ static select region start
  !ABCLib$   parameter (in CacheS, in NB, in NPrc)
  !ABCLib$   select sub region start
  !ABCLib$     according estimated
  !ABCLib$     (2.0d0*CacheS*NB)/(3.0d0*NPrc)
  [target region 1 (algorithm 1)]
  !ABCLib$   select sub region end
  !ABCLib$   select sub region start
  !ABCLib$     according estimated
  !ABCLib$     (4.0d0*CacheS*dlog(NB))/(2.0d0*NPrc)
  [target region 2 (algorithm 2)]
  !ABCLib$   select sub region end
  !ABCLib$ static select region end

  The cost-definition functions use the input variables (CacheS, NB, NPrc). Execution time is estimated from a Bayesian model with prior distributions over measured timings: y_i ~ N(mu_i, sigma_i^2), mu_i | beta, sigma_i^2 ~ N(x_i^T beta, sigma_i^2 / kappa_0), sigma_i^2 ~ Inv-chi^2(nu_0, sigma_0^2). After n measurements the posterior predictive distribution is y_i | (y_i1, …, y_in) ~ t_{nu_n}(mu_in, sigma_in^2 (kappa_n + 1)/kappa_n), where nu_n = nu_0 + n, kappa_n = kappa_0 + n, mu_n = (kappa_0 x_i^T beta + n ybar_i)/kappa_n, nu_n sigma_n^2 = nu_0 sigma_0^2 + sum_m (y_im - ybar_i)^2 + kappa_0 n (ybar_i - x_i^T beta)^2 / kappa_n, and ybar_i = (1/n) sum_m y_im.
52. Aggressive Power Saving in HPC
Methodologies, Enterprise/Business Clouds vs. HPC:
- Server consolidation: Good / NG!
- DVFS (dynamic voltage/frequency scaling): Good / Poor
- New devices: Poor (cost & continuity) / Good
- New HW & SW architecture: Poor (cost & continuity) / Good
- Novel cooling: Limited (cost & continuity) / Good (high thermal density)
53. How Do We Achieve x1000?
Process shrink (x100)
  × many-core GPU usage (x5)
  × DVFS & other low-power SW (x1.5)
  × efficient cooling (x1.4)
  = x1000 !!!
(ULP-HPC Project 2007-12; Ultra Green Supercomputing Project 2011-15)
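Multiplying out the four factors shows how they compound to roughly three orders of magnitude:

```python
# Compound the four power-efficiency factors from the slide.
factors = {
    "process shrink":            100,
    "many-core GPU usage":       5,
    "DVFS & other low-power SW": 1.5,
    "efficient cooling":         1.4,
}

total = 1.0
for name, f in factors.items():
    total *= f

print(round(total))  # -> 1050, i.e. roughly the x1000 target
```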
59. Extreme Big Data (EBD)
Next-Generation Big Data Infrastructure Technologies Towards Yottabyte/Year
Principal Investigator: Satoshi Matsuoka
Global Scientific Information and Computing Center, Tokyo Institute of Technology
60. The current "Big Data" are not really that Big…
• Typical "real" definition: "mining people's privacy data to make money"
• Corporate data usually sit in data-warehouse silos -> limited volume: gigabytes~terabytes, seldom petabytes
• Processing involves simple O(n) algorithms, or those that can be accelerated with DB-inherited indexing algorithms
• Executed on re-purposed commodity "web" servers linked with 1 Gbps networks running Hadoop/HDFS
• Vicious cycle of stagnation in innovations…
• NEW: breaking down of silos, i.e. convergence of supercomputing with extreme big data
61. But "Extreme Big Data" will change everything
• "Breaking down of silos" (Rajeeb Hazra, Intel VP of Technical Computing)
• Already happening in science & engineering due to the Open Data movement
• More complex analysis algorithms: O(n log n), O(m × n), …
• Will become the NORM for competitiveness reasons
67. Towards Continuous Billion-Scale Social Simulation with Real-Time Streaming Data (Toyotaro Suzumura, IBM / Tokyo Tech)
• Application: target area the whole planet (OpenStreetMap), 7 billion people
• Input data
  - Road network (OpenStreetMap) for the planet: 300 GB (XML)
  - Trip data for 7 billion people: 10 KB (1 trip) × 7 billion = 70 TB
  - Real-time streaming data (e.g. social sensors, physical data)
• Simulated output for 1 iteration: 700 TB
70. But what does "220 Tbps" mean?
Global IP traffic, 2011-2016 (source: Cisco), by type: PB per month / average bitrate in Tbps

                 2011           2012           2013           2014           2015           2016            CAGR 2011-16
Fixed Internet   23,288 / 71.9  32,990 / 101.8 40,587 / 125.3 50,888 / 157.1 64,349 / 198.6 81,347 / 251.1  28%
Managed IP       6,849 / 21.1   9,199 / 28.4   11,846 / 36.6  13,925 / 43.0  16,085 / 49.6  18,131 / 56.0   21%
Mobile data      597 / 1.8      1,252 / 3.9    2,379 / 7.3    4,215 / 13.0   6,896 / 21.3   10,804 / 33.3   78%
Total IP traffic 30,734 / 94.9  43,441 / 134.1 54,812 / 169.2 69,028 / 213.0 87,331 / 269.5 110,282 / 340.4 29%

The TSUBAME2.0 network has TWICE the capacity of the global Internet, which serves 2.1 billion users.
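The average-bitrate column can be derived from the PB-per-month column; a sketch assuming decimal petabytes and a 30-day month, which reproduces Cisco's figures:

```python
def pb_per_month_to_tbps(pb_per_month, days=30):
    """Convert decimal petabytes/month into an average rate in Tbps."""
    bits = pb_per_month * 1e15 * 8      # PB -> bits
    seconds = days * 24 * 3600          # seconds in an (assumed) 30-day month
    return bits / seconds / 1e12        # bits/s -> terabits/s

print(round(pb_per_month_to_tbps(23_288), 1))   # 2011 fixed Internet: ~71.9
print(round(pb_per_month_to_tbps(110_282), 1))  # 2016 total IP: ~340.4
```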
72. What does this all mean?
• "Leveraging of mainframe technologies in HPC has been dead for some time."
• But will leveraging Cloud/Mobile be sufficient?
• NO! They are already falling behind, and will be perpetually behind
  - CAGR of Clouds ~30% vs. HPC ~100%: all data supports it
  - Stagnation in network, storage, scaling, …
• Rather, HPC will be the technology driver for future Big Data, for Cloud/Mobile to leverage, rather than repurposed standard servers
73. Future "Extreme Big Data"
- NOT mining terabytes of silo data
- Peta~zettabytes of data
- Ultra-high-BW data streams
- Highly unstructured, irregular
- Complex correlations between data from multiple sources
- Extreme capacity, bandwidth, and compute all required
74. Extreme Big Data Is Not Just Traditional HPC!
Analysis of required system properties [slide courtesy Alok Choudhary, Northwestern U.]
[Radar chart (scale 0-1) comparing Extreme-Scale Computing, Big Data Analytics, and the BDEC Knowledge Discovery Engine along the axes: processor speed, OPS, memory/ops, algorithmic variety, power-optimization opportunities, comm-pattern variability, comm-latency tolerance, approximate computations, read performance, write performance, and local persistent storage]
75. EBD Research Scheme
Future non-silo extreme big data apps, co-designed with the EBD system software:
- Ultra-large-scale graphs and social infrastructures
- Large-scale metagenomics
- Massive sensors and data assimilation in weather prediction
EBD system software, incl. the EBD object system (EBD Bag, EBD KVS, Cartesian-plane KVS, graph store)
[Architecture figure: exascale big-data HPC convergent architecture (Phases 1~4). A high-powered main CPU and low-power CPUs on a TSV interposer with 4~6 HBM channels (2 Tbps HBM); DRAM and NVM/Flash on the PCB (1.5 TB/s DRAM & NVM BW); 30 PB/s I/O BW possible; 1 Yottabyte/year; large-capacity NVM and a high-bisection network. Cf. cloud IDCs (very low BW & efficiency) and classical supercomputers (compute- & batch-oriented).]
79. High-Performance Sorting
Fast algorithms: distribution-based vs. comparison-based
- Comparison of keys: classical N log(N) sorts (quick, merge, etc.), suited to variable-length / long keys
- Distribution-based: MSD radix sort (short keys; doesn't have to examine all characters), LSD radix sort (high efficiency on small fixed-length keys, e.g. THRUST), integer sorts; GPUs are good at counting numbers
- Bitonic sort: scalability
- Computational genomics: small alphabets (A, C, G, T)
- Map-Reduce: Hadoop is easy to use but not that efficient; efficient implementations needed
- Hybrid approaches: the best is yet to be found; good for GPU nodes; balancing I/O and computation
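As a minimal illustration of the distribution-based family named above, an LSD radix sort on fixed-width integer keys, one stable counting pass per byte. This is a CPU sketch only; GPU/Thrust versions exploit the same counting structure in parallel:

```python
def lsd_radix_sort(keys, key_bytes=4):
    """Sort non-negative fixed-width integers, least significant byte first.

    Each pass is a stable bucket/counting sort on one byte, so after the
    final pass the list is fully sorted -- no key comparisons are made.
    """
    for shift in range(0, key_bytes * 8, 8):
        buckets = [[] for _ in range(256)]     # one bucket per byte value
        for k in keys:
            buckets[(k >> shift) & 0xFF].append(k)
        # Stable concatenation preserves the order from earlier passes.
        keys = [k for bucket in buckets for k in bucket]
    return keys

data = [0xC0FFEE, 42, 7, 0xBEEF, 1_000_000, 42]
print(lsd_radix_sort(data) == sorted(data))  # True
```

Note the work is O(passes × n) with no comparisons, which is why small fixed-length (e.g. integer) keys favor LSD radix sorting, matching the taxonomy on the slide.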
82. Summary
• TSUBAME1.0 -> 2.0 -> 2.5 -> 3.0 -> …
  - TSUBAME2.5: Number 1 in Japan, 17 Petaflops SFP
  - Template for future supercomputers and IDC machines
• TSUBAME3.0, early 2016
  - New supercomputing leadership
  - Tremendous power efficiency, extreme big data, extremely high reliability
• Lots of background R&D for TSUBAME3.0 and towards exascale
  - Green computing: ULP-HPC & TSUBAME-KFC
  - Extreme Big Data: convergence of HPC and IDC!
  - Exascale resilience
  - Programming with millions of cores
  - …
• Please stay tuned, and thank you for your support!