(Slides 1-3: title slide "AI Bridging Cloud Infrastructure as World's First Large-scale Open AI Infrastructure" and introductory material; the Japanese slide text is not recoverable from this extraction.)
AI Bridging Cloud Infrastructure
as World’s First Large-scale Open AI Infrastructure
• Open, Public, and Dedicated infrastructure for AI/Big Data
• Platform to accelerate joint academic-industry R&D for AI in Japan
• Top-level compute capability w/ 0.550EFlops(AI), 37.2 PFlops(DP)
4
Univ. Tokyo Kashiwa II Campus
Operation Scheduled in 2018
• 1088x compute nodes w/ 4352x NVIDIA Tesla V100 GPUs, 43520 CPU Cores,
476TiB of Memory, 1.6PB of NVMe SSDs, 22PB of HDD-based Storage and
Infiniband EDR network
• Ultra-dense IDC design from the ground-up w/ 20x thermal density of standard IDC
• Extreme Green w/ ambient warm liquid cooling, high-efficiency power supplies, etc.,
commoditizing supercomputer cooling technologies to clouds ( 2.3MW, 70kW/rack)
5
Gateway and
Firewall
Computing Nodes: 0.550 EFlops(HP), 37 PFlops(DP)
476 TiB Mem, 1.6 PB NVMe SSD
Storage: 22 PB GPFS
High Performance Computing Nodes (w/GPU) x1088
• Intel Xeon Gold6148 (2.4GHz/20cores) x2
• NVIDIA Tesla V100 (SXM2) x 4
• 384GiB Memory, 1.6TB NVMe SSD
Multi-platform Nodes (w/o GPU) x10
• Intel Xeon Gold6132 (2.6GHz/14cores) x2
• 768GiB Memory, 3.8TB NVMe SSD
Interactive Nodes
DDN SFA14K
(w/ SS8462 Enclosure x 10) x 3
• 12TB 7.2Krpm NL-SAS HDD x 2400
• 3.84TB SAS SSD x 216
• NSD Servers x 12
Object Storage for Protocol Nodes
100GbE
Service Network (10GbE)
External
Networks
SINET5
Interconnect (Infiniband EDR)
ABCI: AI Bridging Cloud Infrastructure
6
System (32 Racks): 37.2 PFlops (DP), 0.550 EFlops (AI); 19.88 PFlops (Peak), Ranked #5 Top500 June 2018; 1088 Compute Nodes, 4352 GPUs, 4.19 PB/s MEM BW, 1/3 of Oversubscription BW, 2.3MW
Rack (17 Chassis): 1.16 PFlops (DP), 17.2 PFlops (AI), 131 TB/s MEM BW, Full Bisection BW within Rack, 70kW Max
Node Chassis (2 Compute Nodes): 68.5 TFlops (DP), 1.01 PFlops (AI)
Compute Node (4 GPUs, 2 CPUs): 34.2 TFlops (DP), 506 TFlops (AI), 3.72 TB/s MEM BW, 384 GiB MEM, 200 Gbps NW BW, 1.6TB NVMe SSD
Chips (GPU, CPU):
GPU: NVIDIA Tesla V100 (16GB SXM2), 7.8 TFlops (DP), 125 TFlops (AI)
CPU: Intel Xeon Gold 6148 (27.5M Cache, 2.40 GHz, 20 Core), 1.53 TFlops (DP), 3.07 TFlops (AI)
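As a consistency check on these figures: one compute node delivers about 4 x 7.8 + 2 x 1.53 ≈ 34.3 TFlops (DP) and 4 x 125 + 2 x 3.07 ≈ 506 TFlops (AI), so 1088 nodes give 1088 x 34.2 TFlops ≈ 37.2 PFlops (DP) and 1088 x 506 TFlops ≈ 0.55 EFlops (AI).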
GPU Compute Nodes
• NVIDIA TESLA V100
(16GB, SXM2) x 4
• Intel Xeon Gold 6148
x 2 Sockets
– 20 cores per Socket
• 384GiB of DDR4 Memory
• 1.6TB NVMe SSD x 1
– Intel DC P4600 U.2
• EDR Infiniband HCA x 2
– Connected to other Compute Nodes
and Filesystems
7
Node block diagram: two Xeon Gold 6148 sockets linked by UPI x3 (10.4GT/s), each socket with DDR4-2666 32GB x 6 (128GB/s); each socket connects over PCIe gen3 x16 to a PCIe switch (one x48, one x64), and each switch hosts an IB HCA (100Gbps) plus two Tesla V100 SXM2 GPUs; the GPUs are interconnected with NVLink2 x2, and the NVMe SSD is also attached via PCIe.
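The NVLink2/PCIe/UPI layout above can be inspected on a live node with NVIDIA's standard tool (a generic command, not specific to this deck):

$ nvidia-smi topo -m    # prints the GPU/NIC connectivity matrix (NVLink, PCIe switch, socket-level links)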
Rack as Dense-packaged “Pod”
8
Pod #1:
LEAF#1 (SB7890), LEAF#2 (SB7890), LEAF#3 (SB7890), LEAF#4 (SB7890)
SPINE#1 (CS7500), SPINE#2 (CS7500)
FBB#1 (SB7890), FBB#2 (SB7890), FBB#3 (SB7890)
CX400#1 (CX2570#1, CX2570#2), CX400#2 (CX2570#3, CX2570#4), CX400#3 (CX2570#5, CX2570#6), ..., CX400#17 (CX2570#33, CX2570#34)
Full bisection BW within the pod: IB-EDR x 72
1/3 Oversubscription BW across pods: IB-EDR x 24
Link multiplicities shown: InfiniBand EDR x1, x4, x6
x 32 pods
Hierarchical Storage Tiers
• Local Storage
– 1.6 TB NVMe SSD (Intel DC P4600 U.2) per Node
– Local Storage Aggregation w/ BeeOND (see the sketch after this slide)
• Parallel Filesystem
– 22PB of GPFS
• DDN SFA14K ( w/ SS8462 Enclosure x 10) x 3 set
• Bare Metal NSD servers and Flash-based Metadata
Volumes for metadata operation acceleration
– Home and Shared Use
• Object Storage
– Part of GPFS using OpenStack Swift
– S3-like API Access, Global Shared Use
– Additional Secure Volumes w/ Encryption
(Planned)
9
Parallel Filesystem
Local Storage
as Burst Buffers
Object Storage as Campaign Storage
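As a rough illustration of the local-storage aggregation above, BeeOND can combine the per-node NVMe SSDs assigned to a job into one temporary shared filesystem. A minimal sketch, assuming BeeOND is installed on the compute nodes; the nodefile, directories, and mount point are placeholders:

$ beeond start -n nodefile -d /local/beeond -c /mnt/beeond   # nodefile lists the job's hosts
  ... run the job against /mnt/beeond (backed by each node's NVMe SSD) ...
$ beeond stop -n nodefile                                    # tear down the instance (cleanup flags vary by version)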
Performance Reference for Distributed Deep Learning
10
• Environments
– ABCI 64 nodes (256 GPUs)
– Framework: ChainerMN v1.3.0
• Chainer 4.2.0, Cupy 4.2.3, mpi4py 3.0.0, Python 3.6.5
– Baremetal
• CentOS 7.4, gcc-4.8.5,
CUDA 9.2, CuDNN 7.1.4, NCCL2.2, OpenMPI 2.1.3
• Settings
– Dataset: Imagenet-1K
– Model: ResNet-50
– Training:
• Batch size: 32 per GPU, 32 x 256 in total
• Learning Rate: starting at 0.1, multiplied by 0.1 at epochs 30, 60, and 80,
w/ warm-up scheduling
• Optimization: Momentum SGD (momentum=0.9)
• Weight Decay: 0.0001
• Training Epoch: 100
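A run with these settings would be launched as one MPI rank per GPU across the 64 nodes. A hedged sketch following the public ChainerMN ImageNet example (script name and flags come from that example, not from this deck):

$ mpirun -np 256 -npernode 4 \
    python train_imagenet.py train.txt val.txt \
    --arch resnet50 --batchsize 32 --epoch 100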
11
Job execution model (High Throughput Computing): users log in via SSH, keep data and job scripts in /home (GPFS), and submit jobs to the batch scheduler (NQS):
$ qsub <option> script_filename
Submitted jobs are scheduled onto compute nodes connected by the interconnect.
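Concretely, a submission looks like the following. A minimal sketch using generic Grid Engine-style directives; actual resource options (GPU counts, groups, queues) are site-specific and not taken from this deck:

$ cat job.sh
#!/bin/bash
#$ -cwd                  # run in the submission directory
#$ -l h_rt=1:00:00       # wall-clock time limit
python train.py          # hypothetical user program
$ qsub job.sh            # returns a job ID; qstat shows the queued/running state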
(Slide 12: Japanese text not recoverable from the extraction.)
13
(cont'd)
CUDA 8.0: 8.0.44, 8.0.61.2
CUDA 9.0: 9.0.176
CUDA 9.1: 9.1.85, 9.1.85.1, 9.1.85.3
CUDA 9.2: 9.2.88.1
CuDNN 5.1: 5.1.5, 5.1.10
CuDNN 6.0: 6.0.21
CuDNN 7.0: 7.0.5
CuDNN 7.1: 7.1.1, 7.1.2, 7.1.3
NCCL 1.3: 1.3.4, 1.3.40-1
NCCL 2.0: 2.0.5-3
NCCL 2.1: 2.1.4-1, 2.1.15-1
NCCL 2.2: 2.2.12
OpenMPI: 2.1.3, 3.0.1, 3.1.0
MVAPICH2-GDR: 2.3a
Python: 2.7, 3.5, 3.6
Python Modules: mpi4py, Cython, Pillow, matplotlib, Jupyter
DL Frameworks: Caffe2, CNTK, ChainerMN, Tensorflow, MXNet, Nnabla
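If these stacks are exposed through environment modules (an assumption; the deck only lists the versions), selecting one combination per job would look like:

$ module avail                               # list available software/versions
$ module load cuda/9.2 cudnn/7.1 nccl/2.2    # hypothetical module names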
Software Stack for ABCI
• Batch Job Scheduler
– High throughput computing
• Minimum Pre-installed Software
– Users can deploy their environments
using anaconda, pip, python venv, etc.
– Reduce operational cost
• Container Support
– Singularity for multi-node jobs w/
user customized images
– Docker for single-node jobs w/
site-certified images
14
User Applications
DL Frameworks, Hadoop/Spark, OSS, ISV Apps
Python, Ruby, R, Java, Scala, Perl, Lua, etc.
GCC, PGI, Intel Parallel Studio XE Cluster Edition
OpenMPI, MVAPICH2, CUDA/CuDNN/NCCL
GPFS, BeeOND, OpenStack Swift
Univa Grid Engine, Singularity, Docker
CentOS/RHEL
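On top of this minimal stack, users assemble their own Python environments with standard tooling; a simple sketch (package choices are illustrative):

$ python3 -m venv ~/venv/dl               # create an isolated environment
$ source ~/venv/dl/bin/activate
$ pip install --upgrade pip
$ pip install chainer chainermn mpi4py    # user-selected packages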
ABCI : Dynamic Container Deployment
with HPC Linux Containers
Linux Containers (Singularity, Docker)
Components: Compute Nodes, Container Images, Jobs, Job Scheduler, GPFS/Object Storage, container image repository (Dockerhub, private registry)
Workflow: register/copy container images to the repository, import/copy them onto GPFS/Object Storage, then submit jobs with container images; each job runs on a compute node from its container image.
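In practice this flow reduces to importing an image once and referencing it from jobs. A hedged example with Singularity (image name and job script are placeholders):

$ singularity pull --name ubuntu.img docker://ubuntu:16.04   # convert a Docker Hub image to a Singularity image
$ qsub run_in_container.sh    # the script invokes: singularity exec ubuntu.img <command>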
(Slide 16: positioning of Linux container runtimes, including CharlieCloud, between Enterprise and HPC use; Japanese text not recoverable from the extraction.)
(Slide 17: Japanese text not recoverable from the extraction.)
Building Singularity images (requires root):
sudo singularity build --sandbox tmpdir/ Singularity
sudo singularity build --writable container.img Singularity
sudo singularity build container.img Singularity
sudo singularity build container.img docker://ubuntu
sudo singularity build container.img shub://ubuntu
Modifying a writable image:
sudo singularity shell --writable container.img
Running container.img (no root required):
singularity run container.img
singularity exec container.img …
singularity shell container.img
(Slide 19: Japanese text not recoverable from the extraction.)
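The "Singularity" argument in the build commands above is a recipe (definition) file; a minimal illustrative example, bootstrapping from Docker Hub (contents are not taken from the deck):

Bootstrap: docker
From: ubuntu:16.04

%post
    apt-get update && apt-get install -y python3 python3-pip

%environment
    export LC_ALL=C

%runscript
    exec python3 "$@"

Built with "sudo singularity build container.img Singularity", the resulting image runs python3 via "singularity run container.img".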
(Slide 20: text not recoverable from the extraction.)
(Slide 21: using GPUs from containers via the --nv option; remaining text not recoverable from the extraction.)
(Slide 22: text not recoverable from the extraction.)
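The --nv option bind-mounts the host's NVIDIA driver libraries and device files into the container; a minimal check (generic Singularity usage, not from this deck):

$ singularity exec --nv container.img nvidia-smi    # the node's GPUs should be listed from inside the container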
(Slide 23: text largely not recoverable from the extraction; includes a performance chart labeled "Better".)
24
Base Drivers, Libraries on Host: CUDA Drivers, Infiniband Drivers, Filesystem Libraries (GPFS, Lustre)
Userland Libraries on Container: CUDA, CuDNN, NCCL2, MPI (mpi4py), ibverbs (mounted over the host-side drivers)
Distributed Deep Learning Frameworks: Caffe2, ChainerMN, Distributed Tensorflow, MXNet
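Putting these layers together, the host provides drivers and the MPI launcher while CUDA/CuDNN/NCCL and the framework live in the image; a hedged sketch of a containerized ChainerMN launch (image and script names are placeholders):

$ mpirun -np 256 -npernode 4 \
    singularity exec --nv chainermn.img \
    python train_imagenet.py train.txt val.txt --arch resnet50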
(Slide 25: no text recoverable from the extraction.)
(Slides 26-29: Japanese text not recoverable from the extraction.)