Reflecting on the Goal and Baseline of Exascale Computing
Thomas C. Schulthess

Tracking supercomputer performance over time?

The Linpack benchmark solves Ax = b.
1,000-fold performance improvement per decade.
Chart annotations: 1st application at > 1 TFLOP/s sustained; 1st application at > 1 PFLOP/s sustained; KKR-CPA (MST), LSMS (MST), WL-LSMS (MST).
The 1,000x performance improvement per decade seems to hold for multiple-scattering-theory (MST) based electronic structure codes for materials science.

"Only" 100-fold performance improvement in climate codes
Source: Peter Bauer, ECMWF

Has the efficiency of weather & climate codes dropped 10-fold every decade?

Floating-point efficiency dropped from 50% on the Cray Y-MP to 5% on today's Cray XC (10x in 2 decades)
Source: Peter Bauer, ECMWF
Chart annotations: KKR-CPA (MST), LSMS (MST), WL-LSMS (MST); Cray Y-MP @ 300 kW, IBM P5 @ 400 kW, IBM P6 @ 1.3 MW, Cray XT5 @ 1.8 MW, Cray XT5 @ 7 MW.
System size (in energy footprint) grew much faster on "Top500" systems.

Source: Christoph Schär, ETH Zurich, & Nils Wedi, ECMWF
Can the delivery of a 1 km-scale capability be pulled in by a decade?

Leadership in weather and climate
The European model may be the best – but it is still far from sufficient accuracy and reliability!
Peter Bauer, ECMWF

The impact of resolution: simulated tropical cyclones
Panels: 130 km, 60 km, 25 km, observations
HADGEM3 PRACE UPSCALE, P.L. Vidale (NCAS) and M. Roberts (MO/HC)

Resolving convective clouds (convergence?)
Bulk convergence – area-averaged bulk effects upon the ambient flow, e.g. heating and moistening of the cloud layer.
Structural convergence – statistics of the cloud ensemble, e.g. spacing and size of convective clouds.
Source: Christoph Schär, ETH Zurich

Structural and bulk convergence
[Figure: (left) statistics of cloud area – relative frequency vs. cloud area (km^2), with grid-scale cloud fractions of 71, 64, 54, 47 and 43%, and a "factor 4" marked; (right) statistics of up- & downdrafts – relative frequency vs. convective mass flux (kg m^-2 s^-1); grid spacings of 8 km, 4 km, 2 km, 1 km and 500 m.]
Statistics of cloud area: no structural convergence. Statistics of up- & downdrafts: bulk statistics of updrafts converges.
Source: Christoph Schär, ETH Zurich (Panosetti et al. 2018)

What resolution is needed?
• There are threshold scales in the atmosphere and ocean: going from 100 km to 10 km is incremental, 10 km to 1 km is a leap. At 1 km
• it is no longer necessary to parametrise precipitating convection, ocean eddies, or orographic wave drag and its effect on extratropical storms;
• ocean bathymetry, overflows and mixing, as well as regional orographic circulation in the atmosphere, become resolved;
• the connections between the remaining parametrisations are now on a physical footing.
• We have spent the last five decades in a paradigm of incremental advances, in which we incrementally improved the resolution of models from 200 km to 20 km.
• Exascale allows us to make the leap to 1 km. This fundamentally changes the structure of our models: we move from crude parametric representations to an explicit, physics-based description of essential processes.
• The last such step change was fifty years ago, when, in the late 1960s, climate scientists first introduced global climate models, which were distinguished by their ability to explicitly represent extratropical storms, ocean gyres and boundary currents.
Bjorn Stevens, MPI-M

Simulation throughput: Simulated Years Per Day (SYPD)

                          NWP      Climate in production   Climate spinup
Simulated time            10 d     100 y                   5,000 y
Desired wall-clock time   0.1 d    0.1 y                   0.5 y
Time compression ratio    100      1,000                   10,000
SYPD                      0.27     2.7                     27

Minimum throughput: 1 SYPD, preferred 5 SYPD
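
To make the arithmetic behind the table explicit, here is a minimal Python sketch (my own illustration, not from the talk) that reproduces the SYPD row from the time-compression ratio:

    # SYPD = time compression expressed in simulated years per wall-clock day
    def sypd(simulated_days, wallclock_days):
        compression = simulated_days / wallclock_days
        return compression / 365.0

    print(round(sypd(10, 0.1), 2))                # NWP: 0.27 SYPD
    print(round(sypd(100 * 365, 0.1 * 365), 1))   # climate production: 2.7 SYPD
    print(round(sypd(5000 * 365, 0.5 * 365)))     # climate spinup: 27 SYPD
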
Summary of intermediate goal (reach by 2021?)

Horizontal resolution   1 km (globally quasi-uniform)
Vertical resolution     180 levels (surface to ~100 km)
Time resolution         Less than 1 minute
Coupled                 Land surface / ocean / ocean waves / sea ice
Atmosphere              Non-hydrostatic
Precision               Single (32-bit) or mixed precision
Compute rate            1 SYPD (simulated years per wall-clock day)

Running COSMO 5.0 at global scale on Piz Daint

Scaling to full system size: ~5,300 GPU-accelerated nodes available.
Running a near-global (±80°, covering 97% of the Earth's surface) COSMO 5.0 simulation & IFS:
> either on the host processors: Intel Xeon E5-2690 v3 (Haswell, 12 cores),
> or on the GPU accelerator: the PCIe version of the NVIDIA GP100 (Pascal) GPU.

"Piz Kesch"
"Today's Outlook: GPU-accelerated Weather Forecasting", John Russell, September 15, 2015

Where the factor 40 improvement came from

Requirements from MeteoSwiss at constant budget for investments and operations: a finer grid (2.2 km → 1.1 km), ensembles with multiple forecasts, and data assimilation – annotated 24x, 10x and 6x on the chart (scale 1x to 40x).
We need a 40x improvement between 2012 and 2015 at constant cost.

1.7x from software refactoring (old vs. new implementation on x86)
2.8x from mathematical improvements (resource utilisation, precision)
2.8x from Moore's Law & architectural improvements on x86
2.3x from the change in architecture (CPU → GPU)
1.3x from additional processors

Investment in software allowed the mathematical improvements and the change in architecture. There is no silver bullet! Bonus: a reduction in power!
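
As a check (my own sketch, numbers from the slide): the individual factors compound multiplicatively to roughly the required 40x.

    from math import prod

    factors = {
        "software refactoring (old vs. new on x86)": 1.7,
        "mathematical improvements":                 2.8,
        "Moore's Law & x86 architecture":            2.8,
        "change in architecture (CPU -> GPU)":       2.3,
        "additional processors":                     1.3,
    }
    print(round(prod(factors.values()), 1))  # ~39.9, i.e. about the 40x needed
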
Near-global climate simulation at 1 km resolution: establishing a performance baseline on 4,888 GPUs with COSMO 5.0

Metric: simulated years per wall-clock day (SYPD).

[Figure: (a) weak scalability of the COSMO time step of the dry simulation on the hybrid P100 nodes of Piz Daint, for per-node subdomains of 128x128, 160x160, 192x192 and 256x256 gridpoints, up to 4,400 nodes; (b) strong scalability on Piz Daint, SYPD vs. number of nodes for Δx = 19 km, 3.7 km, 1.9 km and 930 m – filled symbols on P100 GPUs, empty symbols on Haswell CPUs using 12 MPI ranks per node.]

(c) Time compression (SYPD) and energy cost (MWh/SY) for three moist simulations (930 m from a full 10-day simulation, 1.9 km from 1,000 steps, 47 km from 100 steps):

    ⟨Δx⟩     #nodes   Δt [s]   SYPD    MWh/SY   gridpoints
    930 m    4,888    6        0.043   596      3.46 × 10^10
    1.9 km   4,888    12       0.23    97.8     8.64 × 10^9
    47 km    18       300      9.6     0.099    1.39 × 10^7

2.5x faster than Yang et al.'s 2016 Gordon Bell winning run on TaihuLight!
Fuhrer et al., Geosci. Model Dev. Discuss., https://doi.org/10.5194/gmd-2017-230, published 2018
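
SYPD and MWh/SY are tied together by the machine's average power draw (energy per simulated year = power × wall-clock hours per simulated year). A minimal sketch of that relation; the implied power is my inference from the table, not a number stated in the paper:

    # MWh/SY = P[MW] * 24 h / SYPD  =>  P[MW] = MWh/SY * SYPD / 24
    def implied_power_mw(mwh_per_sy, sypd):
        return mwh_per_sy * sypd / 24.0

    print(round(implied_power_mw(596.0, 0.043), 2))   # 930 m run on 4,888 nodes: ~1.07 MW
    print(round(implied_power_mw(97.8, 0.23), 2))     # 1.9 km run: ~0.94 MW
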
The baseline for COSMO-global and IFS
(values and shortfall factors relative to the intermediate 2021 goal above)

                          Near-global COSMO [Fuh2018]                       Global IFS [Wed2009]
Horizontal resolution     0.93 km (non-uniform)                0.9x         1.25 km                         1.25x
Vertical resolution       60 levels (surface to 25 km)         3x           62 levels (surface to 40 km)    3x
Time resolution           6 s (split-explicit, sub-stepping)   -            120 s (semi-implicit)           4x
Coupled                   No                                   1.2x         No                              1.2x
Atmosphere                Non-hydrostatic                      -            Non-hydrostatic                 -
Precision                 Double                               0.6x         Single                          -
Compute rate              0.043 SYPD                           23x          0.088 SYPD                      11x
Other (I/O, physics, …)   Limited I/O, only microphysics       1.5x         Full physics, no I/O            -
Total shortfall                                                65x                                          198x
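
The total shortfall is simply the product of the row factors. A small sketch (my own check against the table; the individual entries are rounded, which is why COSMO comes out near 67 rather than exactly 65):

    from math import prod

    cosmo = [0.9, 3, 1.2, 0.6, 23, 1.5]   # resolution, levels, coupling, precision, compute rate, I/O+physics
    ifs   = [1.25, 3, 4, 1.2, 11]         # resolution, levels, time step, coupling, compute rate

    print(round(prod(cosmo)))   # ~67, quoted as ~65x on the slide
    print(round(prod(ifs)))     # 198x
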
Memory use efficiency

MUE = I/O efficiency × BW efficiency = (Q / D) × (B / B̂)

where Q are the necessary data transfers, D the actual data transfers, B the achieved bandwidth, and B̂ the maximum achievable bandwidth.
For the 930 m COSMO run: I/O efficiency Q/D = 0.88, BW efficiency B/B̂ = 0.76.

[Figure: measured memory bandwidth (GB/s, 0-600) vs. data size (MB, 0.1-1,000) on a Piz Daint GPU node for the micro-benchmarks COPY (double, a[i] = b[i]), GPU STREAM (double, a[i] = b[i], 1D), COPY (float), AVG i-stride (float, a[i] = b[i-1] + b[i+1]) and 5-POINT (float, a[i] = b[i] + b[i+1] + b[i-1] + b[i+jstride] + b[i-jstride]); annotated values 28.2 MB and 362 GB/s – about 2x lower than peak bandwidth.]

MUE = 0.88 × 0.76 = 0.67
Fuhrer et al., Geosci. Model Dev. Discuss., https://doi.org/10.5194/gmd-2017-230, published 2018
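
A minimal sketch of the metric as defined above (my own illustration), plugging in the efficiencies quoted for the 930 m run:

    # MUE = (necessary / actual data transfers) * (achieved / max achievable bandwidth)
    def memory_use_efficiency(q_necessary, d_actual, b_achieved, b_max):
        io_efficiency = q_necessary / d_actual   # Q / D
        bw_efficiency = b_achieved / b_max       # B / B-hat
        return io_efficiency * bw_efficiency

    # Normalised values from the slide: Q/D = 0.88, B/B-hat = 0.76
    print(round(memory_use_efficiency(0.88, 1.0, 0.76, 1.0), 2))  # 0.67
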
How realistic is it to overcome the 65-fold shortfall of a grid-based implementation like COSMO-global?

1. Icosahedral grid (ICON) vs. lat-long/Cartesian grid (COSMO): 2x fewer grid columns and a time step of 10 s instead of 5 s → 4x
2. Improving bandwidth efficiency: improve BW efficiency and peak BW → 2x (results on Volta show this is realistic)
3. Weak scaling: 4x would be possible in COSMO, but the available parallelism was reduced by a factor of 2 → 2x
4. Remaining reduction in shortfall → 4x: numerical algorithms (larger time steps) and further improved processors / memory

Together these contributions cover roughly the required 65x (see the sketch below).
But we don't want to increase the footprint of the 2021 system beyond "Piz Daint".
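
A minimal sketch (my own check, numbers from the four points above) showing how the contributions multiply:

    from math import prod

    reductions = {
        "ICON grid + larger time step":            4,
        "bandwidth efficiency and peak bandwidth": 2,
        "weak-scaling headroom":                   2,
        "algorithms + future processors/memory":   4,
    }
    print(prod(reductions.values()))  # 64, i.e. roughly the 65x shortfall
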
The importance of ensembles
Peter Bauer, ECMWF

Remaining goals beyond 2021 (by 2024?)
1. Improve the throughput to 5 SYPD
2. Reduce the footprint of a single simulation by up to a factor of 10

Data challenge: what if all models run at 1 km scale?

With the current workflow used by the climate community, an IPCC-class simulation campaign at 1 km scale would on average produce 50 exabytes of data!
The only way out would be to analyse the model output while the simulation is running.
New workflow:
1. A first set of runs generates model trajectories and checkpoints.
2. Model trajectories are reconstructed from the checkpoints and analysed (as often as necessary).
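
A rough back-of-envelope (my own illustration: the gridpoint count is the 930 m figure from the baseline table, while the number of output fields, output interval, precision, run length and ensemble size are assumptions made purely for illustration) shows why exabytes accumulate quickly:

    gridpoints    = 3.46e10          # ~930 m global grid (baseline table above)
    fields        = 10               # assumed number of 3D output fields
    bytes_per_val = 4                # single precision
    snapshots     = 24 * 365 * 100   # assumed hourly output over a 100-year run
    members       = 30               # assumed ensemble members / scenarios

    total_bytes = gridpoints * fields * bytes_per_val * snapshots * members
    print(round(total_bytes / 1e18), "EB")   # ~36 EB, the same order as the 50 EB quoted above
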
Summary and Conclusions
• While flop/s may be a good metric for comparing performance across the history of computing, it is not a good metric for designing the systems of the future.
• Given today's challenges, use Memory Use Efficiency (MUE) instead: MUE = "I/O efficiency" × "bandwidth efficiency".
• Convection-resolving weather and climate simulations at 1 km horizontal resolution:
• represent a big leap in the quality of weather and climate simulations;
• the aim is to pull in this milestone by a decade, to the early 2020s rather than the 2030s;
• desired throughput is 1 SYPD by 2021 and 5 SYPD by the mid 2020s;
• we need a 50-200x performance improvement to reach the 2021 goal of 1 SYPD.
• System improvements needed:
• improve "bandwidth efficiency" for regular but non-stride-1 memory access.
• Improve application software:
• further improve scalability;
• improve algorithms (time integration has potential).

Collaborators
Tim Palmer (U. of Oxford)
Christoph Schär (ETH Zurich)
Oliver Fuhrer (MeteoSwiss)
Peter Bauer (ECMWF)
Bjorn Stevens (MPI-M)
Torsten Hoefler (ETH Zurich)
Nils Wedi (ECMWF)


Reflecting on the Goal and Baseline of Exascale Computing

  • 1. T. Schulthess| Thomas C. Schulthess !1 Reflecting on the Goal and Baseline of Exascale Computing
  • 2. T. Schulthess| Tracking supercomputer performance over time? !2 Ax = bLinpack benchmark solves:
  • 3. T. Schulthess| Tracking supercomputer performance over time? !2 Ax = bLinpack benchmark solves:
  • 4. T. Schulthess| Tracking supercomputer performance over time? !2 Ax = bLinpack benchmark solves: 1,000-fold performance improvement per decade
  • 5. T. Schulthess| Tracking supercomputer performance over time? !2 Ax = bLinpack benchmark solves: 1st application at > 1 TFLOP/s sustained 1,000-fold performance improvement per decade
  • 6. T. Schulthess| Tracking supercomputer performance over time? !2 Ax = bLinpack benchmark solves: 1st application at > 1 TFLOP/s sustained 1st application at > 1 PFLOP/s sustained 1,000-fold performance improvement per decade
  • 7. T. Schulthess| Tracking supercomputer performance over time? !2 Ax = bLinpack benchmark solves: 1st application at > 1 TFLOP/s sustained 1st application at > 1 PFLOP/s sustained 1,000-fold performance improvement per decade KKR-CPA (MST)
  • 8. T. Schulthess| Tracking supercomputer performance over time? !2 Ax = bLinpack benchmark solves: 1st application at > 1 TFLOP/s sustained 1st application at > 1 PFLOP/s sustained 1,000-fold performance improvement per decade KKR-CPA (MST) LSMS (MST)
  • 9. T. Schulthess| Tracking supercomputer performance over time? !2 Ax = bLinpack benchmark solves: 1st application at > 1 TFLOP/s sustained 1st application at > 1 PFLOP/s sustained 1,000-fold performance improvement per decade KKR-CPA (MST) LSMS (MST) WL-LSMS (MST)
  • 10. T. Schulthess| Tracking supercomputer performance over time? !2 Ax = bLinpack benchmark solves: 1st application at > 1 TFLOP/s sustained 1st application at > 1 PFLOP/s sustained 1,000-fold performance improvement per decade KKR-CPA (MST) LSMS (MST) WL-LSMS (MST)
  • 11. T. Schulthess| Tracking supercomputer performance over time? !2 Ax = bLinpack benchmark solves: 1st application at > 1 TFLOP/s sustained 1st application at > 1 PFLOP/s sustained 1,000-fold performance improvement per decade KKR-CPA (MST) LSMS (MST) WL-LSMS (MST) 1,000x perf. improv. per decade seems hold for multiple-scattering-theory(MST)- based electronic structure for materials science
  • 12. T. Schulthess| “Only” 100-fold performance improvement in climate codes !3 Source: Peter Bauer, ECMWFSource: Peter Bauer, ECMWF
  • 13. T. Schulthess| !4 Has the efficiency of weather & climate codes dropped 10-fold every decade?
  • 14. T. Schulthess| Floating points efficiency dropped from 50% on Cray Y-MP to 5% on today’s Cray XC (10x in 2 decades) !5 Source: Peter Bauer, ECMWF
  • 15. T. Schulthess| Floating points efficiency dropped from 50% on Cray Y-MP to 5% on today’s Cray XC (10x in 2 decades) !5 Source: Peter Bauer, ECMWF KKR-CPA (MST) LSMS (MST) WL-LSMS (MST)
  • 16. T. Schulthess| Floating points efficiency dropped from 50% on Cray Y-MP to 5% on today’s Cray XC (10x in 2 decades) !5 Source: Peter Bauer, ECMWF Cray Y-MP @ 300kW KKR-CPA (MST) LSMS (MST) WL-LSMS (MST)
  • 17. T. Schulthess| Floating points efficiency dropped from 50% on Cray Y-MP to 5% on today’s Cray XC (10x in 2 decades) !5 Source: Peter Bauer, ECMWF Cray Y-MP @ 300kW Cray XT5 @ 7MW KKR-CPA (MST) LSMS (MST) WL-LSMS (MST)
  • 18. T. Schulthess| Floating points efficiency dropped from 50% on Cray Y-MP to 5% on today’s Cray XC (10x in 2 decades) !5 Source: Peter Bauer, ECMWF Cray Y-MP @ 300kW Cray XT5 @ 7MW KKR-CPA (MST) LSMS (MST) WL-LSMS (MST) IBM P5 @ 400 kW
  • 19. T. Schulthess| Floating points efficiency dropped from 50% on Cray Y-MP to 5% on today’s Cray XC (10x in 2 decades) !5 Source: Peter Bauer, ECMWF Cray Y-MP @ 300kW Cray XT5 @ 7MW KKR-CPA (MST) LSMS (MST) WL-LSMS (MST) IBM P5 @ 400 kW IBM P6 @ 1.3 MW
  • 20. T. Schulthess| Floating points efficiency dropped from 50% on Cray Y-MP to 5% on today’s Cray XC (10x in 2 decades) !5 Source: Peter Bauer, ECMWF Cray Y-MP @ 300kW Cray XT5 @ 7MW Cray XT5 @ 1.8 MW KKR-CPA (MST) LSMS (MST) WL-LSMS (MST) IBM P5 @ 400 kW IBM P6 @ 1.3 MW
  • 21. T. Schulthess| Floating points efficiency dropped from 50% on Cray Y-MP to 5% on today’s Cray XC (10x in 2 decades) !5 Source: Peter Bauer, ECMWF Cray Y-MP @ 300kW Cray XT5 @ 7MW Cray XT5 @ 1.8 MW System size (in energy footprint) grew 
 much faster on “Top500” systems KKR-CPA (MST) LSMS (MST) WL-LSMS (MST) IBM P5 @ 400 kW IBM P6 @ 1.3 MW
  • 22. T. Schulthess| !6 Source: Christoph Schär, ETH Zurich, & Nils Wedi, ECMWF
  • 23. T. Schulthess| !6 Source: Christoph Schär, ETH Zurich, & Nils Wedi, ECMWF
  • 24. T. Schulthess| !6 Source: Christoph Schär, ETH Zurich, & Nils Wedi, ECMWF Can the delivery of a 1km-scale capability be pulled in by a decade?
  • 25. T. Schulthess| Leadership in weather and climate !7 Peter Bauer, ECMWF
  • 26. T. Schulthess| Leadership in weather and climate !7 European model may be the best – but far away from sufficient accuracy and reliability! Peter Bauer, ECMWF
  • 27. T. Schulthess| The impact of resolution: simulated tropical cyclones !8 130 km 60 km 25 km Observations HADGEM3 PRACE UPSCALE, P.L. Vidale (NCAS) and M. Roberts (MO/HC)
  • 28. T. Schulthess| Resolving convective clouds (convergence?) !9 Source: Christoph Schär, ETH Zurich Bulk convergence Structural convergence Area-averaged bulk effects upon ambient flow: E.g., heating and moistening of cloud layer Statistics of cloud ensemble: E.g., spacing and size of convective clouds
  • 29. T. Schulthess| Structural and bulk convergence !10 relativefrequency 10 −6 10 −5 10 −4 10 −3 10 −2 10 −1 10 0 cloud area [km 2 ] 10 −2 10 0 10 2 10 4 71 64 54 47 43 grid-scale clouds [%] convective mass flux [kg m−2 s−1 ] −10 −5 0 5 10 15 relativefrequency 10 −6 10 −5 10 −4 10 −3 10 −2 10 −1 10 0 10 −7 8 km 4 km 2 km 1 km 500 m Source: Christoph Schär, ETH Zurich Statistics of cloud area Statistics of up- & downdrafts No structural convergence Bulk statistics of updrafts converges Factor 4 (Panosetti et al. 2018)
  • 30. T. Schulthess| !11 What resolution is needed? • There are threshold scales in the atmosphere and ocean: going from 100 km to 10 km is incremental, 10 km to 1 km is a leap. At 1km • it is no longer necessary to parametrise precipitating convection, ocean eddies, or orographic wave drag and its effect on extratropical storms; • ocean bathymetry, overflows and mixing, as well as regional orographic circulation in the atmosphere become resolved; • the connection between the remaining parametrisation are now on a physical footing. • We spend the last five decades in a paradigm of incremental advances. Here we incrementally improved the resolution of models from 200 to 20km • Exascale allows us to make the leap to 1 km. This fundamentally changes the structure of our models. We move from crude parametric presentations to an explicit, physics based, description of essential processes. • The last such step change was fifty years ago. This was when, in the late 1960s, climate scientists first introduced global climate models, which were distinguished by their ability to explicitly represent extra-tropical storms, ocean gyres and boundary current. Bjorn Stevens, MPI-M
  • 31. T. Schulthess| Simulation throughput: Simulate Years Per Day (SPYD) !12 NWP Climate in production Climate spinup Simulation 10 d 100 y 5’000 y Desired wall clock time 0.1 d 0.1 y 0.5 y ratio 100 1'000 10'000 SYPD 0.27 2.7 27
  • 32. T. Schulthess| Simulation throughput: Simulate Years Per Day (SPYD) !12 NWP Climate in production Climate spinup Simulation 10 d 100 y 5’000 y Desired wall clock time 0.1 d 0.1 y 0.5 y ratio 100 1'000 10'000 SYPD 0.27 2.7 27 Minimal throughout 1 SYPD, preferred 5 SYPD
  • 33. T. Schulthess| Summary of intermediate goal (reach by 2021?) !13 Horizontal resolution 1 km (globally quasi-uniform) Vertical resolution 180 levels (surface to ~100 km) Time resolution Less than 1 minute Coupled Land-surface/ocean/ocean-waves/sea-ice Atmosphere Non-hydrostatic Precision Single (32bit) or mixed precision Compute rate 1 SYPD (simulated year wall-clock day)
  • 34. T. Schulthess| Running COSMO 5.0 at global scale on Piz Daint !14 Scaling to full system size: ~5300 GPU accelerate nodes available Running a near-global (±80º covering 97% of Earths surface) COSMO 5.0 simulation & IFS > Either on the hosts processors: Intel Xeon E5 2690v3 (Haswell 12c). > Or on the GPU accelerator: PCIe version ofNVIDIA GP100 (Pascal) GPU
  • 35. T. Schulthess| !15 September 15, 2015 Today’s Outlook: GPU-accelerated Weather Forecasting John Russell “Piz Kesch”
  • 36-46. T. Schulthess| Where the factor 40 improvement came from !16
  Requirements from MeteoSwiss, at a constant budget for investments and operations: grid refinement 2.2 km → 1.1 km, ensembles with multiple forecasts, and data assimilation (chart markers: 24x, 10x, 6x). We needed a 40x improvement between 2012 and 2015 at constant cost.
  Investment in software allowed mathematical improvements and a change in architecture:
  1.7x from software refactoring (old vs. new implementation on x86)
  2.8x from mathematical improvements (resource utilisation, precision)
  2.3x from the change in architecture (CPU → GPU)
  2.8x from Moore's Law & architectural improvements on x86
  1.3x from additional processors
  There is no silver bullet! Bonus: reduction in power!
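The five contributions on the slide multiply out to just about the required factor; a quick check in Python, with the factors as quoted on the slide:

    from math import prod

    factors = {
        "software refactoring (x86 old vs. new)": 1.7,
        "mathematical improvements":              2.8,
        "architecture change (CPU -> GPU)":       2.3,
        "Moore's Law & x86 improvements":         2.8,
        "additional processors":                  1.3,
    }
    print(f"combined speedup ~ {prod(factors.values()):.1f}x")  # ~39.9x, i.e. the required 40x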
  • 47-49. T. Schulthess| Near-global climate simulation at 1 km resolution: establishing a performance baseline on 4,888 GPUs with COSMO 5.0 !17
  Metric: simulated years per wall-clock day (SYPD).
  [Figure from the paper: (a) weak scalability and (b) strong scalability on the hybrid Piz Daint system for Δx = 19 km, 3.7 km, 1.9 km and 930 m; filled symbols on P100 GPUs, empty symbols on Haswell CPUs using 12 MPI ranks per node. (c) Time compression (SYPD) and energy cost (MWh/SY) for three moist simulations:
    Δx = 930 m: 4,888 nodes, Δt = 6 s, 0.043 SYPD, 596 MWh/SY, 3.46×10^10 grid points (from a full 10-day simulation)
    Δx = 1.9 km: 4,888 nodes, Δt = 12 s, 0.23 SYPD, 97.8 MWh/SY, 8.64×10^9 grid points (from 1,000 steps)
    Δx = 47 km: 18 nodes, Δt = 300 s, 9.6 SYPD, 0.099 MWh/SY, 1.39×10^7 grid points (from 100 steps)]
  2.5x faster than Yang et al.'s 2016 Gordon Bell Prize-winning run on TaihuLight!
  Fuhrer et al., Geosci. Model Dev. Discuss., https://doi.org/10.5194/gmd-2017-230, published 2018
  • 50. T. Schulthess| The baseline for COSMO-global and IFS !18
  Near-global COSMO [Fuh2018] vs. global IFS [Wed2009], given as value / shortfall:
  Horizontal resolution: COSMO 0.93 km (non-uniform) / 0.9x; IFS 1.25 km / 1.25x
  Vertical resolution: COSMO 60 levels (surface to 25 km) / 3x; IFS 62 levels (surface to 40 km) / 3x
  Time resolution: COSMO 6 s (split-explicit with sub-stepping) / -; IFS 120 s (semi-implicit) / 4x
  Coupled: COSMO no / 1.2x; IFS no / 1.2x
  Atmosphere: COSMO non-hydrostatic / -; IFS non-hydrostatic / -
  Precision: COSMO double / 0.6x; IFS single / -
  Compute rate: COSMO 0.043 SYPD / 23x; IFS 0.088 SYPD / 11x
  Other (I/O, full physics, …): COSMO limited I/O, only microphysics / 1.5x; IFS full physics, no I/O / -
  Total shortfall: COSMO 65x; IFS 198x
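The per-row shortfall factors multiply to roughly the quoted totals; a quick check in Python, with the factors as listed above (the product of the rounded COSMO factors comes out near 67x, which the slide quotes as 65x):

    from math import prod

    cosmo_shortfalls = {
        "horizontal resolution": 0.9,
        "vertical resolution":   3.0,
        "coupling":              1.2,
        "precision":             0.6,
        "compute rate":          23.0,
        "other (I/O, physics)":  1.5,
    }
    ifs_shortfalls = {
        "horizontal resolution": 1.25,
        "vertical resolution":   3.0,
        "time resolution":       4.0,
        "coupling":              1.2,
        "compute rate":          11.0,
    }
    print(f"COSMO-global total shortfall ~ {prod(cosmo_shortfalls.values()):.0f}x")  # ~67x (quoted as 65x)
    print(f"IFS total shortfall ~ {prod(ifs_shortfalls.values()):.0f}x")             # 198x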
  • 51-60. T. Schulthess| Memory use efficiency !19
  MUE = I/O efficiency · BW efficiency = (Q/D) · (B/B̂)
  where Q is the necessary data transfers, D the actual data transfers, B the achieved memory bandwidth, and B̂ the maximum achievable bandwidth.
  For the COSMO baseline: I/O efficiency Q/D = 0.88, bandwidth efficiency B/B̂ = 0.76, so MUE = 0.67.
  [Figure: measured memory bandwidth vs. data size (0.1 MB to 1,000 MB) on a P100 GPU for simple kernels: COPY a[i] = b[i] (double and float), GPU STREAM (double), 1D AVG with i-stride (float) a[i] = b[i-1] + b[i+1], and a 5-POINT stencil (float) a[i] = b[i] + b[i+1] + b[i-1] + b[i+jstride] + b[i-jstride]; the stencil-like access patterns level off around 362 GB/s, about 2x lower than peak bandwidth.]
  Fuhrer et al., Geosci. Model Dev. Discuss., https://doi.org/10.5194/gmd-2017-230, published 2018
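A minimal sketch of the MUE arithmetic, using the two efficiencies quoted on the slide; the comments follow the slide's symbol definitions, and the snippet only illustrates the formula, it is not code from the paper:

    # MUE = (Q/D) * (B/B_hat): I/O efficiency times bandwidth efficiency
    io_efficiency = 0.88   # Q/D: necessary vs. actual data transfers
    bw_efficiency = 0.76   # B/B_hat: achieved vs. maximum achievable memory bandwidth
    mue = io_efficiency * bw_efficiency
    print(f"MUE = {mue:.2f}")  # 0.67, as on the slide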
  • 61-68. T. Schulthess| How realistic is it to overcome the 65-fold shortfall of a grid-based implementation like COSMO-global? !20
  1. Icosahedral grid (ICON) vs. lat-long/Cartesian grid (COSMO): 2x fewer grid columns and a time step of 10 ms instead of 5 ms: 4x
  2. Improving bandwidth efficiency: improve BW efficiency and peak BW: 2x (results on Volta show this is realistic)
  3. Weak scaling: 4x is possible in COSMO, but we reduced the available parallelism by a factor of 2: 2x
  4. Remaining reduction in shortfall: 4x, from numerical algorithms (larger time steps) and further improved processors / memory
  But we don't want to increase the footprint of the 2021 system beyond "Piz Daint".
  (The memory-bandwidth and strong-scaling figures from the previous slides are repeated here.)
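As with the factor 40 at MeteoSwiss, the four contributions listed above multiply out to roughly the required factor; a quick check, with the factors taken from the slide:

    from math import prod

    contributions = {
        "ICON grid + larger time step":             4,
        "bandwidth efficiency and peak bandwidth":  2,
        "weak scaling (net)":                       2,
        "algorithms + improved processors/memory":  4,
    }
    print(f"combined ~ {prod(contributions.values())}x")  # 64x, vs. the 65x shortfall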
  • 69-70. T. Schulthess| The importance of ensembles !21 Peter Bauer, ECMWF
  • 71. T. Schulthess| Remaining goals beyond 2021 (by 2024?) !22 1. Improve the throughput to 5 SYPD 2. Reduce the footprint of a single simulation by up to a factor of 10
  • 72. T. Schulthess| Data challenge: what if all models run at 1 km scale? !23
  With the current workflow used by the climate community, an IPCC exercise at 1 km scale would on average produce 50 exabytes of data!
  The only way out would be to analyse the model while the simulation is running.
  New workflow:
  1. A first set of runs generates the model trajectories and checkpoints.
  2. Model trajectories are reconstructed from the checkpoints and analysed (as often as necessary); see the sketch below.
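A minimal sketch of that two-phase workflow, using a toy deterministic scalar "model" in place of a real climate model; everything here (the function names, the checkpointing interval, the analysis) is hypothetical and only illustrates the generate-checkpoints / replay-and-analyse idea, not an actual model or I/O interface:

    def step(state):
        # Stand-in for one deterministic model time step; determinism makes replay exact.
        return 0.999 * state + 0.01

    def run_segment(start, nsteps):
        state = start
        for _ in range(nsteps):
            state = step(state)
            yield state

    def phase1_checkpoints(initial, nsegments, steps_per_segment):
        # Phase 1: run the full simulation, but store only sparse checkpoints.
        checkpoints = [initial]
        state = initial
        for _ in range(nsegments):
            for state in run_segment(state, steps_per_segment):
                pass                      # intermediate states are discarded, not written out
            checkpoints.append(state)     # keep only the segment-end state
        return checkpoints

    def phase2_replay_and_analyse(checkpoints, steps_per_segment, analyse):
        # Phase 2: replay any segment from its checkpoint and analyse it in memory,
        # as often as necessary, so the full trajectory never has to be stored.
        for start in checkpoints[:-1]:
            for state in run_segment(start, steps_per_segment):
                analyse(state)

    checkpoints = phase1_checkpoints(initial=1.0, nsegments=5, steps_per_segment=100)
    seen = []
    phase2_replay_and_analyse(checkpoints, 100, seen.append)
    print(f"{len(checkpoints)} checkpoints stored, {len(seen)} states re-generated and analysed")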
  • 73-77. T. Schulthess| !24 Summary and Conclusions
  • While flop/s may be good for comparing performance across the history of computing, it is not a good metric for designing the systems of the future
  • Given today's challenges, use memory use efficiency (MUE) instead: MUE = "I/O efficiency" × "bandwidth efficiency"
  • Convection-resolving weather and climate simulations at 1 km horizontal resolution
    • Represent a big leap in the quality of weather and climate simulations
    • The aim is to pull in the milestone by a decade, to the early 2020s rather than the 2030s
    • Desired throughput of 1 SYPD by 2021 and 5 SYPD by the mid 2020s
    • A 50-200x performance improvement is needed to reach the 2021 goal of 1 SYPD
  • System improvements needed
    • Improve "bandwidth efficiency" for regular but non-stride-1 memory access
  • Improve application software
    • Further improve scalability
    • Improve algorithms (time integration has potential)
  • 78. T. Schulthess| !25 Collaborators: Tim Palmer (U. of Oxford), Christoph Schär (ETH Zurich), Oliver Fuhrer (MeteoSwiss), Peter Bauer (ECMWF), Bjorn Stevens (MPI-M), Nils Wedi (ECMWF), Torsten Hoefler (ETH Zurich)