Energy Efficient Computing using Dynamic Tuning

inside-BigData.com
inside-BigData.comPresident of insideHPC Media um inside-BigData.com
Dynamic Tuning of HPC Applications
Overview
Methodology
• Motivation and Introduction
• READEX project overview
• What can you achieve with static and dynamic tuning
• Tuning of the hardware parameters
• Effect on the energy consumption
• Effect of hardware parameter tuning on kernels with various arithmetic intensity
• Evaluation of complex HPC applications
• BEM4I
• OpenFOAM
• Scalability tests with ESPRESO
• Tuning of the application parameters
Motivation and Introduction
READEX Project & Motivation
• Energy efficiency is critical to current and future systems
• Applications exhibit dynamic behavior
• Changing resource requirements
• Computational characteristics
• Changing load on processors over time
Goal was to create a tools-aided methodology for automatic tuning of parallel applications.
Dynamically adjust system parameters to actual resource requirements
What is dynamic tuning
FREQ=2 GHz
Phase region
Significant region
Significant region
FREQ=1.5 GHz
READEX Tool Suite
1. Instrument application
• Score-P provides different kinds of instrumentation
2. Detect dynamism
• Check whether runtime situations could benefit
from tuning
3. Detect energy saving potential and
configurations (DTA)
• Use tuning plugin and power measurement
infrastructure to search for optimal configuration
• Create tuning model
4. Runtime application tuning (RAT)
• Apply tuning model, use optimal configuration
Periscope Tuning
Framework
READEX
Tuning Plugin
Application
Tuning Model
Score-P
READEX Runtime
LibraryOnline
Access
Interface
Substrate
Plugin
Interface
Parameter
Control Plugin
Energy
Measurements
(HDEEM)
READEX Tool Suite
READEX Test Suite
Consists of benchmarks, proxy apps and complex production
applications
Key features:
• Full set of scripts allows reproducibility of experiments on
• TUD Taurus HSW (HDEEM) and BDW partitions
• IT4I Salomon machine (RAPL)
• Support for Slurm and PBS schedulers
• Automatic savings evaluation
• Performs evaluation of
• hardware and system parameter tuning
• application parameter tuning
• Contains manual instrumentation of significant regions
• using header file à can be adopted to test other tools
Application
type
Application
name
benchmarks or
proxy apps
AMG2013
Blasbench
Kripke
Lulesh
NPB3.3
production
applications
BEM4I
ESPRESO
INDEED
OpenFOAM
What can you expect from static tuning
MANUAL STATIC TUNING
12.6%
PROPOSAL
4.3%
17.6% Test Suite MAX
Test Suite MIN
Test Suite AVG
Software
Static tuning
savings
AMG2013 12.5 %
Blasbench 7.4 %
Kripke 11.5 %
Lulesh 17.6 %
NPB3.3 11.0 %
BEM4I 15.7 %
INDEED 17.6 %
ESPRESO 4.3 %
OpenFOAM 15.9 %
Average 12.6 %
What can you expect from dynamic tuning
Test Suite MAX
Test Suite MIN
Test Suite AVG
proposal goal: up to 30%
Test Suite MAX
MANUAL DYNAMIC TUNING
34.1%
PROPOSAL
Test Suite MIN 8.2%
Test Suite AVG 17.%
Software
Dynamic tuning
savings
AMG2013 12.5 %
Blasbench 15.3 %
Kripke 18.5 %
Lulesh 18.7 %
NPB3.3 11.0%
BEM4I 34.1 %
INDEED 19.5 %
ESPRESO 8.2 %
OpenFOAM 20.1%
Average 17.5 %
Energy savings achieved by static and dynamic tuning
Application
(default is Intel compiler)
(* uses GCC compiler)
HW parameters
Static tuning saving
node energy / time
Dynamic tuning
savings
node energy/time
READEX tunin
savings
node energy/ti
AMG2013 CF, UCF, threads 12.5% / −0.9% N/A 7.0% / −14.0%
Blasbench CF, UCF, threads 7.4% / −0.9% 15.3% / −18.1% 9.9% / −9.2%
Kripke CF, UCF 11.5% / −28.3% 18.8% / − 18.7% 10.5% / −28.9
Lulesh CF, UCF, threads 17.6% / −8.9% 18.7% / −11.7% 18.2% / −25.7
NPB3.3-BT-MZ CF, UCF, threads 11% / −11.3% N/A 10.8% / −12%
BEM4I CF, UCF, threads 15.7% / −6.2% 34.1% / 10.9% 34.0% / 10.9%
INDEED CF, UCF, threads 17.6% / −12.8% 19.5% / −14.2% 19.1% / −17.3
ESPRESO CF, UCF, threads 4.3% / −8.9% 8.2% / −10.1% 7.1% / −12.3%
OpenFOAM CF, UCF 15.9% / −10.5% 20.1% / 11.5% 9.8% / −9.8%
Evaluation of READEX Tool Suite on TUD Taurus Haswell system with HDEEM energy measurements
Key findings:
• Best savings achieved with BEM4I application – up to 34% for energy and 11% for runtime
• In general energy savings are ”paid” by extra runtime
Tuning of the hardware parameters
Hardware parameter tuning
Investigation of impact of CPU uncore frequency tuning on memory bound code:
• Optimal frequency, with low energy consumption, and a small performance impact
Evaluation using STREAM Copy benchmark
Results by TU Dresden under READEX project
Hardware parameter tuning
Effect of changing core frequencies on uncore performance using memory bound code
• Just a small impact on the Bandwidth and Energy
Evaluation using STREAM Copy benchmark
Results by TU Dresden under READEX project
Heatmap of the energy consumption of a stream benchmark for different core and uncore frequencies.
The data array does not fit in the processor’s L3 processor cache
Hardware parameter tuning
L3 Cache Energy efficiency and Bandwidth:
• Different optimal uncore frequencies
Evaluation using STREAM Copy benchmark
Results by TU Dresden under READEX project
Heatmap of the energy consumption of a stream benchmark for different core and uncore frequencies.
The data array does fit in the processor’s L3 processor cache
Effect of hardware parameter tuning on kernels
with various arithmetic intensity
Arithmetic intensity
Static tuning for various arithmetic intensity
Ratio from 1:9
Static tuning for various arithmetic intensity
Ratio from 2:8
Static tuning for various arithmetic intensity
Ratio from 3:7
Static tuning for various arithmetic intensity
Ratio from 4:6
Static tuning for various arithmetic intensity
Ratio from 5:5
Static tuning for various arithmetic intensity
Ratio from 6:4
Static tuning for various arithmetic intensity
Ratio from 7:3
Static tuning for various arithmetic intensity
Ratio from 8:2
Static tuning for various arithmetic intensity
Ratio from 9:1
Hardware parameter tuning
Behavior of the simple application with two kernels
• Low computational intensity – DGEMV
• High computational intensity – DGEMM
• Tuning of three parameters
• Core frequency
• Uncore frequency
• Number of OpenMP threads
• Visualized by RADAR
....
Low CI (DGEMV) High CI (DGEMM)
10 threads
2.2 GHz UCF
1.2 GHz CF
12 threads
1.2 GHz UCF
2.5 GHz CF
Static tuning for both kernels
12 threads
2.2 GHz UCF
2.4 GHz CF
Computenodeenergyconsumption[J]
CPU core frequency [GHz] CPU core frequency [GHz] CPU core frequency [GHz]
Computenodeenergyconsumption[J]
Computenodeenergyconsumption[J]
Note: runtime of both kernels was equal for default settings
Two kernels with
1:1 workload ratio
Energy
consumption
Energy
savings
Default settings 2017J - -
Static optimal 1833J 179J 9%
Dynamic optimal 1612J 221J 12%
Total savings - 400J 20%
Core and uncore frequency tuning under power cap
Experiments description and testbed parameters
Testbed: Broadwell partition of the Galileo
supercomputer in CINECA
• dual socket server
• two 18-core Intel Xeon E5-2697v4 processor
• 2.3 GHz nominal frequency.
• 2.7 GHz turbo frequency when all 18 cores are utilized
• 145W TDP
Key tunable parameters of the 18-core Intel Xeon E5-
2697v4 processor and their respective ranges and steps.
A set of experiments performed on Intel Broadwell Architecture
Tuning of COMPUTE bound workload
• behavior of the platform when running memory bound workload
• under 145 W (TDP level, no power cap)
• three different power cap levels 100 W, 80 W and 60 W.
3,268s 3,268s
3,903s
3,903s
7,409s
3,577s
7,693s
4,379s
3,653s
363,4J
311,8J 311,8J
285,4J
304,2J
271,6J
290,0J
0
50
100
150
200
250
300
350
400
2,5
4,5
6,5
8,5
10,5
12,5
1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0
Energyconsumption[J]
Runtime[s]
Frequency [GHz]
Tuningof computebound region under 80W power cap
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - min UCF
EXP7 - DVFS & UCF - max UCF
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - min UCF
EXP7 - DVFS & UCF - max UCF
8.5% energy savings
8.4% time savings
12.9% energy savings
10.9% time extension
3,268s
3,450s 3,450s
7,411s
3,293s
4,378s
7,698s
363,4J
344,4J 344,4J
300,4J
293,0J
305,4J
271,0J
297J
0
50
100
150
200
250
300
350
400
2,5
4,5
6,5
8,5
10,5
12,5
1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0
Energyconsumption[J]
Runtime[s]
Frequency [GHz]
Tuningof computebound region under 100W power cap
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - min UCF
EXP7 - DVFS & UCF - max UCF
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - min UCF
EXP7 - DVFS & UCF - max UCF
14.9% energy savings
4.5% time savings
21.3% energy savings
21.1% time extension
3,268s
4,944s 4,944s
7,410s
4,849s4,477s
7,692s
4,565s 4,606s
363,4J
296J
295,0J
268,0J
303,0J
270,4J
0
50
100
150
200
250
300
350
400
2,5
4,5
6,5
8,5
10,5
12,5
1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0
Energyconsumption[J]
Runtime[s]
Frequency [GHz]
Tuningof computebound region under 60W power cap
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - max UCF
EXP7 - DVFS & UCF - min UCF
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - max UCF
EXP7 - DVFS & UCF - min UCF
9.1% energy savings
9.4% time savings
3,268s
3,450s 3,450s
7,411s
3,293s
4,378s
7,698s
363,4J
344,4J
344,4J
300,4J
293,0J
305,4J
271,0J
297J
0
50
100
150
200
250
300
350
400
2,5
4,5
6,5
8,5
10,5
12,5
1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0
Energyconsumption[J]
Runtime[s]
Frequency [GHz]
Tuning of compute bound region under 100W power cap
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - min UCF
EXP7 - DVFS & UCF - max UCF
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - min UCF
EXP7 - DVFS & UCF - max UCF
14.9% energy savings
4.5% time savings
21.3% energy savings
21.1% time extension
Tuning of COMPUTE bound workload under 100W power cap
3,268s
3,268s
3,903s 3,903s
7,409s
3,577s
7,693s
4,379s
3,653s
363,4J
311,8J 311,8J
285,4J
304,2J
271,6J
290,0J
0
50
100
150
200
250
300
350
400
2,5
4,5
6,5
8,5
10,5
12,5
1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0
Energyconsumption[J]
Runtime[s]
Frequency [GHz]
Tuning of compute bound region under 80W power cap
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - min UCF
EXP7 - DVFS & UCF - max UCF
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - min UCF
EXP7 - DVFS & UCF - max UCF
8.5% energy savings
8.4% time savings
12.9% energy savings
10.9% time extension
Tuning of COMPUTE bound workload under 80W power cap
3,268s
4,944s 4,944s
7,410s
4,849s
4,477s
7,692s
4,565s 4,606s
363,4J
296J
295,0J
268,0J
303,0J
270,4J
0
50
100
150
200
250
300
350
400
2,5
4,5
6,5
8,5
10,5
12,5
1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0
Energyconsumption[J]
Runtime[s]
Frequency [GHz]
Tuning of compute bound region under 60W power cap
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - max UCF
EXP7 - DVFS & UCF - min UCF
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - max UCF
EXP7 - DVFS & UCF - min UCF
9.1% energy savings
9.4% time savings
Tuning of COMPUTE bound workload under 60W power cap
Observations for COMPUTE bound workload
• To achieve the best possible performance
• the uncore frequency must be reduces to minimum
• 9.4 % performance gain up to and
• 14.9 % lower energy consumption
• If further energy savings are required – use DVFS and lower the core freq.
• up to 21 % of energy savings
• up to 21 % penalty in runtime
• this effect is more visible for higher powercap levels
Tuning of memory bound workload
• behavior of the platform when running memory bound workload
• under 145 W (TDP level, no power cap)
• three different power cap levels 100 W, 80 W and 60 W.
1,886s
1,959s
1,886s
197,6J
188,2J
188,2J
148,6J
115,2J
170J
145,6J
0
50
100
150
200
250
1,8
2,3
2,8
3,3
3,8
4,3
4,8
5,3
5,8
1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0
Energyconsumption[J]
Runtime[s]
Frequency [GHz]
Tuningofmemorybound region under 100W power cap
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - UCF = 2.2GHz
EXP7 - DVFS & UCF - max UCF
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - UCF = 2.2GHz
EXP7 - DVFS & UCF - max UCF
38.7% energy savings
3.6% time extension
1,886s
1,920s
1,890s
1,959s
197,6J
153,2J
114,4J
146,0J
0
50
100
150
200
250
1,8
2,3
2,8
3,3
3,8
4,3
4,8
5,3
5,8
1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0
Energyconsumption[J]
Runtime[s]
Frequency [GHz]
Tuningofmemorybound region under 80W power cap
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - UCF = 2.2GHz
EXP7 - DVFS & UCF - max UCF
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - UCF = 2.2GHz
EXP7 - DVFS & UCF - max UCF
21.6% energy savings
3.6% time extension
1,886s
2,475s 2,475s
1,945s
2,397s
1,925s
197,6J
147,8J
116,2J
115,0J
0
50
100
150
200
250
1,8
2,3
2,8
3,3
3,8
4,3
4,8
5,3
5,8
1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0
Energyconsumption[J]
Runtime[s]
Frequency [GHz]
Tuningofmemorybound region under 60W power cap
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - UCF = 2.2GHz
EXP7 - DVFS & UCF - max UCF
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - UCF = 2.2GHz
EXP7 - DVFS & UCF - max UCF
22.2% energy savings
22.2% time savings
1,886s
1,959s
1,886s
197,6J
188,2J
188,2J
148,6J
115,2J
170J
145,6J
0
50
100
150
200
250
1,8
2,3
2,8
3,3
3,8
4,3
4,8
5,3
5,8
1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0
Energyconsumption[J]
Runtime[s]
Frequency [GHz]
Tuning of memory bound region under 100W power cap
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - UCF = 2.2GHz
EXP7 - DVFS & UCF - max UCF
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - UCF = 2.2GHz
EXP7 - DVFS & UCF - max UCF
38.7% energy savings
3.6% time extension
Tuning of memory bound workload under 100W power cap
1,886s
1,920s
1,890s
1,959s
197,6J
153,2J
114,4J
146,0J
0
50
100
150
200
250
1,8
2,3
2,8
3,3
3,8
4,3
4,8
5,3
5,8
1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0
Energyconsumption[J]
Runtime[s]
Frequency [GHz]
Tuning of memory bound region under 80W power cap
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - UCF = 2.2GHz
EXP7 - DVFS & UCF - max UCF
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - UCF = 2.2GHz
EXP7 - DVFS & UCF - max UCF
21.6% energy savings
3.6% time extension
Tuning of memory bound workload under 80W power cap
1,886s
2,475s 2,475s
1,945s
2,397s
1,925s
197,6J
147,8J
116,2J
115,0J
0
50
100
150
200
250
1,8
2,3
2,8
3,3
3,8
4,3
4,8
5,3
5,8
1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0
Energyconsumption[J]
Runtime[s]
Frequency [GHz]
Tuning of memory bound region under 60W power capEXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - UCF = 2.2GHz
EXP7 - DVFS & UCF - max UCF
EXP0 - default
EXP1 - default Pcap
EXP5 - DVFS under Pcap
EXP6 - UCF under Pcap
EXP7 - DVFS & UCF - UCF = 2.2GHz
EXP7 - DVFS & UCF - max UCF
22.2% energy savings
22.2% time savings
Tuning of memory bound workload under 60W power cap
Observations for both workloads
Observations for memory bound workload
• Under the power budget lower that 80 W
• DVFS should be set to minimum value
• boost the performance of the uncore part by 22%.
• Tuning of the uncore frequency
• has low effect on the performance
• but a major effect on energy consumption
• between 21% (60 W and 80W) to 38% (100W)
Observations for compute bound workload
• To achieve the best possible performance
• the uncore frequency must be reduces to minimum
• 9.4 % performance gain up to and
• 14.9 % lower energy consumption
• If further energy savings are required – use DVFS and lower the core freq.
• up to 21 % of energy savings
• up to 21 % penalty in runtime
• this effect is more visible for higher powercap levels
Evaluation of complex HPC applications
BEM4I Application
Application runtime
assemble_k
[s]
assemble_v
[s]
gmres_solve
[s]
print_vtu
[s]
main
[s]
default runtime 5.4 5.9 10.2 5.6 27.3
static tuning runtime 9.8 10.6 6.1 2.4 29.0
dynamic tuning runtime 7.0 7.2 7.9 2.1 24.3
static savings [%] -82.3% -79.1% 40.5% 56.8% -6.2%
dynamic savings [%] -30.6% -20.9% 23.2% 62.9% 10.9%
Hardware: dual socket system with 2x12 CPU cores – ”standard HW” in HPC centres
Region description:
• assemble_k and assemble_v – high utilization of vector units, extreme level of
optimization – fully compute bound great utilization of both sockets and all cores
• gmres_solve – uses DGEMV from MKL – memory bound, suffers on NUMA effect;
this routine is more efficient on single socket
• print_vtu – single threaded I/O and network bound region why stores data to a
file on LUSTRE system
”static": {
"FREQUENCY": ”25", <--------- 2.5 GHz
"NUM_THREADS": ”12", <--------- 12 OpenMP threads
"UNCORE_FREQUENCY": ”22” } <--------- 2.2 GHz
"assemble_k": {
"FREQUENCY": "23",
"NUM_THREADS": "24",
"UNCORE_FREQUENCY": ”16”
},
"assemble_v": {
"FREQUENCY": ”25",
"NUM_THREADS": "24",
"UNCORE_FREQUENCY": ”14”
},
"gmres_solve": {
"FREQUENCY": ”17",
"NUM_THREADS": ”8",
"UNCORE_FREQUENCY": ”22”
},
"print_vtu": {
"FREQUENCY": "25",
"NUM_THREADS": ”6",
"UNCORE_FREQUENCY": ”24”
}
Compute node energy
assemble_k
[J]
assemble_v
[J]
gmres_solve
[J]
print_vtu
[J]
main
[J]
default energy 1476 1484 2733 1142 6872
static tuning energy 1962 2015 1366 420 5792
dynamic tuning energy 1467 1462 1259 293 4531
static savings [%] -33.8% -35.8% 50.0% 63.2% 15.7%
dynamic savings [%] 0.6% 1.5% 53.9% 74.3% 34.1%
BEM4I Application
Large energy savings is combination of optimal HW settings and runtime savings
due to mitigation of NUMA effect by optimal settings of OpenMP threading
• Without savings in runtime caused by similar application will
• Energy savings approx. 15 – 20%
• Runtime savings approx. -15%
”static": {
"FREQUENCY": ”25", <--------- 2.5 GHz
"NUM_THREADS": ”12", <--------- 12 OpenMP threads
"UNCORE_FREQUENCY": ”22” } <--------- 2.2 GHz
"assemble_k": {
"FREQUENCY": "23",
"NUM_THREADS": "24",
"UNCORE_FREQUENCY": ”16”
},
"assemble_v": {
"FREQUENCY": ”25",
"NUM_THREADS": "24",
"UNCORE_FREQUENCY": ”14”
},
"gmres_solve": {
"FREQUENCY": ”17",
"NUM_THREADS": ”8",
"UNCORE_FREQUENCY": ”22”
},
"print_vtu": {
"FREQUENCY": "25",
"NUM_THREADS": ”6",
"UNCORE_FREQUENCY": ”24”
}
OpenFOAM Application
OpenFOAM Energy consumption Energy savings
Default settings 14 231J - -
Static tuning 12 264J 2 264J 15.9%
Dynamic tuning
Total savings
• Computational fluid dynamics
• Finite volume + multigrid solver
OpenFOAM Application
OpenFOAM Energy consumption Energy savings
Default settings 14 231J - -
Static tuning 12 264J 2 264J 15.9%
Dynamic tuning 11 370J 597J 4.8%
Total savings 2 861J 20.1%
• Computational fluid dynamics
• Finite volume + multigrid solver
ESPRESO Application
33% of energy savings
22% of time savings and improved strong scalability
• Structural mechanics code
• Finite element + sparse FETI solver
• Different tuning models for different # of nodes is needed for strong scalability –
workload per node is varies
• Includes dynamic switching overheads
Energy savings analysis for the strong scalability test of the
ESPRESO library when running the cube benchmark
Application parameters tuning
Application parameters tuning of the ESPRESO
50% - 66% against ”reasonable” settings
86% against the worst case
0
50
100
150
200
250
300
0 500 1000 1500 2000 2500 3000 3500
Energyconsumption[kJ]
Configuration index
the “reasonable” settings
the optimal settings
9 parameters
3840 combinations
• FETI METHOD 2x
• PRECONDITIONER 5x
• ITERATIVE SOLVER TYPE 2x
• HFETI type 2x
• NON-UNIFORM PARTS 6x
• REDUNDANT LAGRANGE 2x
• SCALING 2x
• B0_TYPE 2x
• ADAPTIVE PRECISION 2x
Application parameters tuning
Application parameter tuning parameters is very promising
• application configuration parameters are given in the input file
• each setting requires an individual start of the application
• tool performs automatic search of application parameter space
Application
number of parameters tested /
total number of options
Energy savings
compared
to the worst settings
Energy savings compared to
default or reasonable settings
ESPRESO 9 / 3840 86% 50 – 66%
ELMER 1 / 40 97% 50 – 75%
OpenFOAM 2 / 12 24% 8%
INDEED 3 / 12 35% 25%
Thank you
1 von 48

Recomendados

Overview of HPC Interconnects von
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnectsinside-BigData.com
354 views35 Folien
Hardware & Software Platforms for HPC, AI and ML von
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
1.3K views49 Folien
NNSA Explorations: ARM for Supercomputing von
NNSA Explorations: ARM for SupercomputingNNSA Explorations: ARM for Supercomputing
NNSA Explorations: ARM for Supercomputinginside-BigData.com
924 views38 Folien
CUDA-Python and RAPIDS for blazing fast scientific computing von
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
1K views82 Folien
Preparing to program Aurora at Exascale - Early experiences and future direct... von
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
1.4K views45 Folien
SCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial von
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialSCFE 2020 OpenCAPI presentation as part of OpenPWOER Tutorial
SCFE 2020 OpenCAPI presentation as part of OpenPWOER TutorialGanesan Narayanasamy
476 views29 Folien

Más contenido relacionado

Was ist angesagt?

State of ARM-based HPC von
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPCinside-BigData.com
767 views18 Folien
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ... von
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...Ganesan Narayanasamy
548 views47 Folien
IBM HPC Transformation with AI von
IBM HPC Transformation with AI IBM HPC Transformation with AI
IBM HPC Transformation with AI Ganesan Narayanasamy
540 views27 Folien
TAU E4S ON OpenPOWER /POWER9 platform von
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformGanesan Narayanasamy
347 views81 Folien
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod... von
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
1K views47 Folien
DPDK Summit - 08 Sept 2014 - Ericsson - A Multi-Socket Ferrari for NFV von
DPDK Summit - 08 Sept 2014 - Ericsson - A Multi-Socket Ferrari for NFVDPDK Summit - 08 Sept 2014 - Ericsson - A Multi-Socket Ferrari for NFV
DPDK Summit - 08 Sept 2014 - Ericsson - A Multi-Socket Ferrari for NFVJim St. Leger
1.9K views33 Folien

Was ist angesagt?(20)

OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ... von Ganesan Narayanasamy
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
OpenCAPI-based Image Analysis Pipeline for 18 GB/s kilohertz-framerate X-ray ...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod... von inside-BigData.com
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
DPDK Summit - 08 Sept 2014 - Ericsson - A Multi-Socket Ferrari for NFV von Jim St. Leger
DPDK Summit - 08 Sept 2014 - Ericsson - A Multi-Socket Ferrari for NFVDPDK Summit - 08 Sept 2014 - Ericsson - A Multi-Socket Ferrari for NFV
DPDK Summit - 08 Sept 2014 - Ericsson - A Multi-Socket Ferrari for NFV
Jim St. Leger1.9K views
Trends in Systems and How to Get Efficient Performance von inside-BigData.com
Trends in Systems and How to Get Efficient PerformanceTrends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient Performance
inside-BigData.com725 views
FD.io Vector Packet Processing (VPP) von Kirill Tsym
FD.io Vector Packet Processing (VPP)FD.io Vector Packet Processing (VPP)
FD.io Vector Packet Processing (VPP)
Kirill Tsym8.2K views
Mellanox Announces HDR 200 Gb/s InfiniBand Solutions von inside-BigData.com
Mellanox Announces HDR 200 Gb/s InfiniBand SolutionsMellanox Announces HDR 200 Gb/s InfiniBand Solutions
Mellanox Announces HDR 200 Gb/s InfiniBand Solutions
inside-BigData.com2.1K views
High-Performance and Scalable Designs of Programming Models for Exascale Systems von inside-BigData.com
High-Performance and Scalable Designs of Programming Models for Exascale SystemsHigh-Performance and Scalable Designs of Programming Models for Exascale Systems
High-Performance and Scalable Designs of Programming Models for Exascale Systems
inside-BigData.com2.8K views

Similar a Energy Efficient Computing using Dynamic Tuning

Runtime Methods to Improve Energy Efficiency in HPC Applications von
Runtime Methods to Improve Energy Efficiency in HPC ApplicationsRuntime Methods to Improve Energy Efficiency in HPC Applications
Runtime Methods to Improve Energy Efficiency in HPC ApplicationsFacultad de Informática UCM
257 views36 Folien
E03403027030 von
E03403027030E03403027030
E03403027030theijes
289 views4 Folien
Power Optimization with Efficient Test Logic Partitioning for Full Chip Design von
Power Optimization with Efficient Test Logic Partitioning for Full Chip DesignPower Optimization with Efficient Test Logic Partitioning for Full Chip Design
Power Optimization with Efficient Test Logic Partitioning for Full Chip DesignPankaj Singh
1.2K views19 Folien
Large-Scale Optimization Strategies for Typical HPC Workloads von
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloadsinside-BigData.com
511 views34 Folien
A Brief Survey of Current Power Limiting Strategies von
A Brief Survey of Current Power Limiting StrategiesA Brief Survey of Current Power Limiting Strategies
A Brief Survey of Current Power Limiting StrategiesIRJET Journal
34 views6 Folien
How to achieve 95%+ Accurate power measurement during architecture exploration? von
How to achieve 95%+ Accurate power measurement during architecture exploration? How to achieve 95%+ Accurate power measurement during architecture exploration?
How to achieve 95%+ Accurate power measurement during architecture exploration? Deepak Shankar
27 views37 Folien

Similar a Energy Efficient Computing using Dynamic Tuning(20)

E03403027030 von theijes
E03403027030E03403027030
E03403027030
theijes289 views
Power Optimization with Efficient Test Logic Partitioning for Full Chip Design von Pankaj Singh
Power Optimization with Efficient Test Logic Partitioning for Full Chip DesignPower Optimization with Efficient Test Logic Partitioning for Full Chip Design
Power Optimization with Efficient Test Logic Partitioning for Full Chip Design
Pankaj Singh1.2K views
Large-Scale Optimization Strategies for Typical HPC Workloads von inside-BigData.com
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloads
inside-BigData.com511 views
A Brief Survey of Current Power Limiting Strategies von IRJET Journal
A Brief Survey of Current Power Limiting StrategiesA Brief Survey of Current Power Limiting Strategies
A Brief Survey of Current Power Limiting Strategies
IRJET Journal34 views
How to achieve 95%+ Accurate power measurement during architecture exploration? von Deepak Shankar
How to achieve 95%+ Accurate power measurement during architecture exploration? How to achieve 95%+ Accurate power measurement during architecture exploration?
How to achieve 95%+ Accurate power measurement during architecture exploration?
Deepak Shankar27 views
Energy Efficiency in Large Scale Systems von Jerry Sheehan
Energy Efficiency in Large Scale SystemsEnergy Efficiency in Large Scale Systems
Energy Efficiency in Large Scale Systems
Jerry Sheehan388 views
Optimizing High Performance Computing Applications for Energy von David Lecomber
Optimizing High Performance Computing Applications for EnergyOptimizing High Performance Computing Applications for Energy
Optimizing High Performance Computing Applications for Energy
David Lecomber1.5K views
Quantifying Energy Consumption for Practical Fork-Join Parallelism on an Embe... von Tulipp. Eu
Quantifying Energy Consumption for Practical Fork-Join Parallelism on an Embe...Quantifying Energy Consumption for Practical Fork-Join Parallelism on an Embe...
Quantifying Energy Consumption for Practical Fork-Join Parallelism on an Embe...
Tulipp. Eu226 views
Hardware and Software Co-optimization to Make Sure Oracle Fusion Middleware R... von Intel IT Center
Hardware and Software Co-optimization to Make Sure Oracle Fusion Middleware R...Hardware and Software Co-optimization to Make Sure Oracle Fusion Middleware R...
Hardware and Software Co-optimization to Make Sure Oracle Fusion Middleware R...
Intel IT Center1.4K views
Six-Core AMD Opteron EE Processor von AMD
Six-Core AMD Opteron EE ProcessorSix-Core AMD Opteron EE Processor
Six-Core AMD Opteron EE Processor
AMD1.1K views
customization of a deep learning accelerator, based on NVDLA von Shien-Chun Luo
customization of a deep learning accelerator, based on NVDLAcustomization of a deep learning accelerator, based on NVDLA
customization of a deep learning accelerator, based on NVDLA
Shien-Chun Luo947 views
RT15 Berkeley | Introduction to FPGA Power Electronic & Electric Machine real... von OPAL-RT TECHNOLOGIES
RT15 Berkeley | Introduction to FPGA Power Electronic & Electric Machine real...RT15 Berkeley | Introduction to FPGA Power Electronic & Electric Machine real...
RT15 Berkeley | Introduction to FPGA Power Electronic & Electric Machine real...
참여기관_발표자료-국민대학교 201301 정기회의 von DzH QWuynh
참여기관_발표자료-국민대학교 201301 정기회의참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의
DzH QWuynh224 views
intel speed-select-technology-base-frequency-enhancing-performance von DESMOND YUEN
intel speed-select-technology-base-frequency-enhancing-performanceintel speed-select-technology-base-frequency-enhancing-performance
intel speed-select-technology-base-frequency-enhancing-performance
DESMOND YUEN720 views

Más de inside-BigData.com

Major Market Shifts in IT von
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in ITinside-BigData.com
5K views27 Folien
Transforming Private 5G Networks von
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
1.3K views14 Folien
The Incorporation of Machine Learning into Scientific Simulations at Lawrence... von
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
1.4K views33 Folien
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ... von
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
709 views53 Folien
HPC Impact: EDA Telemetry Neural Networks von
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
310 views18 Folien
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring von
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
460 views70 Folien

Más de inside-BigData.com(20)

The Incorporation of Machine Learning into Scientific Simulations at Lawrence... von inside-BigData.com
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
inside-BigData.com1.4K views
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ... von inside-BigData.com
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
inside-BigData.com709 views
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring von inside-BigData.com
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
inside-BigData.com460 views
Fugaku Supercomputer joins fight against COVID-19 von inside-BigData.com
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
inside-BigData.com2.1K views
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD von inside-BigData.com
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
inside-BigData.com838 views
Versal Premium ACAP for Network and Cloud Acceleration von inside-BigData.com
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
inside-BigData.com586 views
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently von inside-BigData.com
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
inside-BigData.com496 views
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc... von inside-BigData.com
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
Efficient Model Selection for Deep Neural Networks on Massively Parallel Proc...
inside-BigData.com249 views
Scientific Applications and Heterogeneous Architectures von inside-BigData.com
Scientific Applications and Heterogeneous ArchitecturesScientific Applications and Heterogeneous Architectures
Scientific Applications and Heterogeneous Architectures
inside-BigData.com165 views
SW/HW co-design for near-term quantum computing von inside-BigData.com
SW/HW co-design for near-term quantum computingSW/HW co-design for near-term quantum computing
SW/HW co-design for near-term quantum computing
inside-BigData.com222 views

Último

The Role of Patterns in the Era of Large Language Models von
The Role of Patterns in the Era of Large Language ModelsThe Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language ModelsYunyao Li
104 views65 Folien
What is Authentication Active Directory_.pptx von
What is Authentication Active Directory_.pptxWhat is Authentication Active Directory_.pptx
What is Authentication Active Directory_.pptxHeenaMehta35
15 views7 Folien
This talk was not generated with ChatGPT: how AI is changing science von
This talk was not generated with ChatGPT: how AI is changing scienceThis talk was not generated with ChatGPT: how AI is changing science
This talk was not generated with ChatGPT: how AI is changing scienceElena Simperl
34 views13 Folien
Business Analyst Series 2023 - Week 4 Session 8 von
Business Analyst Series 2023 -  Week 4 Session 8Business Analyst Series 2023 -  Week 4 Session 8
Business Analyst Series 2023 - Week 4 Session 8DianaGray10
180 views13 Folien
AI + Memoori = AIM von
AI + Memoori = AIMAI + Memoori = AIM
AI + Memoori = AIMMemoori
15 views9 Folien
Mobile Core Solutions & Successful Cases.pdf von
Mobile Core Solutions & Successful Cases.pdfMobile Core Solutions & Successful Cases.pdf
Mobile Core Solutions & Successful Cases.pdfIPLOOK Networks
16 views7 Folien

Último(20)

The Role of Patterns in the Era of Large Language Models von Yunyao Li
The Role of Patterns in the Era of Large Language ModelsThe Role of Patterns in the Era of Large Language Models
The Role of Patterns in the Era of Large Language Models
Yunyao Li104 views
What is Authentication Active Directory_.pptx von HeenaMehta35
What is Authentication Active Directory_.pptxWhat is Authentication Active Directory_.pptx
What is Authentication Active Directory_.pptx
HeenaMehta3515 views
This talk was not generated with ChatGPT: how AI is changing science von Elena Simperl
This talk was not generated with ChatGPT: how AI is changing scienceThis talk was not generated with ChatGPT: how AI is changing science
This talk was not generated with ChatGPT: how AI is changing science
Elena Simperl34 views
Business Analyst Series 2023 - Week 4 Session 8 von DianaGray10
Business Analyst Series 2023 -  Week 4 Session 8Business Analyst Series 2023 -  Week 4 Session 8
Business Analyst Series 2023 - Week 4 Session 8
DianaGray10180 views
AI + Memoori = AIM von Memoori
AI + Memoori = AIMAI + Memoori = AIM
AI + Memoori = AIM
Memoori15 views
Mobile Core Solutions & Successful Cases.pdf von IPLOOK Networks
Mobile Core Solutions & Successful Cases.pdfMobile Core Solutions & Successful Cases.pdf
Mobile Core Solutions & Successful Cases.pdf
IPLOOK Networks16 views
"Running students' code in isolation. The hard way", Yurii Holiuk von Fwdays
"Running students' code in isolation. The hard way", Yurii Holiuk "Running students' code in isolation. The hard way", Yurii Holiuk
"Running students' code in isolation. The hard way", Yurii Holiuk
Fwdays38 views
Bronack Skills - Risk Management and SRE v1.0 12-3-2023.pdf von ThomasBronack
Bronack Skills - Risk Management and SRE v1.0 12-3-2023.pdfBronack Skills - Risk Management and SRE v1.0 12-3-2023.pdf
Bronack Skills - Risk Management and SRE v1.0 12-3-2023.pdf
ThomasBronack31 views
The Coming AI Tsunami.pptx von johnhandby
The Coming AI Tsunami.pptxThe Coming AI Tsunami.pptx
The Coming AI Tsunami.pptx
johnhandby14 views
Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De... von Moses Kemibaro
Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De...Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De...
Don’t Make A Human Do A Robot’s Job! : 6 Reasons Why AI Will Save Us & Not De...
Moses Kemibaro38 views
Optimizing Communication to Optimize Human Behavior - LCBM von Yaman Kumar
Optimizing Communication to Optimize Human Behavior - LCBMOptimizing Communication to Optimize Human Behavior - LCBM
Optimizing Communication to Optimize Human Behavior - LCBM
Yaman Kumar39 views
Digital Personal Data Protection (DPDP) Practical Approach For CISOs von Priyanka Aash
Digital Personal Data Protection (DPDP) Practical Approach For CISOsDigital Personal Data Protection (DPDP) Practical Approach For CISOs
Digital Personal Data Protection (DPDP) Practical Approach For CISOs
Priyanka Aash171 views
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading... von The Digital Insurer
Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading...
Initiating and Advancing Your Strategic GIS Governance Strategy von Safe Software
Initiating and Advancing Your Strategic GIS Governance StrategyInitiating and Advancing Your Strategic GIS Governance Strategy
Initiating and Advancing Your Strategic GIS Governance Strategy
Safe Software198 views
LLMs in Production: Tooling, Process, and Team Structure von Aggregage
LLMs in Production: Tooling, Process, and Team StructureLLMs in Production: Tooling, Process, and Team Structure
LLMs in Production: Tooling, Process, and Team Structure
Aggregage65 views
Future of AR - Facebook Presentation von Rob McCarty
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook Presentation
Rob McCarty66 views
Measurecamp Brussels - Synthetic data.pdf von Human37
Measurecamp Brussels - Synthetic data.pdfMeasurecamp Brussels - Synthetic data.pdf
Measurecamp Brussels - Synthetic data.pdf
Human37 27 views

Energy Efficient Computing using Dynamic Tuning

  • 1. Dynamic Tuning of HPC Applications
  • 2. Overview Methodology • Motivation and Introduction • READEX project overview • What can you achieve with static and dynamic tuning • Tuning of the hardware parameters • Effect on the energy consumption • Effect of hardware parameter tuning on kernels with various arithmetic intensity • Evaluation of complex HPC applications • BEM4I • OpenFOAM • Scalability tests with ESPRESO • Tuning of the application parameters
  • 4. READEX Project & Motivation • Energy efficiency is critical to current and future systems • Applications exhibit dynamic behavior • Changing resource requirements • Computational characteristics • Changing load on processors over time Goal was to create a tools-aided methodology for automatic tuning of parallel applications. Dynamically adjust system parameters to actual resource requirements
  • 5. What is dynamic tuning FREQ=2 GHz Phase region Significant region Significant region FREQ=1.5 GHz
  • 6. READEX Tool Suite 1. Instrument application • Score-P provides different kinds of instrumentation 2. Detect dynamism • Check whether runtime situations could benefit from tuning 3. Detect energy saving potential and configurations (DTA) • Use tuning plugin and power measurement infrastructure to search for optimal configuration • Create tuning model 4. Runtime application tuning (RAT) • Apply tuning model, use optimal configuration Periscope Tuning Framework READEX Tuning Plugin Application Tuning Model Score-P READEX Runtime LibraryOnline Access Interface Substrate Plugin Interface Parameter Control Plugin Energy Measurements (HDEEM) READEX Tool Suite
  • 7. READEX Test Suite Consists of benchmarks, proxy apps and complex production applications Key features: • Full set of scripts allows reproducibility of experiments on • TUD Taurus HSW (HDEEM) and BDW partitions • IT4I Salomon machine (RAPL) • Support for Slurm and PBS schedulers • Automatic savings evaluation • Performs evaluation of • hardware and system parameter tuning • application parameter tuning • Contains manual instrumentation of significant regions • using header file à can be adopted to test other tools Application type Application name benchmarks or proxy apps AMG2013 Blasbench Kripke Lulesh NPB3.3 production applications BEM4I ESPRESO INDEED OpenFOAM
  • 8. What can you expect from static tuning MANUAL STATIC TUNING 12.6% PROPOSAL 4.3% 17.6% Test Suite MAX Test Suite MIN Test Suite AVG Software Static tuning savings AMG2013 12.5 % Blasbench 7.4 % Kripke 11.5 % Lulesh 17.6 % NPB3.3 11.0 % BEM4I 15.7 % INDEED 17.6 % ESPRESO 4.3 % OpenFOAM 15.9 % Average 12.6 %
  • 9. What can you expect from dynamic tuning Test Suite MAX Test Suite MIN Test Suite AVG proposal goal: up to 30% Test Suite MAX MANUAL DYNAMIC TUNING 34.1% PROPOSAL Test Suite MIN 8.2% Test Suite AVG 17.% Software Dynamic tuning savings AMG2013 12.5 % Blasbench 15.3 % Kripke 18.5 % Lulesh 18.7 % NPB3.3 11.0% BEM4I 34.1 % INDEED 19.5 % ESPRESO 8.2 % OpenFOAM 20.1% Average 17.5 %
  • 10. Energy savings achieved by static and dynamic tuning Application (default is Intel compiler) (* uses GCC compiler) HW parameters Static tuning saving node energy / time Dynamic tuning savings node energy/time READEX tunin savings node energy/ti AMG2013 CF, UCF, threads 12.5% / −0.9% N/A 7.0% / −14.0% Blasbench CF, UCF, threads 7.4% / −0.9% 15.3% / −18.1% 9.9% / −9.2% Kripke CF, UCF 11.5% / −28.3% 18.8% / − 18.7% 10.5% / −28.9 Lulesh CF, UCF, threads 17.6% / −8.9% 18.7% / −11.7% 18.2% / −25.7 NPB3.3-BT-MZ CF, UCF, threads 11% / −11.3% N/A 10.8% / −12% BEM4I CF, UCF, threads 15.7% / −6.2% 34.1% / 10.9% 34.0% / 10.9% INDEED CF, UCF, threads 17.6% / −12.8% 19.5% / −14.2% 19.1% / −17.3 ESPRESO CF, UCF, threads 4.3% / −8.9% 8.2% / −10.1% 7.1% / −12.3% OpenFOAM CF, UCF 15.9% / −10.5% 20.1% / 11.5% 9.8% / −9.8% Evaluation of READEX Tool Suite on TUD Taurus Haswell system with HDEEM energy measurements Key findings: • Best savings achieved with BEM4I application – up to 34% for energy and 11% for runtime • In general energy savings are ”paid” by extra runtime
  • 11. Tuning of the hardware parameters
  • 12. Hardware parameter tuning Investigation of impact of CPU uncore frequency tuning on memory bound code: • Optimal frequency, with low energy consumption, and a small performance impact Evaluation using STREAM Copy benchmark Results by TU Dresden under READEX project
  • 13. Hardware parameter tuning Effect of changing core frequencies on uncore performance using memory bound code • Just a small impact on the Bandwidth and Energy Evaluation using STREAM Copy benchmark Results by TU Dresden under READEX project Heatmap of the energy consumption of a stream benchmark for different core and uncore frequencies. The data array does not fit in the processor’s L3 processor cache
  • 14. Hardware parameter tuning L3 Cache Energy efficiency and Bandwidth: • Different optimal uncore frequencies Evaluation using STREAM Copy benchmark Results by TU Dresden under READEX project Heatmap of the energy consumption of a stream benchmark for different core and uncore frequencies. The data array does fit in the processor’s L3 processor cache
  • 15. Effect of hardware parameter tuning on kernels with various arithmetic intensity
  • 17. Static tuning for various arithmetic intensity Ratio from 1:9
  • 18. Static tuning for various arithmetic intensity Ratio from 2:8
  • 19. Static tuning for various arithmetic intensity Ratio from 3:7
  • 20. Static tuning for various arithmetic intensity Ratio from 4:6
  • 21. Static tuning for various arithmetic intensity Ratio from 5:5
  • 22. Static tuning for various arithmetic intensity Ratio from 6:4
  • 23. Static tuning for various arithmetic intensity Ratio from 7:3
  • 24. Static tuning for various arithmetic intensity Ratio from 8:2
  • 25. Static tuning for various arithmetic intensity Ratio from 9:1
  • 26. Hardware parameter tuning Behavior of the simple application with two kernels • Low computational intensity – DGEMV • High computational intensity – DGEMM • Tuning of three parameters • Core frequency • Uncore frequency • Number of OpenMP threads • Visualized by RADAR .... Low CI (DGEMV) High CI (DGEMM) 10 threads 2.2 GHz UCF 1.2 GHz CF 12 threads 1.2 GHz UCF 2.5 GHz CF Static tuning for both kernels 12 threads 2.2 GHz UCF 2.4 GHz CF Computenodeenergyconsumption[J] CPU core frequency [GHz] CPU core frequency [GHz] CPU core frequency [GHz] Computenodeenergyconsumption[J] Computenodeenergyconsumption[J] Note: runtime of both kernels was equal for default settings Two kernels with 1:1 workload ratio Energy consumption Energy savings Default settings 2017J - - Static optimal 1833J 179J 9% Dynamic optimal 1612J 221J 12% Total savings - 400J 20%
  • 27. Core and uncore frequency tuning under power cap
  • 28. Experiments description and testbed parameters Testbed: Broadwell partition of the Galileo supercomputer in CINECA • dual socket server • two 18-core Intel Xeon E5-2697v4 processor • 2.3 GHz nominal frequency. • 2.7 GHz turbo frequency when all 18 cores are utilized • 145W TDP Key tunable parameters of the 18-core Intel Xeon E5- 2697v4 processor and their respective ranges and steps. A set of experiments performed on Intel Broadwell Architecture
  • 29. Tuning of COMPUTE bound workload • behavior of the platform when running memory bound workload • under 145 W (TDP level, no power cap) • three different power cap levels 100 W, 80 W and 60 W. 3,268s 3,268s 3,903s 3,903s 7,409s 3,577s 7,693s 4,379s 3,653s 363,4J 311,8J 311,8J 285,4J 304,2J 271,6J 290,0J 0 50 100 150 200 250 300 350 400 2,5 4,5 6,5 8,5 10,5 12,5 1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0 Energyconsumption[J] Runtime[s] Frequency [GHz] Tuningof computebound region under 80W power cap EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - min UCF EXP7 - DVFS & UCF - max UCF EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - min UCF EXP7 - DVFS & UCF - max UCF 8.5% energy savings 8.4% time savings 12.9% energy savings 10.9% time extension 3,268s 3,450s 3,450s 7,411s 3,293s 4,378s 7,698s 363,4J 344,4J 344,4J 300,4J 293,0J 305,4J 271,0J 297J 0 50 100 150 200 250 300 350 400 2,5 4,5 6,5 8,5 10,5 12,5 1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0 Energyconsumption[J] Runtime[s] Frequency [GHz] Tuningof computebound region under 100W power cap EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - min UCF EXP7 - DVFS & UCF - max UCF EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - min UCF EXP7 - DVFS & UCF - max UCF 14.9% energy savings 4.5% time savings 21.3% energy savings 21.1% time extension 3,268s 4,944s 4,944s 7,410s 4,849s4,477s 7,692s 4,565s 4,606s 363,4J 296J 295,0J 268,0J 303,0J 270,4J 0 50 100 150 200 250 300 350 400 2,5 4,5 6,5 8,5 10,5 12,5 1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0 Energyconsumption[J] Runtime[s] Frequency [GHz] Tuningof computebound region under 60W power cap EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - max UCF EXP7 - DVFS & UCF - min UCF EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - max UCF EXP7 - DVFS & UCF - min UCF 9.1% energy savings 9.4% time savings
  • 30. 3,268s 3,450s 3,450s 7,411s 3,293s 4,378s 7,698s 363,4J 344,4J 344,4J 300,4J 293,0J 305,4J 271,0J 297J 0 50 100 150 200 250 300 350 400 2,5 4,5 6,5 8,5 10,5 12,5 1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0 Energyconsumption[J] Runtime[s] Frequency [GHz] Tuning of compute bound region under 100W power cap EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - min UCF EXP7 - DVFS & UCF - max UCF EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - min UCF EXP7 - DVFS & UCF - max UCF 14.9% energy savings 4.5% time savings 21.3% energy savings 21.1% time extension Tuning of COMPUTE bound workload under 100W power cap
  • 31. 3,268s 3,268s 3,903s 3,903s 7,409s 3,577s 7,693s 4,379s 3,653s 363,4J 311,8J 311,8J 285,4J 304,2J 271,6J 290,0J 0 50 100 150 200 250 300 350 400 2,5 4,5 6,5 8,5 10,5 12,5 1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0 Energyconsumption[J] Runtime[s] Frequency [GHz] Tuning of compute bound region under 80W power cap EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - min UCF EXP7 - DVFS & UCF - max UCF EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - min UCF EXP7 - DVFS & UCF - max UCF 8.5% energy savings 8.4% time savings 12.9% energy savings 10.9% time extension Tuning of COMPUTE bound workload under 80W power cap
  • 32. 3,268s 4,944s 4,944s 7,410s 4,849s 4,477s 7,692s 4,565s 4,606s 363,4J 296J 295,0J 268,0J 303,0J 270,4J 0 50 100 150 200 250 300 350 400 2,5 4,5 6,5 8,5 10,5 12,5 1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0 Energyconsumption[J] Runtime[s] Frequency [GHz] Tuning of compute bound region under 60W power cap EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - max UCF EXP7 - DVFS & UCF - min UCF EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - max UCF EXP7 - DVFS & UCF - min UCF 9.1% energy savings 9.4% time savings Tuning of COMPUTE bound workload under 60W power cap
  • 33. Observations for COMPUTE bound workload • To achieve the best possible performance • the uncore frequency must be reduces to minimum • 9.4 % performance gain up to and • 14.9 % lower energy consumption • If further energy savings are required – use DVFS and lower the core freq. • up to 21 % of energy savings • up to 21 % penalty in runtime • this effect is more visible for higher powercap levels
  • 34. Tuning of memory bound workload • behavior of the platform when running memory bound workload • under 145 W (TDP level, no power cap) • three different power cap levels 100 W, 80 W and 60 W. 1,886s 1,959s 1,886s 197,6J 188,2J 188,2J 148,6J 115,2J 170J 145,6J 0 50 100 150 200 250 1,8 2,3 2,8 3,3 3,8 4,3 4,8 5,3 5,8 1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0 Energyconsumption[J] Runtime[s] Frequency [GHz] Tuningofmemorybound region under 100W power cap EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - UCF = 2.2GHz EXP7 - DVFS & UCF - max UCF EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - UCF = 2.2GHz EXP7 - DVFS & UCF - max UCF 38.7% energy savings 3.6% time extension 1,886s 1,920s 1,890s 1,959s 197,6J 153,2J 114,4J 146,0J 0 50 100 150 200 250 1,8 2,3 2,8 3,3 3,8 4,3 4,8 5,3 5,8 1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0 Energyconsumption[J] Runtime[s] Frequency [GHz] Tuningofmemorybound region under 80W power cap EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - UCF = 2.2GHz EXP7 - DVFS & UCF - max UCF EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - UCF = 2.2GHz EXP7 - DVFS & UCF - max UCF 21.6% energy savings 3.6% time extension 1,886s 2,475s 2,475s 1,945s 2,397s 1,925s 197,6J 147,8J 116,2J 115,0J 0 50 100 150 200 250 1,8 2,3 2,8 3,3 3,8 4,3 4,8 5,3 5,8 1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0 Energyconsumption[J] Runtime[s] Frequency [GHz] Tuningofmemorybound region under 60W power cap EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - UCF = 2.2GHz EXP7 - DVFS & UCF - max UCF EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - UCF = 2.2GHz EXP7 - DVFS & UCF - max UCF 22.2% energy savings 22.2% time savings
  • 35. 1,886s 1,959s 1,886s 197,6J 188,2J 188,2J 148,6J 115,2J 170J 145,6J 0 50 100 150 200 250 1,8 2,3 2,8 3,3 3,8 4,3 4,8 5,3 5,8 1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0 Energyconsumption[J] Runtime[s] Frequency [GHz] Tuning of memory bound region under 100W power cap EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - UCF = 2.2GHz EXP7 - DVFS & UCF - max UCF EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - UCF = 2.2GHz EXP7 - DVFS & UCF - max UCF 38.7% energy savings 3.6% time extension Tuning of memory bound workload under 100W power cap
  • 36. 1,886s 1,920s 1,890s 1,959s 197,6J 153,2J 114,4J 146,0J 0 50 100 150 200 250 1,8 2,3 2,8 3,3 3,8 4,3 4,8 5,3 5,8 1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0 Energyconsumption[J] Runtime[s] Frequency [GHz] Tuning of memory bound region under 80W power cap EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - UCF = 2.2GHz EXP7 - DVFS & UCF - max UCF EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - UCF = 2.2GHz EXP7 - DVFS & UCF - max UCF 21.6% energy savings 3.6% time extension Tuning of memory bound workload under 80W power cap
  • 37. 1,886s 2,475s 2,475s 1,945s 2,397s 1,925s 197,6J 147,8J 116,2J 115,0J 0 50 100 150 200 250 1,8 2,3 2,8 3,3 3,8 4,3 4,8 5,3 5,8 1,0 1,2 1,4 1,6 1,8 2,0 2,2 2,4 2,6 2,8 3,0 Energyconsumption[J] Runtime[s] Frequency [GHz] Tuning of memory bound region under 60W power capEXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - UCF = 2.2GHz EXP7 - DVFS & UCF - max UCF EXP0 - default EXP1 - default Pcap EXP5 - DVFS under Pcap EXP6 - UCF under Pcap EXP7 - DVFS & UCF - UCF = 2.2GHz EXP7 - DVFS & UCF - max UCF 22.2% energy savings 22.2% time savings Tuning of memory bound workload under 60W power cap
  • 38. Observations for both workloads Observations for memory bound workload • Under the power budget lower that 80 W • DVFS should be set to minimum value • boost the performance of the uncore part by 22%. • Tuning of the uncore frequency • has low effect on the performance • but a major effect on energy consumption • between 21% (60 W and 80W) to 38% (100W) Observations for compute bound workload • To achieve the best possible performance • the uncore frequency must be reduces to minimum • 9.4 % performance gain up to and • 14.9 % lower energy consumption • If further energy savings are required – use DVFS and lower the core freq. • up to 21 % of energy savings • up to 21 % penalty in runtime • this effect is more visible for higher powercap levels
  • 39. Evaluation of complex HPC applications
  • 40. BEM4I Application Application runtime assemble_k [s] assemble_v [s] gmres_solve [s] print_vtu [s] main [s] default runtime 5.4 5.9 10.2 5.6 27.3 static tuning runtime 9.8 10.6 6.1 2.4 29.0 dynamic tuning runtime 7.0 7.2 7.9 2.1 24.3 static savings [%] -82.3% -79.1% 40.5% 56.8% -6.2% dynamic savings [%] -30.6% -20.9% 23.2% 62.9% 10.9% Hardware: dual socket system with 2x12 CPU cores – ”standard HW” in HPC centres Region description: • assemble_k and assemble_v – high utilization of vector units, extreme level of optimization – fully compute bound great utilization of both sockets and all cores • gmres_solve – uses DGEMV from MKL – memory bound, suffers on NUMA effect; this routine is more efficient on single socket • print_vtu – single threaded I/O and network bound region why stores data to a file on LUSTRE system ”static": { "FREQUENCY": ”25", <--------- 2.5 GHz "NUM_THREADS": ”12", <--------- 12 OpenMP threads "UNCORE_FREQUENCY": ”22” } <--------- 2.2 GHz "assemble_k": { "FREQUENCY": "23", "NUM_THREADS": "24", "UNCORE_FREQUENCY": ”16” }, "assemble_v": { "FREQUENCY": ”25", "NUM_THREADS": "24", "UNCORE_FREQUENCY": ”14” }, "gmres_solve": { "FREQUENCY": ”17", "NUM_THREADS": ”8", "UNCORE_FREQUENCY": ”22” }, "print_vtu": { "FREQUENCY": "25", "NUM_THREADS": ”6", "UNCORE_FREQUENCY": ”24” }
  • 41. Compute node energy assemble_k [J] assemble_v [J] gmres_solve [J] print_vtu [J] main [J] default energy 1476 1484 2733 1142 6872 static tuning energy 1962 2015 1366 420 5792 dynamic tuning energy 1467 1462 1259 293 4531 static savings [%] -33.8% -35.8% 50.0% 63.2% 15.7% dynamic savings [%] 0.6% 1.5% 53.9% 74.3% 34.1% BEM4I Application Large energy savings is combination of optimal HW settings and runtime savings due to mitigation of NUMA effect by optimal settings of OpenMP threading • Without savings in runtime caused by similar application will • Energy savings approx. 15 – 20% • Runtime savings approx. -15% ”static": { "FREQUENCY": ”25", <--------- 2.5 GHz "NUM_THREADS": ”12", <--------- 12 OpenMP threads "UNCORE_FREQUENCY": ”22” } <--------- 2.2 GHz "assemble_k": { "FREQUENCY": "23", "NUM_THREADS": "24", "UNCORE_FREQUENCY": ”16” }, "assemble_v": { "FREQUENCY": ”25", "NUM_THREADS": "24", "UNCORE_FREQUENCY": ”14” }, "gmres_solve": { "FREQUENCY": ”17", "NUM_THREADS": ”8", "UNCORE_FREQUENCY": ”22” }, "print_vtu": { "FREQUENCY": "25", "NUM_THREADS": ”6", "UNCORE_FREQUENCY": ”24” }
  • 42. OpenFOAM Application OpenFOAM Energy consumption Energy savings Default settings 14 231J - - Static tuning 12 264J 2 264J 15.9% Dynamic tuning Total savings • Computational fluid dynamics • Finite volume + multigrid solver
  • 43. OpenFOAM Application OpenFOAM Energy consumption Energy savings Default settings 14 231J - - Static tuning 12 264J 2 264J 15.9% Dynamic tuning 11 370J 597J 4.8% Total savings 2 861J 20.1% • Computational fluid dynamics • Finite volume + multigrid solver
  • 44. ESPRESO Application 33% of energy savings 22% of time savings and improved strong scalability • Structural mechanics code • Finite element + sparse FETI solver • Different tuning models for different # of nodes is needed for strong scalability – workload per node is varies • Includes dynamic switching overheads Energy savings analysis for the strong scalability test of the ESPRESO library when running the cube benchmark
  • 46. Application parameters tuning of the ESPRESO 50% - 66% against ”reasonable” settings 86% against the worst case 0 50 100 150 200 250 300 0 500 1000 1500 2000 2500 3000 3500 Energyconsumption[kJ] Configuration index the “reasonable” settings the optimal settings 9 parameters 3840 combinations • FETI METHOD 2x • PRECONDITIONER 5x • ITERATIVE SOLVER TYPE 2x • HFETI type 2x • NON-UNIFORM PARTS 6x • REDUNDANT LAGRANGE 2x • SCALING 2x • B0_TYPE 2x • ADAPTIVE PRECISION 2x
  • 47. Application parameters tuning Application parameter tuning parameters is very promising • application configuration parameters are given in the input file • each setting requires an individual start of the application • tool performs automatic search of application parameter space Application number of parameters tested / total number of options Energy savings compared to the worst settings Energy savings compared to default or reasonable settings ESPRESO 9 / 3840 86% 50 – 66% ELMER 1 / 40 97% 50 – 75% OpenFOAM 2 / 12 24% 8% INDEED 3 / 12 35% 25%