2. Outline
• Thermal Management
• Thermal Model Learning
• Model Predictive Controller
• Computational Sprinting
• Reliability Borrowing
3. Thermal Management
• Technology scaling
• Software: high performance requirements
• System integration: costs
• Spatial and temporal workload variation
• Limited dissipation capabilities
• High power densities
• NON UNIFORM: power, temperature, performance
• Leakage current
• Hot spots, thermal gradients and cycles
• Reliability loss, aging
4. Thermal Management
• Dynamic approach: on-line tuning of system performance and temperature through closed-loop control
5. DRM - General Architecture
• System
• Sensors
– Performance counters (PMU)
– Core temperature
• Actuators (knobs)
– ACPI states
– P-State -> DVFS
– C-State -> power gating
– Task allocation
• Controller
– Reactive
– Threshold/heuristic
– Control theory
– Predictive
[Figure: applications App.1 ... App.N with threads 1 ... N run on the O.S.; the controller sits at the SW/HW boundary; CPU1 ... CPUN with private L1 caches, (f, v) knobs and power gating, shared L2 caches, network and DRAM. Simulation snapshot: TCPU, #L1MISS, #BUSACCESS, CYCLEACTIVE, ...]
6. Thermal Controller
Threshold-based controller
• T > Tmax: low frequency
• T < Tmin: high frequency
• Cannot prevent overshoot
• Thermal cycles
Classical feedback controller
• PID controllers
• Better than the threshold-based approach
• Cannot prevent overshoot
Model Predictive Controller (MPC)
• Internal prediction: avoids overshoot
• Optimization: maximizes performance
• Centralized
• Aware of the thermal influence of neighbor cores
• All at once: MIMO controller
• Complexity!
[Figure: MPC block diagram with a thermal model and an optimizer (cost function, constraint); signals: target frequency, past input & output, future input, future output, future error]
[Intel®, ISSCC 2007]
7. Background – Thermal Modeling
Task j -> power model -> Pj -> thermal model -> Tj
• Power model: P = g(task, f)
• Per-core quantities: Pn,j, Tn,j
8. MPC Robustness
MPC needs a thermal model
• Accurate, with low complexity
• Must be known a priori
• Depends on user configuration
• Changes with system ageing
"In-field" self-calibration
• Force test workloads
• Measure core temperatures
• System identification
[Figure: MPC block diagram using an identified state-space thermal model; training tasks are executed as workloads on Core1 ... CoreN of the multicore, and the per-core power and temperature traces over time t feed the system identification]
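The self-calibration steps above (force a test workload, measure temperatures, identify the model) can be sketched for a single core as follows. This is a minimal illustration, not the method used in the slides: the first-order model T[k+1] = a·T[k] + b·P[k] + c, the square-wave test power, and all numeric values are assumptions chosen only to make the example runnable.

```python
import numpy as np

def identify_thermal_model(power, temp):
    """Ordinary least-squares fit of T[k+1] = a*T[k] + b*P[k] + c.

    Hypothetical single-core, first-order model; returns (a, b, c).
    """
    phi = np.column_stack((temp[:-1], power[:-1], np.ones(len(temp) - 1)))
    theta, *_ = np.linalg.lstsq(phi, temp[1:], rcond=None)
    return theta

def forced_workload_response(a, b, c, power, t0):
    """Simulate the temperature produced by a forced test workload."""
    temp = np.empty(len(power))
    temp[0] = t0
    for k in range(len(power) - 1):
        temp[k + 1] = a * temp[k] + b * power[k] + c
    return temp

# Assumed test workload: power toggles between 10 W and 2 W every 50 samples.
power = np.where((np.arange(400) // 50) % 2 == 0, 10.0, 2.0)
# Assumed "true" core used to generate noise-free measurements.
temp = forced_workload_response(0.95, 0.5, 2.0, power, t0=40.0)

a_hat, b_hat, c_hat = identify_thermal_model(power, temp)
print(a_hat, b_hat, c_hat)
```

With noise-free data the fit recovers the parameters used to generate the trace; real sensor data adds quantization and measurement noise, which is the topic of the following slides.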
10. Centralized Thermal Modelling
Thermal Modeling
i7 Server Platform – 4 cores
Ts = 1 ms - quantization noise
Step response:
• Black box
• Our approach: LS + physical constraints
11. Distributed Thermal Modelling
Thermal Modeling
Single-chip Cloud Computer (SCC) – 48 cores
Ts = 100 ms - measurement noise
Standard ARX:
• Designed only for process noise!
12. Distributed Thermal Modelling
Bias-compensated ARX:
• Iteratively estimate the noise variance and compensate for it in the LS
[Figure: residual correlation at lags 0..8]
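The idea of iteratively estimating the noise variance and compensating for it in the LS can be illustrated with a textbook-style bias-compensated least-squares loop. This is a sketch under stated assumptions, not necessarily the algorithm used in the slides: the output is assumed to be corrupted by white measurement noise only, the model is y(k) = Σ a_i·y(k-i) + Σ b_i·u(k-i), and the function names and the test system are hypothetical.

```python
import numpy as np

def regressors(y, u, na, nb):
    """Build rows [y(k-1..k-na), u(k-1..k-nb)] and the targets y(k)."""
    n = max(na, nb)
    rows = [np.concatenate((y[k - na:k][::-1], u[k - nb:k][::-1]))
            for k in range(n, len(y))]
    return np.array(rows), y[n:]

def bias_compensated_arx(y, u, na=1, nb=1, iters=50):
    """Return (theta_ls, theta_bc, sigma2) for noisy output measurements."""
    phi, target = regressors(y, u, na, nb)
    n = len(target)
    r = phi.T @ phi / n
    theta_ls = np.linalg.solve(r, phi.T @ target / n)
    # Mean squared residual of the plain (biased) LS fit.
    j_ls = np.mean((target - phi @ theta_ls) ** 2)
    # Only the autoregressive regressors carry measurement noise.
    d = np.diag(np.r_[np.ones(na), np.zeros(nb)])
    theta = theta_ls.copy()
    for _ in range(iters):
        # Noise-variance estimate from the LS residual and current AR estimate.
        sigma2 = j_ls / (1.0 + theta_ls[:na] @ theta[:na])
        # Remove the bias that the noise variance induces in the LS estimate.
        theta = theta_ls + sigma2 * np.linalg.solve(r, d @ theta)
    return theta_ls, theta, sigma2

def simulate(a, b, n, noise_std, seed=0):
    """Assumed first-order test system with white output measurement noise."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(n)
    y = np.zeros(n)
    for k in range(1, n):
        y[k] = a * y[k - 1] + b * u[k - 1]
    return y + noise_std * rng.standard_normal(n), u

y_meas, u = simulate(a=0.9, b=0.1, n=20000, noise_std=0.1)
theta_ls, theta_bc, sigma2 = bias_compensated_arx(y_meas, u)
print("LS pole:", theta_ls[0], "compensated pole:", theta_bc[0])
```

In this synthetic case the plain LS estimate of the pole is pulled toward zero by the measurement noise, while the compensated estimate moves back toward the value used in the simulation. Convergence of this fixed-point loop is not guaranteed in general.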
13. Distributed Thermal Controller
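One MPC controller per core (shown for Core 1)
• Linear model, 2 states per core
• QP optimization, implicit formulation
– Nonlinear part (frequency to power): g(·) and g⁻¹(·), using the CPI
– s.t. linear part (power to temperature)
• Classic Luenberger state observer: estimates the state x1 from T1
[Figure: f1,EC and CPI -> g(·) -> P1,EC -> MPC controller (linear model, QP optimizer, observer; inputs TENV and T1) -> P1,TC -> g⁻¹(·) with CPI -> f1,TC]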
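A Luenberger observer for a two-state, single-core linear model can be sketched as below. The matrices A, B, C and the observer gain L are made-up values for illustration (the slides do not give them); the gain is chosen so that A - L·C has all eigenvalues inside the unit circle, which is what makes the estimate converge.

```python
import numpy as np

# Hypothetical discrete-time model: x[k+1] = A x[k] + B p[k], T[k] = C x[k].
A = np.array([[0.90, 0.05],
              [0.05, 0.90]])
B = np.array([0.10, 0.0])
C = np.array([1.0, 0.0])
# Assumed observer gain; eig(A - L C) is roughly {0.87, 0.13}, so it is stable.
L = np.array([0.8, 0.5])

def observer_step(x_hat, p, t_meas):
    """One Luenberger update: model prediction plus output-error correction."""
    return A @ x_hat + B * p + L * (t_meas - C @ x_hat)

def run(steps=200):
    x = np.array([5.0, 3.0])      # true state, unknown to the observer
    x_hat = np.zeros(2)           # initial estimate
    for k in range(steps):
        p = 1.0 + 0.5 * np.sin(0.1 * k)   # arbitrary power input
        t_meas = C @ x                     # only the temperature is measured
        x_hat = observer_step(x_hat, p, t_meas)
        x = A @ x + B * p
    return x, x_hat

x_true, x_est = run()
print(np.abs(x_true - x_est).max())
```

Only the first state is measured, yet the estimate of the second state converges as well, which is why a two-state-per-core model can be driven from a single temperature sensor.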
14. Outline
• Thermal Management
• Thermal Model Learning
• Model Predictive Controller
• Computational Sprinting
• Reliability Borrowing
15. Computational Sprinting
• TDP statically defined on worst-case power
– Considers only the thermal resistance; the package is optimized to minimize it
– Does not allow powering on all the cores
[Figure: 4x4 grid of cores; plots of TCORE (with TAMB) and PCHIP (with TDP)]
16. Computational Sprinting
• Application requires maximum performance in short bursts
– Thermal capacitance = "heat buffer"
– Use the heat buffer to run all cores at maximum performance for a short time window
– Triggered on demand by peak parallel workload phases and user interaction
[Figure: core grids for the sequential and parallel phases; plots of TCHIP (safe limit) and PCHIP (TDP)]
17. Computational Sprinting
– Augment the sprint duration with Phase Change Material (PCM)
• Heat tanks need restore phases!
• Need to allocate the sprint phases to the most QoS-critical tasks!
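The "heat buffer" argument can be made concrete with a single lumped RC thermal cell. The sketch below is an assumption-laden illustration: r_th, c_th and all numbers are hypothetical, and the chip is reduced to one thermal resistance and one capacitance. From T(t) = T_ss - (T_ss - T_start)·exp(-t / (r_th·c_th)) with T_ss = T_amb + P·r_th, the time to reach the safe limit follows directly.

```python
import math

def sprint_duration(p_sprint, r_th, c_th, t_amb, t_start, t_max):
    """Seconds a sprint at p_sprint can last before reaching t_max.

    Single lumped RC cell (hypothetical parameters):
    r_th in K/W, c_th in J/K, temperatures in degrees Celsius.
    """
    t_ss = t_amb + p_sprint * r_th      # steady-state temperature at p_sprint
    if t_ss <= t_max:
        return math.inf                 # sustainable: at or below TDP
    if t_start >= t_max:
        return 0.0                      # no thermal headroom left
    return r_th * c_th * math.log((t_ss - t_start) / (t_ss - t_max))

# Assumed package: 2 K/W, 5 J/K, 25 C ambient, 85 C limit -> TDP = 30 W.
tdp = (85.0 - 25.0) / 2.0
print(sprint_duration(tdp, 2.0, 5.0, 25.0, 45.0, 85.0))    # inf
print(sprint_duration(60.0, 2.0, 5.0, 25.0, 45.0, 85.0))   # about 5.1 s
```

In this model the sprint time scales linearly with c_th, which is the effect a larger heat buffer provides. A phase change material stores heat mainly as latent heat, which this sketch does not model.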
19. Guaranteed re-sprinting
• One RC cell per core, PTOT = ∑ PCORE, time-varying internal energy bound UB
• M ~ 1-10 s => silicon at steady state
• Guaranteed tasks
– N: sprint duration
– M: re-sprint rate
[Figure: PTOT versus time, with time-axis marks ti, ti+M, ti+M+N]
20. Guaranteed re-sprinting
[Figure: internal energy U(t) and its time-varying bound Ub(t), with levels UMAX and UN; PTOT between PREST and PMAX; TAMB; time-axis marks ti, ti+M, ti+M+N]
21. Outline
• Thermal Management
• Thermal Model Learning
• Model Predictive Controller
• Computational Sprinting
• Reliability Borrowing
22. Dynamic Reliability Management
Idea:
Tune VDD and temperature at run-time to reach a target lifetime while minimizing the workload degradation
State-of-the-art:
[Flowchart: expected workload (... to lifetime) and performance goal -> expected reliability @ target lifetime; expected reliability > reliability target? No: cut performance (VDD, T). Yes: give performance (VDD, T)]
Issues:
1) Workload changes in seconds, reliability in years
2) Different workloads ⇒ same performance cut ⇒ different final-user QoS
23. Our Solution – Key innovations
1. Two-level controller @ two different time scales
– Long Intervals: predefined interval of time for reliability control
– Short intervals: task scheduling periods
2. Reliability Speculation / Borrowing
– Flag the task as:
• Hard – High QoS (Real Time, latency constrained task)
• Soft – Low QoS (Background process)
– Reliability target updated each long interval
=> average bound within the long interval
– Speculation:
• Hard task – always runs at maximum performance
• Soft task – performance constrained by the reliability target
The soft tasks pay for the reliability loss induced by the hard tasks
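The borrowing rule can be illustrated with a simple budget account over one long interval. Everything in this sketch is a hypothetical simplification: reliability loss is reduced to an abstract "wear rate" per scheduling period, hard tasks always run at the maximum rate w_max, and soft tasks receive whatever keeps the interval average on target.

```python
def soft_allowance(budget_left, periods_left, w_min, w_max):
    """Wear rate granted to a soft task: spread the remaining budget evenly."""
    if periods_left <= 0:
        return w_min
    return min(w_max, max(w_min, budget_left / periods_left))

def run_long_interval(task_flags, target_rate, w_min, w_max):
    """Return the wear rate used in each short interval of one long interval.

    task_flags: sequence of "hard" / "soft", one per scheduling period.
    target_rate: average wear rate allowed over the long interval (assumed unit).
    """
    budget = target_rate * len(task_flags)
    rates = []
    for i, flag in enumerate(task_flags):
        if flag == "hard":
            rate = w_max    # hard tasks always run at maximum performance
        else:
            # Soft tasks only see the budget left after earlier hard tasks.
            rate = soft_allowance(budget, len(task_flags) - i, w_min, w_max)
        budget -= rate
        rates.append(rate)
    return rates

rates = run_long_interval(["hard", "soft", "soft", "soft"],
                          target_rate=1.0, w_min=0.5, w_max=2.0)
print(rates, sum(rates) / len(rates))
```

In this example the hard task spends twice the target rate, and the three soft tasks are slowed below the target so that the average over the long interval still meets it. If hard tasks are too frequent, the w_min floor prevents full repayment within the interval.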
24. Single Core Controller Architecture
• The complete architecture is composed of two controllers:
– Long Term Controller (LTC):
• Monitors the core reliability and sets a voltage value that acts as a soft constraint for the Short Term Controller. It can be entirely model-based or sensor-aided (reliability sensor).
– Short Term Controller (STC):
• Based on the soft constraint coming from the LTC, it assigns the operating voltage and frequency according to the workload requirements.
25. Task Mapping?
Problem:
• How can task mapping take advantage of it?
• Model Predictive Controller:
• Schedule the tasks to minimize the thermal controller activations
• Sprinting:
• How to flag the tasks?
• Sprint room is a finite buffer in time: how to use it?
• Reliability Borrowing:
• Soft-task slow-down depends on the hard-task rate: can we optimize it?