Thermal Control Activities



       Andrea Bartolini
       a.bartolini@unibo.it
Outline

•   Thermal Management
•   Thermal Model Learning
•   Model Predictive Controller
•   Computational Sprinting
•   Reliability Borrowing
Thermal Management
• Technology scaling, software, system integration
• High performance requirements, costs
• High power densities, spatial and temporal workload variation, limited dissipation capabilities

NON UNIFORM: power, temperature, performance

• Leakage current
• Hot spots, thermal gradients and cycles
• Reliability loss, aging
Dynamic Approach: on-line tuning of system performance and temperature through closed-loop control
DRM - General Architecture
• System
• Sensors
  – Performance counter - PMU
  – Core temperature
• Actuator - Knobs
  – ACPI states
     – P-State -> DVFS
     – C-State -> PGATING
  – Task allocation
• Controller
  – Reactive
     – Threshold/Heuristic
     – Control theory
  – Predictive

Figure (simulation snap-shot): applications App.1 ... App.N, each with threads 1 ... N, run on top of the O.S. The CONTROLLER sits at the SW/HW boundary and drives the knobs (f,v and PGATING) of CPU1 ... CPUN. Each CPU has a private L1, L2 caches are shared, and a network connects to DRAM. Sensor inputs: TCPU, #L1MISS, #BUSACCESS, CYCLEACTIVE, ...
Thermal Controller

Threshold based controller [Intel®, ISSCC 2007]
• T > Tmax -> low freq
• T < Tmin -> high freq
• Cannot prevent overshoot
• Thermal cycle

Classical feed-back controller
• PID controllers
• Better than threshold based approach
• Cannot prevent overshoot

Model Predictive Controller
• Internal prediction: avoid overshoot
• Optimization: maximizes performance
• Centralized
   – Aware of neighbor cores thermal influence
   – All at once – MIMO controller
   – Complexity !!!

Figure (MPC block diagram): past input & output and future input feed the thermal model, which predicts the future output. The future output is compared with the target frequency to give the future error. The optimizer uses the cost function and the constraints to compute the future input.
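The threshold rule on this slide can be sketched on a single lumped RC thermal cell to show where the overshoot and the thermal cycle come from. This is a minimal sketch under stated assumptions: the RC values, power levels, thresholds and sampling period are all illustrative, and the function name `simulate_threshold_controller` is hypothetical.

```python
def simulate_threshold_controller(steps=1000, dt=0.01):
    """Simulate a threshold (hysteresis) controller on one lumped RC thermal cell."""
    # All constants below are illustrative assumptions, not measured values.
    r_th, c_th = 1.0, 0.5          # thermal resistance [K/W], capacitance [J/K]
    t_amb = 25.0                   # ambient temperature [degC]
    t_max, t_min = 80.0, 70.0      # controller thresholds [degC]
    p_high, p_low = 80.0, 30.0     # power at high / low frequency [W]

    temp, high_freq = t_amb, True
    trace, switches = [], 0
    for _ in range(steps):
        # Reactive rule from the slide: T > Tmax -> low freq, T < Tmin -> high freq.
        if temp > t_max and high_freq:
            high_freq, switches = False, switches + 1
        elif temp < t_min and not high_freq:
            high_freq, switches = True, switches + 1
        power = p_high if high_freq else p_low
        # Forward-Euler step of C * dT/dt = P - (T - Tamb) / R.
        temp += dt / c_th * (power - (temp - t_amb) / r_th)
        trace.append(temp)
    return trace, switches


trace, switches = simulate_threshold_controller()
print(f"peak temperature: {max(trace):.2f} degC, frequency switches: {switches}")
```

Because the controller reacts only after a sample has already crossed Tmax, the trace necessarily exceeds the threshold before each switch, and the temperature keeps bouncing between the two thresholds.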
Background – Thermal Modeling

Task j -> Power model -> Pj -> Thermal model -> Tj

• Power model: P = g(task, f)
• Thermal model: power Pn,j -> temperature Tn,j

Figure: one task running on each core.
MPC Robustness

MPC needs a Thermal Model
• Accurate, with low complexity
• Must be known “a priori”
• Depends on user configuration
• Changes with system ageing

“In field” Self-Calibration
• Force test workloads
• Measure core temperatures
• System identification

Figure (left): MPC block diagram with thermal model, optimizer, cost function, constraints and target frequency.
Figure (right): training tasks are executed as workloads on the multicore (Core1 ... Corei ... CoreN); the resulting power and temperature traces feed the system identification step, which returns the identified state-space thermal model.
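The three self-calibration steps can be sketched for a single core with a first-order model x[k+1] = a*x[k] + b*p[k], where x is the temperature above ambient and p the power. This is only a sketch: the model order, the "true" coefficients used to generate the data, and the helper name `fit_first_order_thermal_model` are assumptions for illustration, and the data are noise-free.

```python
import random


def fit_first_order_thermal_model(temps, powers):
    """Least-squares fit of x[k+1] = a*x[k] + b*p[k] via the 2x2 normal equations."""
    s_xx = s_xp = s_pp = r_x = r_p = 0.0
    for k in range(len(temps) - 1):
        x, p, x_next = temps[k], powers[k], temps[k + 1]
        s_xx += x * x
        s_xp += x * p
        s_pp += p * p
        r_x += x * x_next
        r_p += p * x_next
    det = s_xx * s_pp - s_xp * s_xp
    a = (r_x * s_pp - r_p * s_xp) / det
    b = (s_xx * r_p - s_xp * r_x) / det
    return a, b


# "Force test workloads": a pseudo-random on/off power pattern (assumed values).
rng = random.Random(1)
powers = [rng.choice([5.0, 40.0]) for _ in range(500)]

# "Measure core temperatures": here produced by a hypothetical ground-truth model.
A_TRUE, B_TRUE = 0.95, 0.04
temps = [0.0]
for p in powers[:-1]:
    temps.append(A_TRUE * temps[-1] + B_TRUE * p)

# "System identification": recover the coefficients from the traces alone.
a_hat, b_hat = fit_first_order_thermal_model(temps, powers)
print(f"a = {a_hat:.4f}, b = {b_hat:.4f}")
```

The test workload has to switch power levels often enough to excite the dynamics; a constant power trace would make the normal equations singular.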
Centralized Thermal Modelling
    i7 Server Platform – 4 cores
 Ts = 1 ms   - Quantization noise

Step response (figure), comparing:
• Black box
• Our approach: LS + physical constraints
Distributed Thermal Modelling
      Single-chip Cloud Computer (SCC) – 48 cores
    Ts = 100 ms     - Measurement noise

Standard ARX:
•    Designed only for process noise!

Bias Compensated ARX
•    Iteratively estimate the noise variance and compensate for it in the LS

Figure: residual correlation @ lag 0..8
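Why measurement noise breaks a standard ARX fit, and what the compensation does, can be sketched on a first-order model y[k+1] = a*y[k] + b*u[k] with white sensor noise added to y. The sketch below assumes the noise variance is known, whereas the slide's method estimates it iteratively; the plant coefficients, noise level and function names are hypothetical.

```python
import random


def solve_2x2(m11, m12, m22, r1, r2):
    """Solve the symmetric system [[m11, m12], [m12, m22]] @ [a, b] = [r1, r2]."""
    det = m11 * m22 - m12 * m12
    return (r1 * m22 - r2 * m12) / det, (m11 * r2 - m12 * r1) / det


def identify_arx(y_meas, u, noise_var=0.0):
    """First-order ARX fit; noise_var > 0 removes the measurement-noise bias."""
    n = len(y_meas) - 1
    m11 = sum(y_meas[k] ** 2 for k in range(n)) / n
    m12 = sum(y_meas[k] * u[k] for k in range(n)) / n
    m22 = sum(u[k] ** 2 for k in range(n)) / n
    r1 = sum(y_meas[k] * y_meas[k + 1] for k in range(n)) / n
    r2 = sum(u[k] * y_meas[k + 1] for k in range(n)) / n
    # White sensor noise inflates only the E[y^2] term, so subtract its variance.
    return solve_2x2(m11 - noise_var, m12, m22, r1, r2)


rng = random.Random(0)
A_TRUE, B_TRUE, NOISE_STD = 0.9, 0.1, 0.1   # hypothetical plant and sensor noise
u = [float(rng.random() < 0.5) for _ in range(20000)]
y = [0.0]
for k in range(len(u) - 1):
    y.append(A_TRUE * y[-1] + B_TRUE * u[k])
y_meas = [v + rng.gauss(0.0, NOISE_STD) for v in y]

a_ls, b_ls = identify_arx(y_meas, u)
a_bc, b_bc = identify_arx(y_meas, u, noise_var=NOISE_STD ** 2)
print(f"plain LS: a = {a_ls:.3f}   bias-compensated: a = {a_bc:.3f}")
```

With noisy temperature readings the plain least-squares pole is pulled towards zero, which makes the identified thermal dynamics look faster than they are; subtracting the noise variance from the regressor covariance removes that bias.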
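Distributed Thermal Controller

MPC Controller Core 1 (block diagram):
• f1,EC and CPI -> g(·) -> P1,EC
• P1,EC, TENV and x1 -> MPC Controller (linear model, QP optimization) -> P1,TC
• P1,TC and CPI -> g-1(·) -> f1,TC
• T1 -> Observer -> x1 (2 states per core)

Notes:
• g(·): nonlinear (frequency to power)
• Linear model: power to temperature
• Implicit formulation of the QP problem (cost function and "s.t." constraints shown as an equation on the slide)
• Classic Luenberger state observer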
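The observer block can be sketched for one core with two states, of which only the core temperature T1 is measured. The matrices and the gain below are hypothetical values chosen for illustration; they are not identified from any platform, and the function name `luenberger_step` is an assumption.

```python
def luenberger_step(x_hat, power, temp_meas, A, B, C, L):
    """One update x_hat+ = A x_hat + B u + L (y - C x_hat) for a 2-state core model."""
    innovation = temp_meas - (C[0] * x_hat[0] + C[1] * x_hat[1])
    return [
        A[0][0] * x_hat[0] + A[0][1] * x_hat[1] + B[0] * power + L[0] * innovation,
        A[1][0] * x_hat[0] + A[1][1] * x_hat[1] + B[1] * power + L[1] * innovation,
    ]


# Hypothetical 2-state model: x[0] is the measured core temperature (relative to
# ambient), x[1] a slower unmeasured node. Values are illustrative only.
A = [[0.90, 0.05], [0.02, 0.97]]
B = [0.10, 0.00]
C = [1.0, 0.0]
# Gain placing both eigenvalues of (A - L C) at 0.5 for this particular A and C.
L = [0.87, 4.438]

x_true, x_hat = [10.0, 6.0], [0.0, 0.0]   # observer starts with a wrong guess
for k in range(100):
    power = 20.0 if (k // 10) % 2 == 0 else 5.0
    temp_meas = C[0] * x_true[0] + C[1] * x_true[1]   # sensor reads T1
    x_hat = luenberger_step(x_hat, power, temp_meas, A, B, C, L)
    x_true = [
        A[0][0] * x_true[0] + A[0][1] * x_true[1] + B[0] * power,
        A[1][0] * x_true[0] + A[1][1] * x_true[1] + B[1] * power,
    ]

error = max(abs(x_true[0] - x_hat[0]), abs(x_true[1] - x_hat[1]))
print(f"state estimation error after 100 steps: {error:.2e}")
```

The estimate converges because the error obeys e[k+1] = (A - L C) e[k], independently of the applied power, so the MPC can use x_hat in place of the unmeasured state.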
Computational Sprinting
• TDP statically defined on worst-case power
   – Considers only the thermal resistance; the package is optimized to minimize it
   – Does not allow powering on all the cores
• Application requires maximum performance in short bursts
   – Thermal capacitance = “Heat Buffer”
   – Use heat buffer to run all cores at maximum performance for a short time window
   – Triggered on-demand by peak parallel workload phases & user interaction
   – Augment the sprint duration with Phase Change Material (PCM)
      • Heat tanks -> need restore phases!!
      • Need to allocate the sprint phases to the most QoS-critical task!!

Figure: a 4x4 core grid in the sequential phase and in the parallel phase; PCHIP plotted against the TDP level and TCHIP against the safe limit; thermal stack TCORE - PCM - TAMB.
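The "heat buffer" idea can be made concrete with a single lumped RC node: sprinting above the sustainable power is possible until the node reaches the safe temperature. This is a rough sketch under assumptions: one RC cell for the whole chip, illustrative numbers, and a hypothetical function name `sprint_duration`. A PCM stores latent heat and is not a plain capacitance, so the larger capacitance in the second call is only a crude stand-in for it.

```python
import math


def sprint_duration(p_sprint, t_start, t_max, t_amb, r_th, c_th):
    """Seconds until a lumped RC node heated with p_sprint reaches t_max."""
    t_steady = t_amb + p_sprint * r_th          # temperature if sprinting forever
    if t_steady <= t_max:
        return math.inf                          # sustainable: within the TDP
    if t_start >= t_max:
        return 0.0                               # no thermal headroom left
    # T(t) = t_steady + (t_start - t_steady) * exp(-t / (R*C)), solved for T = t_max.
    return r_th * c_th * math.log((t_steady - t_start) / (t_steady - t_max))


# Illustrative numbers only (assumed, not taken from the slides).
base = sprint_duration(p_sprint=100.0, t_start=45.0, t_max=85.0,
                       t_amb=25.0, r_th=1.0, c_th=10.0)
bigger_buffer = sprint_duration(p_sprint=100.0, t_start=45.0, t_max=85.0,
                                t_amb=25.0, r_th=1.0, c_th=30.0)
print(f"sprint lasts {base:.2f} s; with 3x the capacitance {bigger_buffer:.2f} s")
```

The same formula shows why restore phases are needed: a sprint that starts from a hotter t_start has less headroom and therefore a shorter duration.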
Re-Sprinting Controller
Two-level hierarchical controller

Figure (block diagram, signals as labelled on the slide):
• Upper level: PCM Model Predictive Controller
   – Signals: TPCM, TAMB, TCORE,0 ... TCORE,i ... TCORE,N, PTARGET,0 ... PTARGET,i ... PTARGET,16, energy bound Ub(●)
   – Towards the lower level: P*CORE,i ... P*CORE,16
• Lower level: one Thermal MPC per core
   – Signals: TPCM, TCORE,i, TNEIGH,i, P*CORE,i
   – Output: PCORE,0 ... PCORE,i ... PCORE,16
Guaranteed re-sprinting
• One RC cell per core, PTOT = ∑ PCORE,i
• Time-varying internal energy bound Ub(t)
• M ~ 1-10 s => silicon steady-state
• N – sprint duration
• M – re-sprint rate

Figure: guaranteed tasks on a time axis marked ti, ti+M and ti+M+N. The PTOT trace switches between PMAX and PREST; the internal energy U(t) stays below the bound Ub(t), with levels UN and UMAX marked; TAMB is shown as reference.
Dynamic Reliability Management
  Idea:
  Tune VDD & temperature at run-time to reach a target
  lifetime while minimizing the workload degradation
  State-of-the-art (flowchart):
  • Expected workload (… to lifetime) + performance goal -> expected reliability @ target lifetime
  • Expected reliability > reliability target?
     – No: cut performance (VDD, T)
     – Yes: give performance (VDD, T)

Issues:
1) Workload changes in seconds, reliability in years
2) Different workload
     ⇒ same performance cut
          ⇒ different final user QoS
Our Solution – Key innovations
1. Two-level controller @ two different time scales
     – Long intervals: predefined interval of time for reliability control
     – Short intervals: task scheduling periods
2. Reliability Speculation / Borrowing
     – Flag the task as:
          • Hard – High QoS (real-time, latency-constrained task)
          • Soft – Low QoS (background process)
     – Reliability target updated each long interval
          => average bound within the long interval
     – Speculation:
          • Hard Task – always runs at the maximum performance
          • Soft Task – performance constrained by the reliability target

 The Soft Task pays the reliability loss induced by the Hard Task
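The borrowing idea can be illustrated with a toy budget account over one long interval. This is a hypothetical accounting scheme written for illustration, not the controller from the slides: the operating points, the damage units, the greedy rule and the name `schedule_with_borrowing` are all assumptions.

```python
def schedule_with_borrowing(tasks, levels, target_damage_per_period):
    """Pick an operating level per scheduling period; returns (choices, total damage).

    tasks  : list of "hard" / "soft" flags, one per short interval
    levels : list of (performance, damage) pairs sorted by increasing performance
    """
    budget = target_damage_per_period * len(tasks)   # allowance for the long interval
    remaining = budget
    choices = []
    for k, kind in enumerate(tasks):
        periods_left = len(tasks) - k
        if kind == "hard":
            level = levels[-1]                       # always maximum performance
        else:
            allowed = remaining / periods_left       # average damage still affordable
            affordable = [lv for lv in levels if lv[1] <= allowed]
            level = affordable[-1] if affordable else levels[0]
        remaining -= level[1]                        # hard tasks may overdraw: "borrowing"
        choices.append((kind, level[0]))
    return choices, budget - remaining


# Hypothetical (performance, damage-per-period) operating points.
LEVELS = [(0.6, 1.0), (0.8, 2.0), (1.0, 4.0)]
TASKS = ["hard", "soft", "soft", "hard", "soft", "soft", "soft", "hard", "soft", "soft"]
choices, total_damage = schedule_with_borrowing(TASKS, LEVELS, target_damage_per_period=2.0)
print(choices, total_damage)
```

In this example the hard tasks overdraw the per-period average and the soft tasks are slowed down until the account balances at the end of the long interval, so the average bound holds only over the whole interval and not in every period.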
Single Core Controller Architecture

• The complete architecture is composed of two controllers:
   – Long Term Controller (LTC):
       • Monitors the core reliability and computes a voltage value that
         acts as a soft constraint for the Short Term Controller. It can
         be entirely model based or sensor aided.
   – Short Term Controller (STC):
       • Based on the soft constraint coming from the LTC, it assigns the
         operating voltage and frequency according to the workload
         requirements.

Figure: controller block diagram with a reliability sensor.
Task Mapping?
Problem:
  • How can task mapping take advantage of it?
  • Model Predictive Controller:
     • Schedule the tasks to minimize the
       thermal controller activations
  • Sprinting:
     • How to flag the tasks?
     • Sprint rooms are finite buffers in time
       - how to use them?
  • Reliability Borrowing:
     • The slow-down of soft tasks depends on the
       hard task rate – can we optimize it?

Weitere ähnliche Inhalte

Ähnlich wie Thermal Control Overview

New Approach for Intelligent Motor Control Centers
New Approach for Intelligent Motor Control CentersNew Approach for Intelligent Motor Control Centers
New Approach for Intelligent Motor Control CentersSchneider Electric
 
Mission critical computing by intel
Mission critical computing by intelMission critical computing by intel
Mission critical computing by intelHP ESSN Philippines
 
Computer Simulation of Induction Heating Process
Computer Simulation of Induction Heating ProcessComputer Simulation of Induction Heating Process
Computer Simulation of Induction Heating ProcessFluxtrol Inc.
 
ISCA2021 Tutorial-Methods for Characterization and Analysis of Voltage Margin...
ISCA2021 Tutorial-Methods for Characterization and Analysis of Voltage Margin...ISCA2021 Tutorial-Methods for Characterization and Analysis of Voltage Margin...
ISCA2021 Tutorial-Methods for Characterization and Analysis of Voltage Margin...Behzad Salami
 
EASA Part-66 Module 5.6 : Basic Computer Structure
EASA Part-66 Module  5.6 : Basic Computer StructureEASA Part-66 Module  5.6 : Basic Computer Structure
EASA Part-66 Module 5.6 : Basic Computer Structuresoulstalker
 
aps - product overview
aps - product overviewaps - product overview
aps - product overviewnnorbert
 
Design Verification at D2Audio
Design Verification at D2AudioDesign Verification at D2Audio
Design Verification at D2AudioDVClub
 
Mehta mayur d2_audio_dv_club_verification_flow
Mehta mayur d2_audio_dv_club_verification_flowMehta mayur d2_audio_dv_club_verification_flow
Mehta mayur d2_audio_dv_club_verification_flowObsidian Software
 
D2 audio dv_club_verification_flow
D2 audio dv_club_verification_flowD2 audio dv_club_verification_flow
D2 audio dv_club_verification_flowObsidian Software
 
Retrofitting existing Commerical Buildings for Smart Grid
Retrofitting existing Commerical Buildings for Smart GridRetrofitting existing Commerical Buildings for Smart Grid
Retrofitting existing Commerical Buildings for Smart GridCenter for Sustainable Energy
 
15.00 hr van Hilten
15.00 hr van Hilten15.00 hr van Hilten
15.00 hr van HiltenThemadagen
 
Fast Optimization Intevac
Fast Optimization IntevacFast Optimization Intevac
Fast Optimization Intevacvvk0
 
Energy Micro MCU Catalog
Energy Micro MCU CatalogEnergy Micro MCU Catalog
Energy Micro MCU Catalogifletcher
 
OPAL-RT RT14 Conference: HIL testing using HYPERSIM
OPAL-RT RT14 Conference: HIL testing using HYPERSIMOPAL-RT RT14 Conference: HIL testing using HYPERSIM
OPAL-RT RT14 Conference: HIL testing using HYPERSIMOPAL-RT TECHNOLOGIES
 
Introduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A SupercomputerIntroduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A SupercomputerFörderverein Technische Fakultät
 

Ähnlich wie Thermal Control Overview (20)

New Approach for Intelligent Motor Control Centers
New Approach for Intelligent Motor Control CentersNew Approach for Intelligent Motor Control Centers
New Approach for Intelligent Motor Control Centers
 
Thesis Presentation
Thesis PresentationThesis Presentation
Thesis Presentation
 
Mission critical computing by intel
Mission critical computing by intelMission critical computing by intel
Mission critical computing by intel
 
Sharam salamian
Sharam salamianSharam salamian
Sharam salamian
 
Computer Simulation of Induction Heating Process
Computer Simulation of Induction Heating ProcessComputer Simulation of Induction Heating Process
Computer Simulation of Induction Heating Process
 
XMC4000 Brochure
XMC4000 BrochureXMC4000 Brochure
XMC4000 Brochure
 
ISCA2021 Tutorial-Methods for Characterization and Analysis of Voltage Margin...
ISCA2021 Tutorial-Methods for Characterization and Analysis of Voltage Margin...ISCA2021 Tutorial-Methods for Characterization and Analysis of Voltage Margin...
ISCA2021 Tutorial-Methods for Characterization and Analysis of Voltage Margin...
 
EASA Part-66 Module 5.6 : Basic Computer Structure
EASA Part-66 Module  5.6 : Basic Computer StructureEASA Part-66 Module  5.6 : Basic Computer Structure
EASA Part-66 Module 5.6 : Basic Computer Structure
 
Smart grid aep_ge
Smart grid aep_geSmart grid aep_ge
Smart grid aep_ge
 
aps - product overview
aps - product overviewaps - product overview
aps - product overview
 
Design Verification at D2Audio
Design Verification at D2AudioDesign Verification at D2Audio
Design Verification at D2Audio
 
Mehta mayur d2_audio_dv_club_verification_flow
Mehta mayur d2_audio_dv_club_verification_flowMehta mayur d2_audio_dv_club_verification_flow
Mehta mayur d2_audio_dv_club_verification_flow
 
D2 audio dv_club_verification_flow
D2 audio dv_club_verification_flowD2 audio dv_club_verification_flow
D2 audio dv_club_verification_flow
 
Retrofitting existing Commerical Buildings for Smart Grid
Retrofitting existing Commerical Buildings for Smart GridRetrofitting existing Commerical Buildings for Smart Grid
Retrofitting existing Commerical Buildings for Smart Grid
 
15.00 hr van Hilten
15.00 hr van Hilten15.00 hr van Hilten
15.00 hr van Hilten
 
Fast Optimization Intevac
Fast Optimization IntevacFast Optimization Intevac
Fast Optimization Intevac
 
Recognizing Sales Opportunities in Protective Relaying
Recognizing Sales Opportunities in Protective RelayingRecognizing Sales Opportunities in Protective Relaying
Recognizing Sales Opportunities in Protective Relaying
 
Energy Micro MCU Catalog
Energy Micro MCU CatalogEnergy Micro MCU Catalog
Energy Micro MCU Catalog
 
OPAL-RT RT14 Conference: HIL testing using HYPERSIM
OPAL-RT RT14 Conference: HIL testing using HYPERSIMOPAL-RT RT14 Conference: HIL testing using HYPERSIM
OPAL-RT RT14 Conference: HIL testing using HYPERSIM
 
Introduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A SupercomputerIntroduction to National Supercomputer center in Tianjin TH-1A Supercomputer
Introduction to National Supercomputer center in Tianjin TH-1A Supercomputer
 

Kürzlich hochgeladen

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Kürzlich hochgeladen (20)

The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf

Thermal Control Overview

  • 1. Thermal Control Activities Andrea Bartolini a.bartolini@unibo.it
  • 2. Outline • Thermal Management • Thermal Model Learning • Model Predictive Controller • Computational Sprinting • Reliability Borrowing
  • 3. Thermal Management. Drivers: technology scaling, software, and system integration. Technology scaling brings high performance requirements, high power densities, and leakage current; software brings spatial and temporal workload variation; system integration brings cost pressure and limited dissipation capabilities. Result: NON-UNIFORM power, temperature, and performance. Consequences: hot spots, thermal gradients and cycles; reliability loss and aging.
  • 4. Thermal Management. Dynamic approach to the non-uniform power, temperature, and performance of slide 3: on-line tuning of system performance and temperature through closed-loop control.
  • 5. DRM - General Architecture. System: applications App.1 ... App.N with their threads run on the O.S., which hosts the controller (SW); the hardware (HW) has CPU1 ... CPUN, each with an L1 cache and f,v / PGATING knobs, plus L2 caches, a network, and DRAM. Sensors: performance counters (PMU), core temperature. Actuators (knobs): ACPI states (P-State -> DVFS, C-State -> PGATING), task allocation. Controller: reactive, threshold/heuristic, control theory, predictive. Simulation snapshot: TCPU, #L1MISS, #BUSACCESS, CYCLEACTIVE, ....
  • 6. Thermal Controller [Intel®, ISSCC 2007]. Threshold-based controller: T > Tmax -> low freq; T < Tmin -> high freq; cannot prevent overshoot; causes thermal cycles. Classical feedback controller (PID controllers): better than the threshold-based approach, but still cannot prevent overshoot. Model Predictive Controller (MPC): internal prediction avoids overshoot; optimization maximizes performance; centralized, aware of the thermal influence of neighbor cores; all at once (MIMO controller); complexity!!! MPC loop: from the target frequency and the past inputs & outputs, the thermal model predicts the future output; the optimizer takes the future error, a cost function, and the constraints, and computes the future input.
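The threshold rule on slide 6 can be tried on a toy model. The sketch below is not from the slides: it assumes a single core described by a first-order RC thermal model, and every constant (`r_th`, `c_th`, the two power levels, the thresholds) is an illustrative value. It shows the two drawbacks the slide lists: the temperature still crosses Tmax before the controller reacts, and the core keeps cycling between the two frequencies.

```python
def simulate_threshold_control(steps=1000, dt=0.01):
    """Threshold (hysteresis) DVFS control of one core on a first-order RC model."""
    r_th, c_th, t_amb = 2.0, 0.5, 40.0       # K/W, J/K, degC (illustrative values)
    t_max, t_min = 80.0, 70.0                # switching thresholds, degC
    power = {"high": 40.0, "low": 10.0}      # W drawn at each frequency level
    temp, level = t_amb, "high"
    temps, switches = [], 0
    for _ in range(steps):
        # Threshold rule from the slide: T > Tmax -> low freq, T < Tmin -> high freq.
        if temp > t_max and level == "high":
            level, switches = "low", switches + 1
        elif temp < t_min and level == "low":
            level, switches = "high", switches + 1
        # Forward-Euler step of C * dT/dt = P - (T - Tamb) / R.
        temp += dt / c_th * (power[level] - (temp - t_amb) / r_th)
        temps.append(temp)
    return temps, switches


temps, switches = simulate_threshold_control()
print(f"peak temperature: {max(temps):.2f} degC, frequency switches: {switches}")
```

The controller only reacts after the threshold has been crossed, so the peak is always slightly above Tmax; a predictive controller would instead limit the power before the crossing.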
  • 7. Background – Thermal Modeling. Two models in cascade: task j -> power model, P = g(task, f) -> power Pj -> thermal model -> temperature Tj. With several tasks: powers Pn,j and temperatures Tn,j.
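As a concrete reading of P = g(task, f), here is a minimal sketch of a power model and its inverse. The slides do not give the form of g, so everything here is an assumption: the task is summarized by its CPI, dynamic power is taken to grow with the cube of the frequency (voltage assumed to scale with frequency), and `P_STATIC`, `K_DYN`, `F_MIN`, `F_MAX`, and `activity` are hypothetical names and values.

```python
P_STATIC = 2.0            # W, assumed frequency-independent leakage
K_DYN = 1.0               # W/GHz^3, assumed dynamic-power coefficient
F_MIN, F_MAX = 0.8, 3.0   # GHz, assumed DVFS range


def activity(cpi):
    """Assumed switching factor: a lower CPI means a busier core."""
    return 1.0 / (1.0 + cpi)


def g(cpi, freq):
    """Hypothetical power model P = g(task, f), with the task summarized by its CPI."""
    return P_STATIC + K_DYN * activity(cpi) * freq ** 3


def g_inv(cpi, power):
    """Inverse map from a power budget to a frequency, clamped to the DVFS range."""
    dynamic = max(power - P_STATIC, 0.0)
    freq = (dynamic / (K_DYN * activity(cpi))) ** (1.0 / 3.0)
    return min(max(freq, F_MIN), F_MAX)


print(g(1.0, 2.0), g_inv(1.0, g(1.0, 2.0)))
```

Keeping g monotonic in the frequency is what makes the inverse well defined; any other monotonic model could be swapped in.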
  • 8. MPC Robustness: System Identification. MPC needs a thermal model (an identified state-space thermal model inside the MPC loop). The model must be accurate, with low complexity; must be known a priori; depends on user configuration; changes with system ageing. "In field" self-calibration on the multicore: force test workloads (training tasks) on Core1 ... CoreN in place of the normal workload execution; measure the core temperatures; run system identification on the power and temperature of each core.
  • 9. Centralized Thermal Modelling. i7 server platform, 4 cores; Ts = 1 ms; quantization noise. Step response: black box.
  • 10. Centralized Thermal Modelling. i7 server platform, 4 cores; Ts = 1 ms; quantization noise. Step response: black box vs. our approach (LS + physical constraints).
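To illustrate what "LS + physical constraints" can mean, the sketch below fits a first-order single-core model by least squares while forcing the parameters to stay physically meaningful. It is not the method used in the slides: the model structure, the bounds (0 <= alpha <= 1, beta >= 0), and the names `fit_rc_model` and `simulate` are assumptions made for the example. With only two parameters, the bounded problem can be solved exactly by trying every combination of active bounds.

```python
def fit_rc_model(temps, powers, t_amb):
    """Fit T[k+1] - T[k] = alpha*(t_amb - T[k]) + beta*P[k], 0 <= alpha <= 1, beta >= 0.

    Assumes the data are exciting enough that the sums below are non-degenerate.
    """
    x1 = [t_amb - t for t in temps[:-1]]
    x2 = list(powers[:len(x1)])
    y = [b - a for a, b in zip(temps[:-1], temps[1:])]
    s11 = sum(v * v for v in x1)
    s22 = sum(v * v for v in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    b1 = sum(a * b for a, b in zip(x1, y))
    b2 = sum(a * b for a, b in zip(x2, y))

    def cost(alpha, beta):
        # Sum of squared residuals, up to the constant sum of y^2.
        return (alpha * alpha * s11 + 2 * alpha * beta * s12 + beta * beta * s22
                - 2 * alpha * b1 - 2 * beta * b2)

    det = s11 * s22 - s12 * s12
    # Unconstrained least-squares solution of the 2x2 normal equations.
    candidates = [((b1 * s22 - b2 * s12) / det, (b2 * s11 - b1 * s12) / det)]
    for alpha in (0.0, 1.0):                       # alpha pinned to a bound
        candidates.append((alpha, (b2 - alpha * s12) / s22))
    candidates.append((b1 / s11, 0.0))             # beta pinned to its bound
    candidates += [(0.0, 0.0), (1.0, 0.0)]         # both pinned
    feasible = [(a, b) for a, b in candidates if 0.0 <= a <= 1.0 and b >= 0.0]
    return min(feasible, key=lambda ab: cost(*ab))


def simulate(alpha, beta, t_amb, powers):
    """Generate temperatures from the same first-order model (for the demo only)."""
    temps = [t_amb]
    for p in powers:
        temps.append(temps[-1] + alpha * (t_amb - temps[-1]) + beta * p)
    return temps


powers = ([20.0] * 100 + [5.0] * 100) * 5          # step-like training workload
temps = simulate(0.05, 0.1, 25.0, powers)
print(fit_rc_model(temps, powers, 25.0))
print(fit_rc_model([float(round(t)) for t in temps], powers, 25.0))  # quantized sensor
```

On quantized readings a pure black-box fit may return values with no physical meaning (for example a negative gain); the bounded fit cannot, which is the point of adding the constraints.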
  • 11. Distributed Thermal Modelling. Single-chip Cloud Computer (SCC), 48 cores; Ts = 100 ms; measurement noise. Standard ARX: designed only for process noise!
  • 12. Distributed Thermal Modelling. Single-chip Cloud Computer (SCC), 48 cores; Ts = 100 ms; measurement noise. Standard ARX: designed only for process noise! Bias-compensated ARX: iteratively estimates the noise variance and compensates for it in the LS. Plot: residual correlation at lags 0..8.
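The bias-compensation idea can be sketched on a first-order ARX model. This is a simplified illustration under stated assumptions, not the procedure from the slides: the model is y[k+1] = a*y[k] + b*u[k], the sensor adds white noise to y, and the names `solve2` and `bias_compensated_arx` are made up for the example. Plain least squares is biased because the noisy y[k] is used as a regressor; the loop estimates the noise variance from the LS residual and subtracts it from the regressor autocorrelation.

```python
import random


def solve2(m, v):
    """Solve the 2x2 linear system m @ x = v by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return ((v[0] * m[1][1] - v[1] * m[0][1]) / det,
            (v[1] * m[0][0] - v[0] * m[1][0]) / det)


def bias_compensated_arx(y, u, iterations=20):
    """Fit y[k+1] = a*y[k] + b*u[k] when y carries white measurement noise."""
    n = len(y) - 1
    ryy = sum(y[k] * y[k] for k in range(n)) / n
    ryu = sum(y[k] * u[k] for k in range(n)) / n
    ruu = sum(u[k] * u[k] for k in range(n)) / n
    r1 = sum(y[k] * y[k + 1] for k in range(n)) / n
    r2 = sum(u[k] * y[k + 1] for k in range(n)) / n
    a_ls, b_ls = solve2([[ryy, ryu], [ryu, ruu]], (r1, r2))   # plain LS (biased)
    # Mean squared residual of the plain LS fit.
    j_ls = sum((y[k + 1] - a_ls * y[k] - b_ls * u[k]) ** 2 for k in range(n)) / n
    a_bc, b_bc = a_ls, b_ls
    for _ in range(iterations):
        noise_var = j_ls / (1.0 + a_bc * a_ls)                # noise variance estimate
        # Remove the noise contribution from the autocorrelation of y.
        a_bc, b_bc = solve2([[ryy - noise_var, ryu], [ryu, ruu]], (r1, r2))
    return (a_ls, b_ls), (a_bc, b_bc), noise_var


random.seed(0)
a_true, b_true, sigma = 0.9, 0.5, 3.0       # illustrative plant and sensor noise
u = []
for _ in range(100):                        # 100 power levels, each held for 50 samples
    u += [random.uniform(0.0, 20.0)] * 50
clean = [0.0]
for p in u:
    clean.append(a_true * clean[-1] + b_true * p)
y = [v + random.gauss(0.0, sigma) for v in clean]

(a_ls, b_ls), (a_bc, b_bc), noise_var = bias_compensated_arx(y, u)
print(f"LS a = {a_ls:.3f}, compensated a = {a_bc:.3f}, noise variance = {noise_var:.2f}")
```

The noise-variance formula used here follows from the expected LS residual for this first-order model with white output noise; higher-order or multi-core models need the corresponding vector form.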
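  • 13. Distributed Thermal Controller. One MPC controller per core (core 1 shown). The nonlinear frequency-to-power map g(·) converts the requested frequency f1,EC and the CPI into a power P1,EC. A QP optimizer (implicit formulation), subject to (s.t.) the linear power-to-temperature model (2 states per core) with TENV and the state x1, returns the power P1,TC. The inverse map g^-1(·) converts P1,TC and the CPI back into the frequency f1,TC. A classic Luenberger state observer provides x1 from the measured temperature T1.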
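The observer block of slide 13 can be sketched as follows. The matrices are not from the slides: `A`, `B`, `C` describe an assumed 2-state model of one core (temperatures relative to ambient, only the first state measured), and the gain `L` was picked by hand so that the eigenvalues of A - L*C (about 0.87 and 0.20) lie inside the unit circle.

```python
A = [[0.90, 0.05], [0.02, 0.97]]   # assumed 2-state thermal dynamics of one core
B = [0.10, 0.00]                   # power enters the first state
C = [1.0, 0.0]                     # only the first state is measured
L = [0.80, 1.50]                   # observer gain (hand-picked, see lead-in)


def step(x, p):
    """One step of the model x[k+1] = A x[k] + B p[k]."""
    return [A[i][0] * x[0] + A[i][1] * x[1] + B[i] * p for i in range(2)]


def observe(x_hat, p, t_meas):
    """One observer step: model prediction plus a correction from the output error."""
    err = t_meas - (C[0] * x_hat[0] + C[1] * x_hat[1])
    pred = step(x_hat, p)
    return [pred[i] + L[i] * err for i in range(2)]


x, x_hat = [10.0, 5.0], [0.0, 0.0]            # true state vs. wrong initial estimate
for k in range(200):
    p = 20.0 if (k // 25) % 2 == 0 else 5.0   # square-wave power input
    t_meas = C[0] * x[0] + C[1] * x[1]        # noise-free sensor reading
    x, x_hat = step(x, p), observe(x_hat, p, t_meas)

final_error = max(abs(x[i] - x_hat[i]) for i in range(2))
print(f"estimation error after 200 steps: {final_error:.2e}")
```

The estimation error obeys e[k+1] = (A - L*C) e[k], so it decays regardless of the power input; the unmeasured second state is recovered from the measured temperature alone.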
  • 14. Outline • Thermal Management • Thermal Model Learning • Model Predictive Controller • Computational Sprinting • Reliability Borrowing
  • 15. Computational Sprinting. TDP is statically defined on worst-case power: it considers only the thermal resistance (-> package optimized to minimize it) and does not allow powering on all the cores. Figure: grid of cores, with TAMB, TCORE, PCHIP, and TDP.
  • 16. Computational Sprinting (continued). The application requires maximum performance in short bursts: thermal capacitance = "heat buffer"; use the heat buffer to run all cores at maximum performance for a short time window; triggered on demand by peak parallel workload phases & user interaction. Figure: TAMB, TCORE, TCHIP and its safe level, PCHIP vs. TDP, core grids for the sequential and parallel phases.
  • 17. Computational Sprinting (continued). Augment the sprint duration with Phase Change Material (PCM). Heat tanks -> need restore phases!! Need to allocate the sprint phases to the most QoS-critical task!! Figure: TAMB, PCM, TCORE, TCHIP and its safe level, PCHIP vs. TDP.
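The "heat buffer" argument can be made quantitative with a lumped RC model. The sketch below is an illustration under stated assumptions, not a model from the slides: the chip is a single thermal resistance `r_th` and capacitance `c_th`, the PCM is ignored, and the numeric values are invented. Power above the TDP is sustainable only until the capacitance has charged up to the temperature limit.

```python
import math


def sprint_duration(p_sprint, r_th, c_th, t_amb, t_start, t_max):
    """Seconds a lumped RC model can sustain p_sprint before reaching t_max."""
    t_steady = t_amb + p_sprint * r_th          # temperature if the sprint never ended
    if t_steady <= t_max:
        return math.inf                         # sustainable power: within the TDP
    if t_start >= t_max:
        return 0.0
    # T(t) = t_steady + (t_start - t_steady) * exp(-t / (R*C)); solve T(t) = t_max.
    return r_th * c_th * math.log((t_steady - t_start) / (t_steady - t_max))


# Illustrative numbers: R = 1 K/W, C = 10 J/K, ambient 25 degC, limit 85 degC.
tdp = (85.0 - 25.0) / 1.0                       # TDP from the resistance alone: 60 W
print(tdp, sprint_duration(120.0, 1.0, 10.0, 25.0, 25.0, 85.0))
```

With these numbers a sprint at twice the TDP lasts R*C*ln(2), about 6.9 s, starting from ambient; a larger capacitance (which is the role the slides give to the PCM) stretches that window proportionally.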
  • 18. Re-Sprinting Controller. Two-level hierarchical controller. Upper level: a PCM Model Predictive Controller that takes the per-core targets PTARGET,0 ... PTARGET,i ... PTARGET,16, the core temperatures TCORE,0 ... TCORE,i ... TCORE,N, TPCM, TAMB, and the bound Ub(●), and produces P*CORE,i ... P*CORE,16. Lower level: one thermal MPC per core, which takes P*CORE,i, TCORE,i, and the neighbor temperatures TNEIGH,i, and sets PCORE,0 ... PCORE,i ... PCORE,16.
  • 19. Guaranteed re-sprinting. One RC cell per core, $P_{TOT} = \sum_i P_{CORE,i}$; time-varying internal energy bound Ub. M ~ 1-10 s => silicon steady state. Figure: guaranteed tasks and PTOT over time, with N = sprint duration and M = re-sprint rate; time marks ti, ti+M, ti+M+N.
  • 20. Guaranteed re-sprinting. One RC cell per core, $P_{TOT} = \sum_i P_{CORE,i}$; time-varying internal energy bound Ub. M ~ 1-10 s => silicon steady state. Figure: internal energy U(t) against the bound Ub(t), with levels UMAX and UN; PTOT between PMAX and PREST; TAMB; time marks ti, ti+M, ti+M+N.
  • 21. Outline • Thermal Management • Thermal Model Learning • Model Predictive Controller • Computational Sprinting • Reliability Borrowing
  • 22. Dynamic Reliability Management. Idea: tune VDD and temperature at run time to reach a target lifetime while minimizing the workload degradation. State of the art (flowchart): the expected workload (… to lifetime) and the performance goal give the expected reliability @ target lifetime; if it is not above the reliability target, cut performance (lower VDD, T); otherwise give performance (raise VDD, T). Issues: 1) The workload changes in seconds, reliability in years. 2) Different workloads ⇒ same performance cut ⇒ different final user QoS.
  • 23. Our Solution – Key innovations. 1. Two-level controller at two different time scales: long intervals (a predefined interval of time for reliability control) and short intervals (task scheduling periods). 2. Reliability speculation / borrowing: flag each task as Hard (high QoS: real-time, latency-constrained task) or Soft (low QoS: background process). The reliability target is updated each long interval => average bound within the long interval. Speculation: a Hard task always runs at maximum performance; a Soft task has its performance constrained by the reliability target. The Soft task pays the reliability loss induced by the Hard task.
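The borrowing rule can be written as a small budget computation. This is a sketch under assumptions that are not in the slides: reliability loss is modelled as a scalar "loss rate" per operating point, the long-interval target is an average of that rate, and the names `soft_task_budget` and `pick_soft_level` as well as the operating points are hypothetical.

```python
def soft_task_budget(target_rate, hard_rate, hard_share):
    """Loss rate left for Soft tasks so the long-interval average meets the target.

    target_rate: average reliability-loss rate allowed over the long interval
    hard_rate:   loss rate of Hard tasks running at maximum performance
    hard_share:  fraction of the long interval spent in Hard tasks (0 <= share < 1)
    """
    remaining = target_rate - hard_share * hard_rate
    return max(remaining, 0.0) / (1.0 - hard_share)


def pick_soft_level(levels, budget):
    """Fastest (freq, loss_rate) point within the budget, else the slowest point."""
    allowed = [lv for lv in levels if lv[1] <= budget]
    return max(allowed) if allowed else min(levels)


# Hypothetical operating points: (frequency in GHz, reliability-loss rate).
levels = [(1.0, 0.3), (2.0, 0.7), (3.0, 2.0)]
for share in (0.10, 0.25):
    budget = soft_task_budget(1.0, 2.0, share)
    print(share, round(budget, 3), pick_soft_level(levels, budget))
```

In this toy example, raising the share of Hard tasks from 10% to 25% drops the Soft tasks from 2.0 GHz to 1.0 GHz: the Soft tasks repay what the Hard tasks borrowed.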
  • 24. Single Core Controller Architecture. The complete architecture is composed of two controllers. Long Term Controller (LTC): monitors the core reliability and computes a voltage value that is passed as a soft constraint to the Short Term Controller; it can be entirely model based or aided by a reliability sensor. Short Term Controller (STC): given the soft constraint coming from the LTC, it assigns the operating voltage and frequency according to the workload requirements.
  • 25. Task Mapping? Problem: how can task mapping take advantage of it? Model Predictive Controller: schedule the tasks to minimize the thermal controller activations. Sprinting: how to flag the tasks? Sprint rooms are finite buffers in time; how to use them? Reliability Borrowing: the slow-down of soft tasks depends on the hard-task rate; can we optimize it?