Power Optimization Through
Many-Core Multiprocessing
    Delivering High Performance in a Low Power World

                          ChipEx2012
                            Haydn Povey
            Marketing Director – Implementation & Security
                       ARM Processor Division




                              May 2, 2012
Billions of Connected Devices
Performance expectations continue to increase exponentially, but power
efficiency and scalability are becoming formidable challenges.

Form Factor                 TAM 2015 (millions)
Mobile Phones               1,750
Media Players               300
Mobile Computers            750
Desktop PCs                 150
Digital TV/STB              500
Automotive Infotainment     100
Other*                      450
Total                       4 billion
*Includes PND, photo-frames, etc.

Source: ABI Research, IDC, Gartner and ARM forecasts




Historic Technology Drivers




Era             Platform             Figure of merit
Up to 1980s     Mainframes/mini      Functionality
1990s           The PC               Functionality / $
2000s           Notebooks            Functionality / (Power × $)
2010s           Mobile Computing     Functionality / (Energy × $)

Low Power Positioned for the Future
• Going forward, low power is necessary for everything from microcontrollers to servers
• Low power is a design philosophy
    - Mindset, style, culture and working practice
    - Not something you change or acquire easily
• Low power is a design reality
    - ARM is an efficient architecture
    - None of the legacy or CISC complexity
• Low cost is a design and manufacturing partnership
    - Time to volume, not time to niche markets
    - Speed-binning is not good enough for the mass market
(Figure: 2010s, Mobile Computing: Functionality / (Energy × $))

Limitations with Multiprocessing
• Cost of offering the peak single-thread performance on each CPU quickly exceeds chassis thermal limits
• System and software bottlenecks limit overall scalability
• Single-die integration offered some roadmap

Evolution to Many-Core
• Base theorem
    - Simpler and smaller processor designs require exponentially less energy to accomplish the same amount of compute as a more complex and larger processor design
• “Approximate rule of thumb”
    - To increase performance by 50% you double the power and area cost of the processor design
    - Quickly reaches the point of diminishing returns
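
Worked example (a sketch, not part of the original slides): if the rule of thumb is read as a continuous power law, each doubling of power and area buys 1.5x performance, so performance scales as power raised to log2(1.5), roughly 0.585. The function names and the assumption of a perfectly parallel workload in `many_core_perf` are illustrative only.

```python
import math

# Rule of thumb from the slide: 2x power/area -> 1.5x performance.
# Assumption: treat it as a smooth power law, perf = power ** log2(1.5).
EXPONENT = math.log2(1.5)  # about 0.585


def single_core_perf(power):
    """Relative performance of one core given a relative power/area budget."""
    return power ** EXPONENT


def many_core_perf(total_power, n_cores):
    """Aggregate performance when the same budget is split over n equal cores.

    Assumes the workload parallelises perfectly, which real software rarely does.
    """
    return n_cores * single_core_perf(total_power / n_cores)


if __name__ == "__main__":
    budget = 4.0  # four times the power/area of a baseline core
    print("one large core  :", single_core_perf(budget))
    print("four small cores:", many_core_perf(budget, 4))
```

Under these assumptions a single core with four times the budget reaches about 2.25x the baseline, while four baseline cores reach up to 4x in aggregate, but only if the software can use all four.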




Challenge of Many-Core
• Many-core definition
    - Use ‘lots’ of smaller, more efficient processors to achieve a higher aggregate performance than can be reached through multiprocessing
• Smaller processors cannot execute the same single thread in the same time as a higher-performance processor, so they can’t execute existing applications effectively
• Many threads cannot easily be decomposed into simpler, smaller tasks so as to benefit from multiprocessing on the smaller processors
• Software development challenge


Software Data Decomposition
• Each data item is independent
• Split large quantity of DATA into smaller chunks that can be operated on in parallel
(Figure: a single TASK in front of four CPUs, then one copy of the TASK on each of the four CPUs)
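
A minimal sketch of data decomposition, assuming a hypothetical per-item operation (`process_chunk` squares each value) and four workers to mirror the four CPUs in the figure. The helper names are illustrative and not from the original deck.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_chunks):
    """Split data into n_chunks contiguous, nearly equal chunks."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """The same TASK applied to every item; items do not depend on each other."""
    return [x * x for x in chunk]


def parallel_map(data, n_workers=4):
    """Run the TASK on each chunk concurrently and stitch the results together."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(process_chunk, split(data, n_workers)))
    return [y for part in parts for y in part]


if __name__ == "__main__":
    print(parallel_map(list(range(10))))
```

In CPython, threads mainly illustrate the structure here; for CPU-bound work, `ProcessPoolExecutor` from the same module is the usual substitute.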




Software Task Decomposition
• Each task item is functionally independent
• Functionally independent tasks can be executed concurrently
(Figure: a queue of TASKs in front of four CPUs, then the TASKs shared out with several per CPU)
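
A minimal sketch of task decomposition, assuming four hypothetical, functionally independent tasks (the task names and return values are placeholders, not from the original deck). Each task is submitted to a small worker pool and may run concurrently with the others.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical independent tasks; none reads another's output.
def decode_audio():
    return "audio ready"


def sync_mail():
    return "mail synced"


def log_gps():
    return "gps logged"


def render_ui():
    return "ui drawn"


def run_tasks(tasks, n_workers=4):
    """Run independent callables concurrently and collect results by task name."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in tasks.items()}
        return {name: fut.result() for name, fut in futures.items()}


if __name__ == "__main__":
    print(run_tasks({"audio": decode_audio, "mail": sync_mail,
                     "gps": log_gps, "ui": render_ui}))
```

Because the tasks share no data, no ordering or locking is needed; the pool is free to place them on whichever worker is idle.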




Functional Block Partitioning
• Functional blocks are serially dependent
    - But temporally independent
• Distribute different functional blocks across available processors
    - Split into defined functional threads
    - Uses passing of data blocks between threads to allocate work
• Requires code changes and fine tuning

Example: Real Time Video Encoding
(Figure: simplified MPEG encoding functional block diagram, spread over CPU0 to CPU3 along a TIME axis. Blocks: Analogue Video Sampling, Remove Inter-Frame Redundancy, Remove Intra-Frame Redundancy, Motion Compensation, Quantise Samples, Run-Length Compress, Buffer Store)
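
A minimal sketch of functional block partitioning, assuming three placeholder stages that only loosely stand in for the encoder blocks (they are not real MPEG steps). Each stage runs in its own thread and passes data blocks to the next stage through a queue, so different frames can be in different stages at the same time.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def stage(func, inbox, outbox):
    """One functional block: take a data block, process it, pass it on."""
    while True:
        block = inbox.get()
        if block is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(func(block))


def run_pipeline(blocks, funcs):
    """Chain the functions as pipeline stages, one thread per stage."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(funcs)]
    for t in threads:
        t.start()
    for b in blocks:
        queues[0].put(b)
    queues[0].put(SENTINEL)
    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Placeholder stages (hypothetical, for illustration only).
def remove_redundancy(frame):
    return [x - frame[0] for x in frame]


def quantise(frame):
    return [x // 4 for x in frame]


def run_length(frame):
    runs = []
    for v in frame:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]


if __name__ == "__main__":
    frames = [[8, 8, 9, 16], [4, 4, 4, 4]]
    print(run_pipeline(frames, [remove_redundancy, quantise, run_length]))
```

The stages stay serially dependent for any one frame, but a new frame can enter the first stage while an earlier one is still being compressed, which is the temporal independence the slide refers to.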


Strategy Focus: The Thermal Wall
• SoC sustained power is limited in mobile devices by thermals:
    - 1.5 W to 2 W with low-cost PoP and stacked memories
    - 3 W without stacked memories
• Responsiveness is a must
• Complex active management is needed

(Figure: Power against Time)
    - Burst for responsiveness (e.g. browsing); “Opportunistic Residency”; T >= Tjmax, Tskin
    - Sustained performance (e.g. HD video record, gaming); Managed Sustained Power (Tj >= Tmax, Tj < Tmax); Un-managed Max Power (@ Tjmax)
    - Power-optimised low end (e.g. e-mail, voice, MP3)
Applying Nominal Use Case
• Typical day for a smartphone user
    - 90 min voice calling
    - 60 min email / social networking
    - 30 min reading the web
    - 50 min Angry Birds / other gaming
    - 90 min jogging while listening to music and logging GPS co-ordinates
    - 10 min video recording
    - 7 hrs sleep with music alarm clock
• OS typically executing ~28 active processes
    - Apps syncing in background



Use Case Measurements




Use Case Conclusion
Profiled CPU State    Minutes    % of Active CPU Time
Deep Sleep            1186       n/a
200 MHz               154        60%
500 MHz               69         27%
800 MHz               18         7%
1000 MHz              4          2%
1200 MHz              10         4%

If the phone was ARM big.LITTLE™ enabled...

Active CPU time
big       12%
LITTLE    88%
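
A small check of the arithmetic behind the table, using only the minutes listed on the slide. The function names are illustrative. The 500 MHz cut-off passed to `share_at_or_below` is a hypothetical choice; the slide does not say how the big/LITTLE split was derived.

```python
# Minutes per profiled CPU state, copied from the table (MHz -> minutes).
ACTIVE_MINUTES = {200: 154, 500: 69, 800: 18, 1000: 4, 1200: 10}
DEEP_SLEEP_MINUTES = 1186


def active_share(minutes_by_freq):
    """Percentage of active CPU time spent in each state, rounded to whole percent."""
    total = sum(minutes_by_freq.values())
    return {freq: round(100 * m / total) for freq, m in minutes_by_freq.items()}


def share_at_or_below(minutes_by_freq, limit_mhz):
    """Percentage of active time spent at or below a given frequency."""
    total = sum(minutes_by_freq.values())
    low = sum(m for freq, m in minutes_by_freq.items() if freq <= limit_mhz)
    return 100 * low / total


if __name__ == "__main__":
    print("active minutes:", sum(ACTIVE_MINUTES.values()))
    print("whole day     :", sum(ACTIVE_MINUTES.values()) + DEEP_SLEEP_MINUTES)
    print("share by state:", active_share(ACTIVE_MINUTES))
    print("<= 500 MHz    :", share_at_or_below(ACTIVE_MINUTES, 500))
```

The active states add up to 255 minutes, and together with deep sleep they cover 1441 minutes, about one day. The rounded shares reproduce the percentage column. With the hypothetical 500 MHz cut-off, roughly 87% of active time falls in the low states, close to the 88% LITTLE figure on the slide.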


big.LITTLE Processing
(Figure labels: Multiprocessing Capable; Many-core Benefits)


“big” Processor – Cortex-A15
• ARM Cortex™-A15 Processor
    - 3.5+ DMIPS/MHz
    - 1-4 core MPCore™ configurable
• Advanced Capabilities
    - Full ARMv7-A architecture
        - Thumb®-2, TrustZone®, VFP, NEON™
        - Virtualization, large address extensions
    - AMBA® 4 ACE™ coherency
• High Performance
    - Targeting 1.5 GHz mobile implementation on 28nm
    - Hard macro quad-core implementation @ 2 GHz on 28HPM process

“LITTLE” Processor – Cortex-A7
• ARM Cortex-A7 Processor
    - “LITTLE” to Cortex-A15 “big”
    - 1-4 core MPCore configurable
• Same Architectural Capabilities
    - Full ARMv7-A architecture
        - Thumb-2, TrustZone, VFP, NEON
        - Virtualization, large address extensions
    - AMBA 4 ACE coherency
    - ISA identical to Cortex-A15 processor
• High Performance
    - Up to 1.2 GHz for mobile implementation on 28nm

Comparison of big.LITTLE Pipelines




Performance Comparison




Power Efficiency Comparison




Software Use Models
• big.LITTLE Task Migration – one CPU active
    - Migrate between Cortex-A15 and Cortex-A7 depending on performance requirements
• big.LITTLE MP – both CPUs can be active
    - Allocate threads that need high performance to Cortex-A15
    - Allocate threads that don’t require high performance to Cortex-A7 for best energy efficiency
    - AMBA 4 hardware coherency between Cortex-A15 and Cortex-A7
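
The two use models can be sketched as placement policies. This is an illustration under stated assumptions, not ARM's switcher or scheduler: `Workload`, its `demand` field and `LITTLE_CAPACITY` are hypothetical, with demand normalised so that 1.0 is the most a Cortex-A7 can deliver.

```python
from dataclasses import dataclass

LITTLE_CAPACITY = 1.0  # hypothetical: normalised peak throughput of the Cortex-A7


@dataclass
class Workload:
    name: str
    demand: float  # hypothetical: required throughput on the same normalised scale


def task_migration(workloads):
    """Task Migration model: only one processor type is active at a time.

    Everything moves to the Cortex-A15 as soon as any workload needs more
    than the Cortex-A7 can deliver; otherwise everything stays on the Cortex-A7.
    """
    needs_big = any(w.demand > LITTLE_CAPACITY for w in workloads)
    return "Cortex-A15" if needs_big else "Cortex-A7"


def big_little_mp(workloads):
    """MP model: both processor types are active and each thread is placed individually."""
    return {w.name: "Cortex-A15" if w.demand > LITTLE_CAPACITY else "Cortex-A7"
            for w in workloads}


if __name__ == "__main__":
    mix = [Workload("mp3", 0.2), Workload("game", 1.8)]
    print(task_migration(mix))
    print(big_little_mp(mix))
```

With this mix, the migration model runs everything on the Cortex-A15, while the MP model keeps the light workload on the Cortex-A7. A real policy would also need hysteresis and migration cost, which this sketch leaves out.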




Task Migration Mechanics




CCI-400 Cache Coherent Interconnect
AMBA 4 compliant, 128-bit single layer at up to ½ Cortex-A15 frequency

CCI-400 2+3 (x3)
• 2 full AMBA 4 ACE slave interfaces
• +3 ACE-Lite I/O coherent slave interfaces
• x3 master interfaces

CCI interfaces:
• AMBA 4 ACE and ACE-Lite manage all coherency, shareability and barriers

(Figure: CoreLink™ CCI-400 Cache Coherent Interconnect, 128 bit @ up to 0.5 Cortex-A15 frequency, shown with GIC-400)
Into the CCI-400:
    Quad Cortex-A15        ACE -> ADB-400 -> 128b ACE
    Quad Cortex-A7         ACE -> ADB-400 -> 128b ACE
    Mali-T604 Graphics     ACE-Lite -> ADB-400 -> MMU-400 -> 128b ACE-Lite + DVM
    Coherent I/O device    ADB-400 -> MMU-400 -> 128b ACE-Lite + DVM
    DMA, LCD               NIC-400 (configurable: AXI 4/AXI 3/AHB) -> AXI 4 -> MMU-400 -> 128b ACE-Lite + DVM
Out of the CCI-400:
    2 x 128b ACE-Lite      DMC-400 -> 2 x PHY -> DDR3/2, LPDDR2/3
    1 x 128b ACE-Lite      AXI 4 -> NIC-400 (configurable: AXI 4/AXI 3/AHB/APB) -> Other Slaves
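
Worked arithmetic for the interconnect width and clock quoted above: a rough upper bound, assuming one 128-bit transfer per interconnect clock on a single data channel and ignoring protocol overhead. The function name is illustrative; the Cortex-A15 frequencies are the 1.5 GHz and 2 GHz figures from the Cortex-A15 slide.

```python
def peak_bandwidth_gb_s(a15_freq_ghz, bus_bits=128, ratio=0.5):
    """Theoretical peak of one data channel in GB/s.

    Assumes the interconnect runs at `ratio` times the Cortex-A15 clock and
    moves `bus_bits` bits every interconnect cycle.
    """
    interconnect_ghz = a15_freq_ghz * ratio
    return interconnect_ghz * bus_bits / 8


if __name__ == "__main__":
    for f in (1.5, 2.0):
        print(f"Cortex-A15 at {f} GHz -> {peak_bandwidth_gb_s(f)} GB/s per channel")
```

Under these assumptions the ceiling is 12 GB/s per channel at 1.5 GHz and 16 GB/s at 2 GHz; sustained figures in a real system would be lower.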



Summary
• Multiprocessing lets today’s applications scale while maintaining single-thread performance
    - Handles the multi-tasking of stacked usage scenarios well
• Many-core brings the energy advantages of simpler and smaller processors, but with the challenges of software complexity and a lack of backwards compatibility with respect to single-thread performance
• big.LITTLE processing, as delivered by the ARM Cortex-A15 and Cortex-A7, offers both the performance and compatibility advantages of multiprocessing and the power efficiency and scalability advantages of many-core processing


Partners Life - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Power Optimization Through Manycore Multiprocessing

  • 1. Power Optimization Through Many-Core Multiprocessing: Delivering High Performance in a Low Power World. ChipEx2012. Haydn Povey, Marketing Director – Implementation & Security, ARM Processor Division. May 2, 2012
  • 2. Billions of Connected Devices. Performance expectations continue to increase exponentially, but power efficiency and scalability are becoming formidable challenges.
       Form Factor                TAM (m), 2015
       Mobile Phones              1,750
       Media players              300
       Mobile Computers           750
       Desktop PCs                150
       Digital TV/STB             500
       Automotive Infotainment    100
       Other*                     450
       Total                      4 billion
       *Includes PND, photo-frames, etc. Source: ABI Research, IDC, Gartner and ARM forecasts.
  • 3. Historic Technology Drivers
      • Up to 1980s, Mainframes/mini: Functionality
      • 1990s, The PC: Functionality / $
      • 2000s, Notebooks: Functionality / (Power × $)
      • 2010s, Mobile Computing: Functionality / (Energy × $)
  • 4. Low Power Positioned for the Future
      • Going forward, low power is necessary for everything from microcontrollers to servers
      • Low power is a design philosophy: mindset, style, culture and working practice; not something you change or acquire easily
      • Low power is a design reality: ARM is an efficient architecture, with none of the legacy or CISC complexity
      • Low cost is a design & manufacturing partnership: time to volume, not time to niche markets; speed-binning is not good enough for the mass market
      • (Figure: 2010s, Mobile Computing: Functionality / (Energy × $))
  • 5. Limitations with Multiprocessing
      • Cost of offering the peak single thread performance on each CPU quickly exceeds chassis thermal limits
      • System and software bottlenecks limit overall scalability
      • Single die integration offered some roadmap
  • 6. Evolution to Many-Core
      • Base theorem: simpler and smaller processor designs require exponentially less energy to accomplish the same amount of compute as a more complex and larger processor design
      • “Approximate rule of thumb”: to increase performance by 50% you double the power and area cost of the processor design
      • Quickly reaches the point of diminishing returns
  • 7. Challenge of Many-Core
      • Many-core definition: use ‘lots’ of smaller, more efficient processors to achieve a higher aggregate performance than can be reached through multiprocessing
      • Smaller processors are not capable of executing the same single thread as a higher performance processor in the same time, so they can’t execute existing applications effectively
      • Many threads cannot easily be decomposed into simpler, smaller tasks so as to benefit from multiprocessing on the smaller processor
      • Software development challenge
  • 8. Software Data Decomposition
      • Each data item is independent
      • Split a large quantity of DATA into smaller chunks that can be operated on in parallel
      • (Figure: TASK blocks mapped onto CPU blocks)
  • 9. Software Task Decomposition
      • Each task item is functionally independent
      • Functionally independent tasks can be executed concurrently
      • (Figure: TASK blocks mapped onto CPU blocks)
  • 10. Functional Block Partitioning
      • Functional blocks are serially dependent, but temporally independent
      • Distribute different functional blocks across available processors
      • Split into defined functional threads
      • Uses passing of data blocks between threads to allocate work
      • Requires code changes and fine tuning
      • Example: Real Time Video Encoding (simplified MPEG encoding functional block diagram, plotted against TIME). Blocks: Analogue Video Sampling, Remove Inter-Frame Redundancy, Remove Intra-Frame Redundancy, Quantise Samples, Run-Length Compress, Buffer Store, Motion Compensation. Processors shown: CPU0, CPU1, CPU2, CPU3
  • 11. Strategy Focus: The Thermal Wall
      • SOC sustained power is limited in mobile devices by thermals: 1.5W to 2W with low-cost POP and stacked memories; 3W without stacked memories
      • Responsiveness is a must
      • Complex active management is needed
      • (Figure: Power against Time. Labels: Burst for responsiveness (e.g. Browsing); “Opportunistic Residency”; T >= Tjmax, Tskin; Managed Sustained Power; Tj >= Tmax; Tj < Tmax; Un-managed Max Power (@Tjmax); Sustained performance (e.g. HD Video Record, Gaming); Power Optimised Low End (e.g. e-Mail, Voice, MP3))
  • 12. Applying Nominal Use Case
      • Typical Day for Smartphone User
      • 90 min voice calling
      • 60 min email / social networking
      • 30 min reading web
      • 50 min Angry Birds / other gaming
      • 90 min jogging while listening to music and logging GPS co-ordinates
      • 10 min video recording
      • 7 hrs sleep with music alarm clock
      • OS typically executing ~28 active processes
      • Apps synching in background
  • 13. Use Case Measurements
  • 14. Use Case Conclusion
       Profiled CPU States    Minutes    % of CPU Active
       Deep Sleep             1186       n/a
       200 MHz                154        60%
       500 MHz                69         27%
       800 MHz                18         7%
       1000 MHz               4          2%
       1200 MHz               10         4%
       If the phone was ARM big.LITTLE™ enabled... Active CPU time: 12% big, 88% LITTLE
  • 15. big.LITTLE Processing: Multiprocessing Capable, Many-core Benefits
  • 16. “big” Processor – Cortex-A15
      • ARM Cortex™-A15 Processor: 3.5+ DMIPS/MHz; 1-4 core MPCore™ configurable
      • Advanced Capabilities: full ARMv7A architecture; Thumb®-2, TrustZone®, VFP, NEON™; virtualization, large address extensions; AMBA® 4 ACE™ coherency
      • High Performance: targeting 1.5GHz mobile implementation on 28nm; Hard Macro quad-core implementation @ 2GHz on 28HPM process
  • 17. “LITTLE” Processor – Cortex-A7
      • ARM Cortex-A7 Processor: “LITTLE” to Cortex-A15 “big”; 1-4 core MPCore configurable
      • Same Architectural Capabilities: full ARMv7A architecture; Thumb-2, TrustZone, VFP, NEON; virtualization, large address extensions; AMBA 4 ACE coherency; ISA identical to Cortex-A15 processor
      • High Performance: up to 1.2GHz for mobile implementation on 28nm
  • 18. Comparison of big.LITTLE Pipelines
  • 19. Performance Comparison
  • 20. Power Efficiency Comparison
  • 21. Software Use Models
      • big.LITTLE Task Migration (one CPU active): migrate between Cortex-A15 and Cortex-A7 depending on performance requirements
      • big.LITTLE MP (both CPUs can be active): allocate threads that need high performance to Cortex-A15; allocate threads that don’t require high performance to Cortex-A7 for best energy efficiency
      • AMBA 4 hardware coherency between Cortex-A15 and Cortex-A7
  • 22. Task Migration Mechanics
  • 23. CCI-400 Cache Coherent Interconnect
      • AMBA 4 compliant, 128-bit single layer at up to ½ Cortex-A15 frequency
      • 2 full AMBA 4 ACE slave interfaces
      • +3 ACE-Lite I/O coherent slave interfaces
      • x3 master interfaces
      • CCI interfaces: AMBA 4 ACE and ACE-Lite manage all coherency, shareability and barriers
      • (Figure: block diagram. Labels: GIC-400; Quad Cortex-A15 and Quad Cortex-A7, each via ADB-400 over 128b ACE; Mali-T604 Graphics; Coherent I/O device; DMA; LCD; NIC-400 (Configurable AXI 4/AXI 3/AHB); MMU-400 (x3) over 128b ACE-Lite + DVM; CoreLink™ CCI-400 Cache Coherent Interconnect, 2+3 (x3), 128 bit @ up to 0.5 Cortex-A15 frequency; 128b ACE-Lite outputs to DMC-400 with two PHYs (DDR3/2, LPDDR2/3) and to NIC-400 (Configurable AXI 4/AXI 3/AHB/APB) via AXI 4; Other Slaves)
  • 24. Summary
      • Multiprocessing enables the scaling of today’s applications to grow while maintaining single thread performance; it addresses nicely the multi-tasking of stacked usage scenarios
      • Many-core brings the energy advantages of simpler and smaller processors, but with the challenge of software complexity and lack of backwards compatibility with respect to single thread performance
      • big.LITTLE processing, as delivered by the ARM Cortex-A15 and Cortex-A7, offers both the performance and compatibility advantages of multiprocessing along with the power efficiency and scalability advantages of many-core processing
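
The functional block partitioning described on slide 10 (blocks that depend on each other serially but can overlap in time across a stream) can be sketched as a chain of threads connected by queues. This is a minimal Python illustration under stated assumptions, not the presenter's code: `stage`, `run_pipeline` and the three stand-in functions are hypothetical names, and their arithmetic is only a placeholder for real encoder blocks. In CPython the threads show the structure of the hand-off rather than a true multi-core speed-up.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream (assumes no real block is None)


def stage(func, inbox, outbox):
    """Run one functional block: take a data block, process it, pass it on."""
    while True:
        block = inbox.get()
        if block is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            break
        outbox.put(func(block))


def run_pipeline(funcs, blocks):
    """Connect one thread per stage with FIFO queues and stream blocks through."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for block in blocks:
        queues[0].put(block)
    queues[0].put(SENTINEL)

    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Hypothetical stand-ins for encoder blocks; the arithmetic is a placeholder.
def sample(frame):
    return frame + 1


def remove_redundancy(frame):
    return frame * 2


def quantise(frame):
    return frame // 2


encoded = run_pipeline([sample, remove_redundancy, quantise], range(5))
```

Because each stage is a single thread reading from a FIFO queue, blocks leave the pipeline in the order they entered, while different stages can work on different frames at the same moment.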

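The percentages on slide 14 can be checked from the profiled minutes. The sketch below, in Python, recomputes each frequency state's share of active CPU time. The big/LITTLE split point `LITTLE_MAX_MHZ` is an assumption made for illustration; the slide does not say how its 12% / 88% figures were derived, and a split at 500 MHz gives roughly 87% / 13%, close to but not identical to the slide.

```python
# CPU-minutes per frequency state (MHz), copied from slide 14
minutes = {200: 154, 500: 69, 800: 18, 1000: 4, 1200: 10}
deep_sleep = 1186

active = sum(minutes.values())  # total active minutes
share = {mhz: round(100 * m / active) for mhz, m in minutes.items()}

# Assumption (not from the slide): states at or below this run on the LITTLE core
LITTLE_MAX_MHZ = 500
little = sum(m for mhz, m in minutes.items() if mhz <= LITTLE_MAX_MHZ)
little_pct = 100 * little / active
big_pct = 100 - little_pct

print(share)
print(f"LITTLE {little_pct:.1f}%  big {big_pct:.1f}%")
print(f"profiled minutes in total: {deep_sleep + active}")
```

The rounded shares match the table, and deep sleep plus active time comes to 1441 minutes, about one day.
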
Editor's Notes

  1. The performance requirements of handsets and other mobile devices continue to grow exponentially, with new applications, advanced gaming, and traditional PC-type functionality migrating rapidly to these platforms. While this capability enables the next wave of digital revolution, it comes at the price of increased power usage and potential thermal challenges. This presentation will investigate the issues and compromises traditionally required to push performance to the next level, and the challenges we face as an industry if we do not architecturally innovate on the implementation of advanced systems. We will demonstrate key advances in future processor designs and highlight the advantages and challenges faced as we look to deliver high performance in the low power world.
  2. EXAMPLE: Digital camera sport mode (burst mode). Take a lot of pictures, then filter and JPEG-encode them on the go. Each picture is an independent work item and can be processed in parallel. Instead of processing the pictures one at a time, one after the other, you can process them in parallel, which gives quicker execution. Then switch off cores and go to sleep: low leakage and no dynamic power consumption. ANOTHER EXAMPLE: Complex post-processing on a large RAW digital image. You can have more than one thread concurrently acting on the input data and writing to the output image (reads can overlap).
  3. EXAMPLE: You have more than one application running at the same time. On a single core your multitasking OS will time-slice. On a multi-core system things will happen in parallel. They will execute in less time and be more responsive (e.g. the UI).
  4. EXAMPLE: VIDEO CODEC: This works because a video codec processes a stream. Within a single frame, and within a group of frames, there are all sorts of dependencies, BUT this is a stream, so while you are storing the result of an encoded frame you can already be calculating the maths of the following frames, sampling the next one, and so on. Each core can have a task allocated to it, and the code needs to be modified so that these tasks synchronise and communicate with each other. Distribute different functional blocks of the decoder across available processors. Multi-task pipeline, e.g. taskA -> taskB -> (multiple) taskC -> taskD. Split into defined functional threads. Uses passing of data blocks between threads to allocate work.
  5. Start with a cheap package (high thermal resistance: 15°C/W Theta-jb, 30°C/W Theta-ja) and 60°C Tjb (so we use Theta-jb). 1.5 to 2W with the stacked memory limit (including the memory Tj max of 85°C). 3W without memories (a 20°C advantage to play with, assuming 105°C max Tj for the SOC). NB: This is an issue we need to understand a lot better.
  6. What is DVM? Why does the slide say 3 masters and 2 slaves (it looks like the other way around)?
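
The power figures in note 5 are consistent with the standard steady-state relation P = (Tj,max − Tref) / θ. The Python sketch below works the two cases; reading the note's "60C Tjb" as a 60°C board reference temperature is an assumption, and `sustained_power_w` is a hypothetical helper name.

```python
def sustained_power_w(tj_max_c, t_ref_c, theta_c_per_w):
    """Steady-state power that keeps the junction at or below tj_max_c."""
    return (tj_max_c - t_ref_c) / theta_c_per_w


THETA_JB = 15.0  # degC/W junction-to-board resistance, from the note
T_BOARD = 60.0   # degC, assumed to be the board reference temperature

# Stacked memory limits the junction to 85 degC
with_stacked_memory = sustained_power_w(85.0, T_BOARD, THETA_JB)
# Without stacked memory the SOC limit of 105 degC applies
without_stacked_memory = sustained_power_w(105.0, T_BOARD, THETA_JB)

print(f"with stacked memory:    {with_stacked_memory:.2f} W")
print(f"without stacked memory: {without_stacked_memory:.2f} W")
```

Under these assumptions the first case lands at about 1.67 W, inside the 1.5 to 2W range quoted on slide 11, and the second at 3 W.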