FUSION APU & TRENDS/ CHALLENGES IN FUTURE SoC DESIGN
1. FUSION APU AND TRENDS/CHALLENGES IN FUTURE SOC (PROCESSOR) DESIGN
Pankaj Singh,
Acknowledgement:
Denis Foley, Sr. Fellow, AMD
9th International SoC Conference
2nd & 3rd November 2011
2. 2 | 9th Intl. SoC Conference| Nov 2nd,3rd, 2011
TODAY’S TOPICS
Trends:
– Three Eras of Processor Performance
– Evolution of Heterogeneous Computing
FSA and Open Standard:
– Why Fusion?
– Open Standard, OpenCL
Power, Performance
High Speed, Scalable Interconnect: NoC’s
3-D Stacking
SoC Trends & Challenges
– Verification Effort
– IP Integration
– TLM, RTL Co-simulation challenges.
3.
TRENDS: THREE ERAS OF PROCESSOR PERFORMANCE
Single-Core Era
[Chart: Single-thread Performance vs. Time; marker: "we are here"; curve ends in "?"]
Enabled by:
– Moore’s Law
– Voltage Scaling
– MicroArchitecture
Constrained by:
– Power
– Complexity
Multi-Core Era
[Chart: Throughput Performance vs. Time (# of Processors); marker: "we are here"]
Enabled by:
– Moore’s Law
– Desire for Throughput
– 20 years of SMP arch
Constrained by:
– Power
– Parallel SW availability
– Scalability
Heterogeneous Systems Era
[Chart: Targeted Application Performance vs. Time (Data-parallel exploitation); marker: "we are here"]
Enabled by:
– Moore’s Law
– Abundant data parallelism
– Power efficient GPUs
Currently constrained by:
– Programming models
– Communication overheads
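The multi-core constraints listed above ("Parallel SW availability", "Scalability") can be illustrated with Amdahl's law. The slide does not name this law; it is used here only as a well-known model. A minimal sketch, assuming a hypothetical workload whose parallel fraction is fixed at 90% (the function name and the 90% figure are illustrative assumptions):

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup under Amdahl's law: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workload: 90% of the run time is parallelizable.
    for n in (1, 2, 4, 8, 16, 64):
        print(f"{n:3d} cores -> {amdahl_speedup(0.9, n):.2f}x")
```

With a 10% serial portion the speedup can never exceed 10x regardless of core count, which is one way to read the "Scalability" constraint.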
4.
TRENDS: EVOLUTION OF HETEROGENEOUS COMPUTING
[Chart axis: Architecture Maturity & Programmer Accessibility, from Poor to Excellent]
Proprietary Drivers Era (2002 - 2008)
Graphics & Proprietary Driver-based APIs
– “Adventurous” programmers
– Exploit early programmable “shader cores” in the GPU
– Make your program look like “graphics” to the GPU
– CUDA™, Brook+, etc
Standards Drivers Era (2009 - 2011)
OpenCL™, DirectCompute Driver-based APIs
– Expert programmers
– C and C++ subsets
– Compute centric APIs, data types
– Multiple address spaces with explicit data movement
– Specialized work queue based structures
– Kernel mode dispatch
Architected Era (2012 - 2020)
Fusion™ System Architecture GPU Peer Processor
– Mainstream programmers
– Full C++
– GPU as a co-processor
– Unified coherent address space
– Task parallel runtimes
– Nested Data Parallel programs
– User mode dispatch
– Pre-emption and context switching
More up-to-date information on FSA:
http://developer.amd.com/afds/pages/keynote.aspx#/Dev_AFDS_Reb_2
5.
FSA & OPEN STANDARD: ENTER FUSION
[Block diagram: Dual Core CPU, Northbridge, DirectX® 11 GPU]
FUSION APU (Accelerated Processing Unit)
Heterogeneous compute engine combining x86 compute and parallel processing capabilities of the GPU on a single die
6.
FSA & OPEN STANDARD: WHY FUSION?
Integrating CPUs, Northbridge and GPU enables:
– Unified Memory
– High-bandwidth, low latency access by GPU
– Saves on interface power and PHY area
– Shared Power Control and TDP envelope
Potential bandwidth bottleneck
Relatively long memory latency
7.
COMMITTED TO OPEN STANDARDS
AMD drives open and de-facto standards
– Compete on the best implementation
Open standards are the basis for large ecosystems
Open standards always win over time
– SW developers want their applications to run on multiple platforms from multiple hardware vendors
DirectX®
8.
OPENCL™ AND FSA
FSA is an optimized platform architecture for OpenCL™
– Not an alternative to OpenCL™
OpenCL™ on FSA will benefit from
– Avoidance of wasteful copies
– Low latency dispatch
– Improved memory model
– Shared pointers
FSA also exposes a lower level programming interface, for those that want the ultimate in control and performance
Optimized libraries may choose the lower level interface
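The benefits listed above ("Avoidance of wasteful copies", "Low latency dispatch") can be pictured with a toy cost model. This is not an FSA or OpenCL™ API; every name and number below is a hypothetical assumption chosen only to show how copy time and dispatch latency add to kernel time when memory is not shared.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Offload:
    """Hypothetical description of one GPU offload."""
    bytes_in: int       # input data size in bytes
    bytes_out: int      # result data size in bytes
    kernel_s: float     # time spent in the kernel itself, seconds
    dispatch_s: float   # launch/dispatch latency, seconds


def total_time(job: Offload, copy_bw_bytes_per_s: Optional[float]) -> float:
    """Wall time of one offload.

    copy_bw_bytes_per_s=None models a shared address space (no copies);
    otherwise inputs and outputs are copied at the given bandwidth.
    """
    copy_s = 0.0
    if copy_bw_bytes_per_s is not None:
        copy_s = (job.bytes_in + job.bytes_out) / copy_bw_bytes_per_s
    return job.dispatch_s + copy_s + job.kernel_s


if __name__ == "__main__":
    # Assumed values: 64 MiB in, 64 MiB out, 5 ms kernel, 0.1 ms dispatch.
    job = Offload(64 * 2**20, 64 * 2**20, kernel_s=0.005, dispatch_s=0.0001)
    print("with copies (8 GB/s assumed):", total_time(job, 8e9))
    print("shared memory (no copies):   ", total_time(job, None))
```

In this assumed scenario the copies cost more than the kernel itself, which is the situation the slide's "wasteful copies" bullet refers to.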
10.
POWER-THERMAL EFFECTS IN SYSTEMS ON CHIPS
[Figure labels: Local failures! Part not working]
Complex SoCs: High power density
Non-uniform power dissipation: Hotspots
Spatial gradients: Cause malfunctions
High on-chip temperatures cause malfunctions affecting reliability.
Power consumption depends on frequency
Setting frequencies to control power and temperature
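The dependence of power on frequency noted above is commonly approximated by the CMOS dynamic power relation P ≈ α·C·V²·f, which the slide does not state explicitly. A minimal sketch, assuming hypothetical operating points (frequency/voltage pairs), activity factor, and switched capacitance, of picking the fastest operating point that fits a power budget:

```python
from typing import Optional, Sequence, Tuple

OperatingPoint = Tuple[float, float]  # (frequency in Hz, voltage in V)


def dynamic_power_w(activity: float, cap_f: float, volt: float, freq_hz: float) -> float:
    """Approximate dynamic power: activity * C * V^2 * f."""
    return activity * cap_f * volt * volt * freq_hz


def fastest_point_under_budget(
    points: Sequence[OperatingPoint],
    budget_w: float,
    activity: float = 0.2,   # assumed switching activity factor
    cap_f: float = 1e-8,     # assumed switched capacitance in farads
) -> Optional[OperatingPoint]:
    """Return the highest-frequency point whose power fits the budget, or None."""
    best: Optional[OperatingPoint] = None
    for freq_hz, volt in points:
        power = dynamic_power_w(activity, cap_f, volt, freq_hz)
        if power <= budget_w and (best is None or freq_hz > best[0]):
            best = (freq_hz, volt)
    return best


if __name__ == "__main__":
    # Hypothetical operating points, not taken from any real part.
    points = [(1.0e9, 0.9), (2.0e9, 1.1), (3.0e9, 1.3)]
    print(fastest_point_under_budget(points, budget_w=5.0))
```

Because voltage usually has to rise with frequency, power grows faster than linearly in this model, which is why lowering frequency is an effective control for both power and temperature.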
11.
OPTIONS FOR POWER SAVINGS
Convergence of Performance and Low Power
– Notebook -> Netbook -> Tablet <- Smartphone
12.
PERFORMANCE AND POWER
[Chart: APU Power vs. Use Case; use cases: S3, idle, Static Screen, MM07, Media Playback, Full Compute; axes: Performance, Power]
Performance versus Power Efficiency
Power Management versus Power reduction
Performance & Thermal Design Power
14.
NOC’S: FROM BUSES TO NETWORKS
[Friedman Harel:10]
Note: This slide presents industry-specific information and does not relate to AMD NoC status.
15.
NOC CHALLENGES: CAD TOOLS
Capturing application traffic.
Which topology?
Mapping? Routes to use?
Fixing communication architecture: parameters.
Verification for correctness, performance.
Build models.
QoS under unreliable conditions.
Key to success: Automate & integrate the steps.
Mesh Topology: homogeneous systems, with regular tiles
Customized Topology: heterogeneous systems, with different cores & irregular FP
[Figure: NoC design layers, supported by CAD Tools]
– Software Services: Mapping, QoS, middleware...
– Architecture: Packeting, buffering, flow control...
– Physical Implementation: Synchronization, wires, power...
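For the mesh topology and the "Routes to use?" question above, one common deterministic choice is dimension-ordered (XY) routing. The slide does not prescribe a routing algorithm; the sketch below is an illustration only, assuming a hypothetical 2-D mesh whose tiles are addressed by (x, y) coordinates.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (x, y) position of a tile in the mesh


def xy_route(src: Coord, dst: Coord, width: int, height: int) -> List[Coord]:
    """Return the hop-by-hop path from src to dst using XY routing.

    The packet first travels along X until the column matches,
    then along Y until the row matches.
    """
    for x, y in (src, dst):
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"tile {(x, y)} is outside the {width}x{height} mesh")

    x, y = src
    path = [src]
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:          # phase 1: move along the X dimension
        x += step_x
        path.append((x, y))
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:          # phase 2: move along the Y dimension
        y += step_y
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1), width=4, height=4))
```

The path length is always the Manhattan distance plus one tile, and every packet between the same pair of tiles takes the same route, which simplifies modeling but cannot steer around congested links.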
16.
NOC CHALLENGES: BEYOND GLOBAL SYNCHRONY
A complete spectrum of approaches to system timing exists [Mullins06-07]
[Figure: timing spectrum from Synchronous (global timing assumptions) to Delay Insensitive (no timing assumptions); other labels: Less Detection; Local Clocks, Interaction with data (becoming aperiodic)]
18.
3-D STACKING
Supporting heterogeneous computing: high density, high performance, high memory bandwidth requirement.
3-D NoCs as an option
Futuristic view: Integrating bio-sensors
Note: This slide presents industry-specific information and does not relate to AMD 3-D stacking status.
20.
WHAT’S NEW IN SOC DESIGN?
Larger and more complex chips with heavy use of pre-existing cores.
Heavy use of multi core processors and DSPs.
Complex Interconnect.
Shorter time to market and Smaller design teams.
… and software.
Leads to:
– Increased verification effort: Debugging is harder.
– Integration is more difficult.
– Need for scalable and high speed interconnect.
– SW / HW co-simulation is a major issue.
– Power-Performance challenge.
– How do we treat the system software?
21.
VERIFICATION EFFORT
Debugging
– Seamless debug across hardware and software (especially SW)
Testbench Development:
– Several methodologies: VMM, OVM, UVM.
New developments [Unified strategy]
– UCIS, UVM TLM 2.0
– Coverage trend: address gaps in VHDL, SystemC coverage
22.
VERIFICATION EFFORT
Creating/Running Testcase:
– Directed & Random
– Run time improvement: Save-restore.
Verification cycles per second instead of cycles per second: configuring environment to dynamically select relevant design/core.
Alternate options
23.
VERIFICATION EFFORT
Alternate Options
Emulation Focus Areas:
1. Tests/regressions with long run time
2. Corner case bugs that may escape traditional verification
3. Replicating system level scenarios
Ongoing Initiatives/Need:
1. Seamless support for assertions.
2. Improve portability between Simulation & Emulation
3. Common model from TLM-HDL-Emulation
24.
IP INTEGRATION CHALLENGE
Integration of IP:
– Multiple IPs, various configurations, design languages
– IPs to be in sync: macros, libraries.
– Complexity increases with mixed language designs
[Figure: mixed-language design (SystemC, SystemVerilog, Verilog, VHDL) driven by:]
– Unique Strengths of Languages
– Diversity of Design Teams
– Importing Existing IP
– Legacy Testbench Environment
25.
IP INTEGRATION CHALLENGE: COMPARISON OF CHOICES
Criterion               Direct Instantiation   SV Bind Construct   SystemC Control/Observe   SCV-Connect()   SC-DPI
Source Code Available   Yes                    Yes                 Yes                       Yes             Yes
One IP Compiled         Yes                    Yes                 Yes                       Yes             Yes
Both IP Compiled        No                     Yes                 No                        No              No
Performance             ++++ (3)               +++ (2)             + (1)                     + (1)           +++++ (4)
Delta Delay             Yes                    Yes                 No                        No              No
Languages Supported     SV, SC, VHDL           SV, SC, VHDL        SC + SV/VHDL              SC + SV/VHDL    SC + SV
Gap: No standardized automated methodology for integration.
Recommended Approach:
• Understand IP blocks: language, source code availability.
• Understand connection: 1-1, distributed, method port
• Option for optimized solution to quickly build a system
26.
IP INTEGRATION CHALLENGE: GAPS WITH ANALOG IP INTEGRATION IN SOC
Table 1. Gaps with Analog IP Integration in SoC
Gap: Root Cause
Testchip setup:
– Testchip scenario is different
– Tester used for testchip differs
Inbuilt debug:
– Incomplete inbuilt SoC test/debug capability or derisk option for basic functionality such as PLL clock
IP I/F verification:
– Incomplete test setup
Review process:
– No common detailed review process between IP and SoC team. Incorrect assumption based on past analog IP working silicon
IP Modelling:
– Mismatch in version between IP simulation model and SPICE netlist
– Limitations of behavioral model to replicate actual analog IP functionality
– Timing issue
– DFT issue
EDA tools:
– Gaps in analog and digital simulation environment
27.
IP INTEGRATION CHALLENGE
Verification Environment Bring-up
– Automated assertions for early checks.
– Review forces, tie-offs and relevant checkers from IP to SoC.
– Bottleneck for SoC team to get started with verification: option to use a fake model for initial bring-up; usage of system model.
– Super Block concept: pre-verified IP blocks at similar frequency & interface.
Requirement: [Figure: IP Block1 and IP Block2, hookup using ICU; labels: Minimum Manual Effort, No BUGS!]
Current solution: In-house methodology and process. No clear solution from EDA vendors.
28.
TLM, RTL Co-simulation
Traditional use of system level models: Architecture profiling & Performance Analysis
Increasing demand for co-simulation: Tradeoff between Accuracy and Performance.
Open Challenges
– Different levels of abstraction.
– Need for improvement in integration methodology and test bench development
– Seamless debug and coverage methodology.
Using System Level model for HDL generation
– Legacy system model not written with conversion in mind.
– Current limitation: Incomplete translation.
– Lack of reliable Equivalence Check tool.
Need: Merge top-down (SystemC) and bottom-up (SystemVerilog) methodology/flow.
Gaps/Work to do: How to do power analysis
30.
REFERENCES
[1] Wilson Research Group-MGC study blog, 2011.
[2] AMD Coolchip2011 presentation. Denis Foley, AMD Sr. Fellow.
[3] Fusion Processors and HPC-2011. Chuck Moore, AMD Corporate Fellow & Technology Group CTO.
[4] AMD Fusion Developer Summit 2011. Phil Rogers, AMD Corporate Fellow.
[5] Fully Asynchronous framework for GALS network on chip. Friedman H.
[6] Future of EE, NoC’s presentation. Dr. Srinivasan Murali.
[7] Analog IP integration in SoC, IP reuse’09. Mixed language IP integration, DVCon 2010. Extending Functional coverage to SystemC, VHDL-IP’10. Pankaj S.
31.
GLOSSARY
GPU: Graphics Processing Unit
APU: Accelerated Processing Unit
OpenCL: Open Computing Language
TDP: Thermal Design Power, a measure of a design infrastructure’s ability to cool a device
NoC: Network on Chip
TLM: Transaction Level Modeling
Turbo Core: AMD boost mechanism
QoS: Quality of Service
UVM: Universal Verification Methodology
UCIS: Unified Coverage Interoperability Standard