The document describes the processor industry's transition to multicore architectures. It explains that, because single-core processors have run into frequency and power-consumption limits, companies are developing processors with multiple cores to improve performance. It also describes five key innovations in the Intel Core microarchitecture that improve performance and energy efficiency.
2. A New Revolution Is Here. Recalling: "Intel's new Pentium processor will revolutionize the PC industry." Reuters News | March 22, 1993
3. The Need for More Performance. Timeline 1980–2006 (first PC introduced in 1981): Windows, mouse, color monitor, Internet, multimedia, joystick, speakers, wireless, Plug'n Play, video input, PVR, multitasking, lower power consumption, mobile, gaming.
4. A Paradigm Shift: from MHz to Performance/Watt. Intel internal document, September 1999.
5. Performance through Parallelism: "Multi-Core". Normalized performance vs. the initial Intel® Pentium® 4 processor, 2000–2008+: 3x by 2004. Source: Intel.
6. Performance through Parallelism: "Multi-Core". Normalized performance vs. the initial Intel® Pentium® 4 processor, 2000–2008+: 3x by 2004 with a single core ("we are here"), with a forecast of 10x with multi-core. Source: Intel.
7. The Trend Toward Multiple Cores: Multi-Processor, Hyper-Threading, Dual-Core, Multi-Core, Many-Core.
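The single-core 3x vs. multi-core 10x projection above can be illustrated with a simple throughput model. This is a hypothetical sketch based on Amdahl's law; the parallel fraction and core counts below are illustrative, not Intel's projection data:

```python
# Sketch: why adding cores can outpace single-core scaling for
# workloads that are mostly parallel (Amdahl's law).
# All numbers are illustrative, not Intel projections.

def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Speedup over one core for a workload with the given parallel fraction."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# A highly parallel, renderer-style workload on 1, 2, and 4 cores:
for cores in (1, 2, 4):
    print(cores, round(amdahl_speedup(0.95, cores), 2))
```

The model shows why "many-core" only pays off when software exposes parallelism: the serial fraction caps the speedup no matter how many cores are added.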
8. Why Multicore? Baseline at max frequency: power 1.00x, performance 1.00x (relative single-core frequency and Vcc).
9. "Over-clocking": over-clocked (+20%) relative single-core frequency and Vcc gives ~1.73x power for ~1.13x performance (baseline 1.00x).
10. "Under-clocking": under-clocked (−20%) relative single-core frequency and Vcc gives ~0.51x power for ~0.87x performance; over-clocked (+20%) gives 1.73x power for 1.13x performance (baseline 1.00x).
11. Multi-Core and Energy Efficiency: a dual-core with each core under-clocked (−20%) delivers ~1.73x performance at ~1.02x power, versus 1.13x performance at 1.73x power for a single core over-clocked (+20%) (baseline 1.00x, relative single-core frequency and Vcc).
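The power figures on these slides follow from the rule of thumb that dynamic CPU power scales roughly as frequency times voltage squared, and that supply voltage (Vcc) must track frequency. A small sketch reproduces the slide numbers (the cubic model and the 0.87x per-core performance figure come from the slides; real silicon varies):

```python
# Sketch of the power/performance arithmetic behind the over/under-clocking
# slides. Dynamic power ~ f * V^2; if Vcc scales linearly with frequency,
# power grows with the cube of the clock ratio. Illustrative model only.

def relative_power(clock_ratio: float) -> float:
    """Power relative to baseline when Vcc scales linearly with frequency."""
    return clock_ratio ** 3  # f * V^2 with V proportional to f

print(round(relative_power(1.2), 2))  # over-clocked +20%: ~1.73x power
print(round(relative_power(0.8), 2))  # under-clocked -20%: ~0.51x power

# Two under-clocked cores: roughly baseline power, far higher throughput.
dual_core_power = 2 * relative_power(0.8)  # ~1.02x, as on the slide
dual_core_perf = 2 * 0.87                  # ~1.74x, the slide's ~1.73x
print(round(dual_core_power, 2), round(dual_core_perf, 2))
```

This is the core of the multicore argument: trading a little per-core frequency for a second core buys back far more throughput than the power it costs.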
13. A Dead End? Chart: energy per instruction (nJ/instr) vs. relative scalar performance — energy per instruction rising from Pentium (1993) to Pentium Pro (1995) to Pentium 4 (2001) to Pentium 4 (2005).
14. Breaking Through the Power Wall. Chart: energy per instruction (nJ/instr) vs. relative scalar performance — lower energy per instruction with Pentium M (2003), Pentium M (2005), and Core Duo (2006) compared with Pentium (1993), Pentium Pro (1995), and Pentium 4 (2001, 2005).
15. The Intel® Core™ Microarchitecture (2006): built on the NetBurst® microarchitecture and the mobile microarchitecture, plus innovations — Intel® Wide Dynamic Execution, Intel® Advanced Digital Media Boost, Intel® Smart Memory Access, Intel® Advanced Smart Cache, Intel® Intelligent Power Capability. *Not representative of actual die photo or relative size.
16. Intel® Core™: Intel® Wide Dynamic Execution, Intel® Advanced Digital Media Boost, Intel® Intelligent Power Capability, Intel® Smart Memory Access, Intel® Advanced Smart Cache.
17. Five Key Innovations — Intel® Wide Dynamic Execution: 4-wide, 14-stage pipeline; micro-fusion; macro-fusion. (Also: Intel® Advanced Digital Media Boost, Intel® Smart Memory Access, Intel® Advanced Smart Cache, Intel® Intelligent Power Capability.)
18. Five Key Innovations — Intel® Advanced Digital Media Boost: single-cycle 128-bit SSE. (Also: Intel® Wide Dynamic Execution, Intel® Smart Memory Access, Intel® Advanced Smart Cache, Intel® Intelligent Power Capability.)
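"Single-cycle 128-bit SSE" means one instruction operates on a full 128-bit register, i.e., four packed 32-bit floats at once. A pure-Python sketch of the packed-float data layout (this emulates an ADDPS-style packed add for illustration; it is not how SSE is actually programmed):

```python
import struct

# Sketch: a 128-bit SSE register holds four packed 32-bit floats,
# so one wide operation performs four scalar adds at once.

def pack128(a, b, c, d) -> bytes:
    """Pack four 32-bit floats into one 16-byte (128-bit) value."""
    return struct.pack("<4f", a, b, c, d)

def packed_add(x: bytes, y: bytes) -> bytes:
    """Element-wise add of two 128-bit packed-float values (ADDPS-style)."""
    xs, ys = struct.unpack("<4f", x), struct.unpack("<4f", y)
    return struct.pack("<4f", *(p + q for p, q in zip(xs, ys)))

r = packed_add(pack128(1, 2, 3, 4), pack128(10, 20, 30, 40))
print(struct.unpack("<4f", r))  # four results from one "wide" operation
```

Executing the whole 128-bit operation in a single cycle, rather than splitting it into two 64-bit halves, doubles the peak SSE throughput per clock.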
19. Five Key Innovations — Intel® Advanced Smart Cache: shared L2 cache. (Also: Intel® Wide Dynamic Execution, Intel® Advanced Digital Media Boost, Intel® Smart Memory Access, Intel® Intelligent Power Capability.)
20. Five Key Innovations — Intel® Smart Memory Access: advanced pre-fetch and memory disambiguation. (Also: Intel® Wide Dynamic Execution, Intel® Advanced Digital Media Boost, Intel® Advanced Smart Cache, Intel® Intelligent Power Capability.)
21. Five Key Innovations — Intel® Intelligent Power Capability: advanced power gating. (Also: Intel® Wide Dynamic Execution, Intel® Advanced Digital Media Boost, Intel® Smart Memory Access, Intel® Advanced Smart Cache.)
23. Pixar RenderMan Results: 7 hr 7 min on a single core with a single instruction stream vs. 1 hr 27 min multi-threaded on the latest-generation architecture — roughly 5x faster. Other names and brands may be claimed as the property of others.
24. Pixar RenderMan Results: 7 hr 7 min and 1.95 kWh at the platform on a single core with a single instruction stream, vs. 1 hr 27 min and 0.67 kWh at the platform multi-threaded on the latest-generation architecture — roughly 5x faster at about one third the energy.
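The "~5x" and "~1/3" figures follow directly from the times and energies quoted on the slide; a quick arithmetic check:

```python
# Check of the RenderMan figures above: 7 h 7 min single-threaded vs.
# 1 h 27 min multi-threaded, and 1.95 kWh vs. 0.67 kWh platform energy.

single_min = 7 * 60 + 7   # 427 minutes, single core / single thread
multi_min = 1 * 60 + 27   # 87 minutes, multi-threaded
speedup = single_min / multi_min
print(round(speedup, 1))  # ~4.9x, the slide's "~5x"

energy_ratio = 0.67 / 1.95
print(round(energy_ratio, 2))  # ~0.34, the slide's "~1/3"
```

Note that the energy saving is smaller than the speedup: the multi-core platform draws more power while running, but finishes so much sooner that total energy still drops to about a third.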
25. Performance: desktop >40%, mobile >100%. Source: Intel. Desktop based on SPECint*_rate_base2000 (2 copies), comparing the Intel® Core™2 Duo E6700 to the Intel® Pentium® D Processor 960. Mobile based on SPECfp*_rate_base2000 and SPECint*_rate_base2000, comparing the Intel® Pentium® M Processors 780 and 750 and the Intel® Core™ Duo Processor T2600 with the Intel® Core™2 Duo Processors T7600 and T5600.
A PARADIGM SHIFT: from "MHz above all" to Performance/Watt. Chart: performance/power, integer performance, and TDP power.
Note: Slide is a build. Let's take a look at the projections of compute-performance growth over time, using an average of SPECint2000 RATE and SPECfp2000 RATE (note that RATE is important to state), normalized to P4P at introduction. From 2000 to today we have seen 3x growth. Looking forward over the next four years, our projections show that if we maintained a single core, we could probably achieve another 3x growth while still fitting into the power envelopes of today's platforms. By moving in the direction of hardware-enhanced parallelism with dual-core and other multi-core designs, we see compute-throughput capability growing by 10x over the next four years while fitting into existing platforms. A few notes: HT does not show a benefit with SPEC RATE. Over the next four years, we will increase performance at a faster rate than at any other time in the history of Moore's Law.
There are a number of new capabilities that we are bringing into the new micro-architecture to address our goals of 1) delivering a new level of compute performance (relative to existing IA-32 processors), 2) ensuring scalability and optimization for each of the product lines, and 3) delivering this performance in a very power-efficient manner. The new micro-architecture will build upon key features from the NetBurst and Banias micro-architectures and also bring in a number of new and innovative features. I will touch upon a few of the elements briefly; you can expect a more thorough review at the Spring IDF as we get a little closer to introduction. The key elements of this uArch are:
Higher-Performance Out-of-Order (OOO) Execution Engine: wider (4-issue), deeper buffers, efficient 14-stage pipeline.
Wider (4-issue): a full 4-wide superscalar pipeline that can decode, execute, and retire instructions at a sustained rate of 4 instructions per clock.
Q: Are there any instruction limitations in the 4-wide pipeline? A: We will go into more details about the 4-wide pipeline as we get closer to launch.
Deeper buffers: the buffer sizes have been tuned to maximize the effective number of instructions in flight relative to the pipeline. This enables the processor to look deeper into the program flow to find instructions that can be executed in parallel.
Q: Can you give me details on the number of buffers? A: We will go into more details about the buffer depths as we get closer to launch.
14-stage efficient pipeline: a short, efficient 14-stage pipeline that contributes to higher frequency relative to Banias while delivering higher IPC.
Q: Can you tell me any more about the pipeline stages? A: We will go into more details about the pipeline stages as we get closer to launch.
Advanced Power Capability: the new uArch gains this performance without paying for it with excess power. The main philosophy here is that devices are only powered on when needed, rather than
a philosophy of turning devices off when not needed. The result is both a lower TDP and lower average power. Power gating is used extensively, ensuring that uArch devices are on only when they need to be. Additionally, many buses and arrays are split so that data required only in some modes of operation can be put in a low-power state when not needed. In fact, the new uArch implements a superset of the power states from the Banias uArch product line.
Q: Can you tell me any more about these features and how they differ from Banias/Pentium M? A: We will go into more details on the power features as we get closer to launch.
Q: Can you tell me any more about the specific average runtime power that you expect? A: We will go into more details on the average runtime power as we get closer to launch.
Multi-core Enhanced Cache System: shared and scalable L2 cache.
Shared L2: like Yonah before it, the new uArch uses a shared L2 cache that enables better performance by 1) giving memory-intensive apps the opportunity to see a larger cache when running alongside less memory-intensive apps, 2) enabling data in the L2 cache to be shared between the cores without having to go out on the FSB, lowering FSB utilization, and 3) letting a single thread running on a single core utilize the full cache in single-threaded environments.
Q: What is the access time for the L2 cache? A: We will go into more details about the caches as we get closer to launch.
Scalable: the L2 cache is very scalable to meet the individual requirements of each segment. It will support the needs of low-end mobile and desktop designs all the way to high-end MP server designs.
Q: Can you tell me the cache sizes that are enabled? A: We have the ability to scale very well to meet the individual needs of each of the targeted mobile, server, and desktop market segments.
We will go into more details about the pipeline stages as we get closer to launch.
Q: Does cache latency degrade with the larger caches? A: We will talk about specific details as we get closer to the launch of products based on the new micro-architecture.
Direct L1-to-L1 cache transfer: the design can perform high-speed, direct L1-to-L1 cache transfers when one core requests data from the other core. As a result, it does not have to go out onto the FSB, which significantly reduces the latency to get the needed data into the core.
Q: What is the access time for the L2 cache? A: We will go into more details about the caches as we get closer to launch.
Higher relative L2-to-core bandwidth: the L2-cache-to-core bandwidth has been increased to a sustained rate of one cache line every 2 cycles.
Q: What is the bandwidth of the L2 cache? A: We will go into more details about the caches as we get closer to launch.
Improved Memory Access: with the new micro-architecture, we provide a number of features that help improve the effective bandwidth from the memory subsystem.
Improved pre-fetch: the new uArch has a variety of new hardware prefetchers, both within the cores and within the L2, allowing many array and vector codes to continuously stream data from memory without suffering memory delays or requiring explicit software pre-fetch instructions.
Q: Can you tell me more about the pre-fetchers? A: We will go into more details as we get closer to launch.
Memory disambiguation: memory disambiguation lets memory operations feel more of the benefits of out-of-order execution, because loads can be decoupled from stores. With memory disambiguation, a load does not have to wait for pending stores to complete; instead, the machine intelligently speculates on whether a pending load and a pending store will conflict and, if not, lets the load occur first.
The machine then checks whether it speculated correctly; if so, it can retire the result, and if not, it re-loads and executes with the correct data.
Q: Can you tell me more about memory disambiguation? A: We will go into more details as we get closer to launch.
Q: How accurate is the memory disambiguation? A: We will talk about more details as we get closer to launch.
Macro fusion: introduces the novel concept of macro fusion, in which the commonly used instruction sequence "compare and jump" is combined into a single instruction for execution. This reduces the internal resources required for the two instructions, increases IPC, and makes it possible to retire 5 instructions with the same work as 4, for a 25% increase in efficiency.
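The speculate-check-replay flow described for memory disambiguation can be sketched as a toy model. This is purely illustrative pseudologic, not Intel's actual predictor or pipeline; the function names and the dictionary-as-memory representation are invented for the example:

```python
# Toy model of memory disambiguation: a load may speculatively execute
# before an older store when their addresses are predicted not to conflict;
# the machine checks afterwards and replays the load on a mis-speculation.

def run_load(memory, store_addr, store_value, load_addr, predict_no_conflict):
    """Return (value, replayed): the value the load finally sees, and
    whether mis-speculation forced a replay. Illustrative only."""
    if predict_no_conflict:
        speculative = memory[load_addr]    # load executes early
        memory[store_addr] = store_value   # older store completes later
        if load_addr == store_addr:        # check: did the addresses collide?
            return memory[load_addr], True # replay with the correct data
        return speculative, False          # speculation was correct
    memory[store_addr] = store_value       # conservative path: store first
    return memory[load_addr], False

mem = {0x10: 1, 0x20: 2}
print(run_load(mem, 0x10, 99, 0x20, True))  # no conflict: load kept its early value
mem = {0x10: 1, 0x20: 2}
print(run_load(mem, 0x10, 99, 0x10, True))  # conflict: replay picks up the store
```

The payoff is that in the common no-conflict case the load never waits on the store, while correctness is preserved by the replay path.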
The benchmark here is multi-threaded SPECint (it measures throughput, i.e., the amount of work completed by the CPU in a measured unit of time). These are early estimates based on pre-production silicon, compared against our best available products today (excluding the Extreme Edition).