OpenPOWER Webinar


  1. POWER9 Features and Strategies for Improving Application Performance on POWER9 with IBM XL and Open Source Compilers
     Archana Ravindar, LLVM Compiler Performance (POWER Systems Performance), ISDL
     aravind5@in.ibm.com
     https://in.linkedin.com/in/archana-ravindar-0259625b
  2. Scope of the Presentation
     • Review POWER9 processor features
     • Outline common bottlenecks caused by certain program characteristics
     • How to identify these issues using tools on POWER9 Linux
     • Which compiler options can be used to reduce the impact of these characteristics
     • How we can code programs so that such situations do not arise
     • Platform: POWER Linux
     • Compilers: XL, GCC wherever applicable
     • Performance tools: perf
  3. POWER Processor Technology Roadmap
     • POWER7 (45 nm, 1H10, Enterprise): 8 cores, SMT4, eDRAM L3 cache
     • POWER7+ (32 nm, 2H12, Enterprise): 2.5x larger L3 cache, on-die acceleration, zero-power core idle state
     • POWER8 Family (22 nm, 1H14 – 2H16, Enterprise & Big Data Optimized): up to 12 cores, SMT8, CAPI acceleration, high bandwidth GPU attach
     • POWER9 Family (14 nm, 2H17 – 2H18+): built for the Cognitive Era; enhanced core and chip architecture optimized for emerging workloads; processor family with scale-up and scale-out optimized silicon; premier platform for accelerated computing
  4. POWER9 Family: Deep Workload Optimizations
     • Emerging analytics, AI, cognitive: new core for stronger thread performance; delivers 2x compute resource per socket; built for acceleration via OpenPOWER solution enablement
     • Technical / HPC: highest bandwidth GPU attach; advanced GPU/CPU interaction and memory sharing; high bandwidth direct attach memory
     • Cloud / HSDC: power/packaging/cost optimizations for a range of platforms; superior virtualization features (security, power management, QoS, interrupt); state of the art IO technology for network and storage performance
     • Enterprise: large, flat, scale-up systems; buffered memory for maximum capacity; leading RAS; improved caching (DB2 BLU)
  5. POWER9 Core Execution Slice Microarchitecture
     The re-factored core provides improved efficiency and workload alignment: modular 64b execution slices combine into 128b super-slices, which build the POWER9 SMT4 and SMT8 cores (contrast with the POWER8 SMT8 core)
     • Enhanced pipeline efficiency with modular execution and intelligent pipeline control
     • Increased pipeline utilization with symmetric data-type engines: fixed, float, 128b, SIMD
     • Shared compute resource optimizes data-type interchange
  6. POWER9 Core Pipeline Efficiency
     Shorter pipelines with reduced disruption
     • Improved application performance for modern codes: fetch-to-compute shortened by 5 cycles; advanced branch prediction
     • Higher performance and pipeline utilization through improved instruction management: removed instruction grouping and reduced cracking; complete up to 128 instructions per cycle (64 on the SMT4 core)
     • Reduced latency and improved scalability through local pipe control of load/store operations: improved hazard avoidance; local recycles reduce hazard disruption; improved lock management
  7. POWER ISA v3.0: New Instruction Set Architecture Implemented on POWER9
     • Broader data type support: 128-bit IEEE 754 quad-precision float (full width quad-precision for financial and security applications); expanded BCD and 128b decimal integer (for database and native analytics); half-precision float conversion (optimized for accelerator bandwidth and data exchange)
     • Support for emerging algorithms: enhanced arithmetic and SIMD; random number generation instruction
     • Accelerate emerging workloads: memory atomics (for high scale data-centric applications); hardware assisted garbage collection (optimizes response time of interpretive languages)
     • Cloud optimization: enhanced translation architecture (optimized for Linux); new interrupt architecture (automated partition routing for extreme virtualization); enhanced accelerator virtualization; hardware enforced trusted execution
     • Energy and frequency management: POWER9 workload optimized frequency manages energy between threads and cores with reduced wakeup latency
  8. Acceleration Super Highway
     • 5.6x more data throughput vs. PCIe Gen3 with NVIDIA NVLink optimization to the core
     • 2x bandwidth with PCIe Gen4 vs. PCIe Gen3
     • Access up to 2TB of system memory delivered with coherence, only on POWER
     • Superior data transfer to multiple devices: 25G links to OpenCAPI and GPU devices
     • GPU-CPU and GPU-GPU speed-up
  9. Scope of the Compiler
     • The compiler is an important layer in the system stack and is crucial for application performance
     • The compiler is intimately aware of the processor design; its functionality is implemented with the latencies of the hardware units and the movement of instructions through the pipe in mind
     • The compiler emits the appropriate ISA depending on which architecture a program is compiled for
     • Based on the architecture, scheduling is done to ensure a smooth flow of instructions through the pipe
     • IBM XL is a proprietary compiler that has pioneered several optimization innovations over the past three decades
     • Increasingly, IBM has embraced open source compilers such as GCC and LLVM to leverage community participation and innovation
     • This presentation focuses on how we can leverage IBM XL and open source compilers to obtain optimum performance on POWER9
  10. Tools Used in the Discussion
     • Compilers
       – IBM proprietary compilers: xlC/xlc/xlf
         xlc -O[n] program.c -o program  (n ranges from 0 to 5)
         Some common options: -qhot (array intensive programs), -qtune=pwr9, -qsimd (enable SIMD)
         Profile directed feedback: -qpdf1, -qpdf2
       – Open source compilers: GCC, LLVM
         -O[n] (n ranges from 0 to 3), -Ofast
         Common options: -march=power9
         Profile directed feedback: -fprofile-generate, -fprofile-use
     • perf tool
       – To record hotspots / profile an application:
         perf record -e r<code> ./binary args > out  (produces perf.data)
         perf report  (opens the profile report stored in perf.data)
       – To measure hardware events:
         perf stat -e r<code> ./binary args > out
       – For more details, refer to the perf manpage
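The perf workflow above can be tried end-to-end on any small program. Below is a minimal, hypothetical C workload (the file name hotloop.c, the generic event names, and the deliberately branchy loop body are all illustrative, not from the slides) whose hot loop shows up clearly in perf report:

```c
#include <stdio.h>

/* Hypothetical toy workload for exercising the perf workflow.
   Build and profile, e.g.:
     gcc -O2 hotloop.c -o hotloop
     perf record ./hotloop          # writes perf.data
     perf report                    # interactive hotspot view
     perf stat -e cycles,branch-misses ./hotloop
*/
long hot_loop(long n) {
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += (i % 3 == 0) ? i : -i;   /* branchy body: shows up as a hotspot */
    return sum;
}
```

On a POWER9 box the generic events above can be replaced by the raw r&lt;code&gt; events listed in the references slide.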
  11. The processor can be thought of as containing two components: a front end and a back end
     • The front end ensures a smooth supply of instructions to the back end
     • The back end is concerned only with the execution of instructions
     • Code that has too many branches can cause the processor to fetch more instructions than required and hurt performance
  12. Branches
     • Branches are predicted well in advance because resolving the condition takes time; a wrong prediction introduces a bubble in the pipeline that slows down execution
     • POWER9 has an advanced branch predictor that uses complex structures to track context-based branch histories and does a very good job of predicting branches accurately; however, applications coded in a complex way can still suffer high misprediction rates
       – Counters to detect mispredictions: PM_BR_MPRED*, PM_FLUSH_BR_MPRED
       – Use perf stat -e r<code> ./program arguments > out to collect the counters
     • Function calls also cause branches; such branches affect instruction cache locality and increase instruction cache misses
       – Counter to detect this: PM_L1_ICACHE_MISS
     • Branches within loops hinder vectorization/SIMD opportunities
  13. Guidelines to Reduce Branches
     • Options to reduce loop/call branches
       – #pragma unroll(N) or (XL) -qunroll: unroll loops (GCC/LLVM: -funroll-loops); reduces loop branches
       – (XL) -qinline=auto:level=<N> (N = 1..10): inline routines (reduces function call jump/return); the corresponding GCC/LLVM option is -finline-functions
     • Loop versioning: keep a slow version of a loop (with branches) plus a fast version (without); usually done automatically by compilers at higher optimization levels
     • Provide hints in source code about the expected values of expressions appearing in branch conditions: long __builtin_expect(long expression, long value); hints whether a branch is likely to be taken
     • If-conversion: remove simple branches wherever possible by recoding patterns, e.g. replace if (val != 0) a = a + val; with a += val; and replace if (val == 0) a = a + 1; with a += !val;
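The if-conversion patterns and the __builtin_expect hint above can be written out explicitly. A minimal sketch (function names are illustrative, not from the deck):

```c
/* Branchless forms of the slide's if-conversion examples, plus a
   __builtin_expect hint. Function names are illustrative. */

long add_if_nonzero(long a, long val) {
    /* if (val != 0) a = a + val;  -- adding 0 is a no-op, so drop the test */
    return a + val;
}

long add_one_if_zero(long a, long val) {
    /* if (val == 0) a = a + 1;  -- !val is 1 exactly when val == 0 */
    return a + !val;
}

long checked_div(long a, long b) {
    /* Hint that b == 0 is the rare case; the compiler places the
       division on the hot fall-through path */
    if (__builtin_expect(b == 0, 0))
        return 0;
    return a / b;
}
```

Both rewrites compute the same result as the branchy originals, so the compiler (or the programmer) can substitute them freely.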
  14. Register Spills
     • In a RISC architecture, instructions predominantly operate on registers
       – Load and store instructions transfer data between memory and registers
     • When the number of live variables exceeds the number of available registers, a spill is performed
     • 1 spill = 1 store + 1 load
     • Spilling hot variables can hurt performance
       – Spills can cause load-hit-stores: a store followed by a load to the same address, which may delay the pipe depending on the separating distance
       – Spills increase path length and add address arithmetic instructions
       – Unnecessary reads/writes to memory
     • Issues due to spills show up in the following counters: PM_LSU_FIN, PM_LSU_FLUSH, PM_LSU_REJECT_LHS, PM_INST_CMPL, PM_FXU_FIN
  15. Guidelines to Reduce Spills
     • Limit extensive unrolling/inlining that can create long live ranges of variables
       – Best to let the compiler do the inlining using its own heuristics
     • The XL compiler option -qcompact can help
     • Programs that mix operand modes extensively (signed, unsigned, etc.) use up extra registers for conversions
     • Use other register resources, such as SIMD registers, where applicable; use vectorization wherever applicable, or code so that the compiler vectorizes automatically
     • Use special POWER ISA instructions such as andc (logical AND with complement) and orc (logical OR with complement), which combine multiple operations in a single instruction, saving a register; compilers usually generate this ISA when -march=power9 / -qarch=pwr9 is used
     • Example, computing R3 = R1 & !R2:
       – Without andc (two instructions, one extra register): R4 = not(R2); R3 = R1 and R4
       – With andc (one instruction): R3 = R1 andc R2
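The andc example above maps directly to plain C: writing r1 & ~r2 lets the compiler fold the complement and the AND into a single andc on POWER, avoiding the temporary register a separate NOT would need. A small sketch (function name is illustrative):

```c
#include <stdint.h>

/* AND-with-complement expressed in C. On POWER, compilers typically
   emit a single andc instruction for this pattern instead of a
   separate not followed by and, saving one register. */
uint64_t and_complement(uint64_t r1, uint64_t r2) {
    return r1 & ~r2;   /* one andc vs. not + and */
}
```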
  16. Memory Unit
     • Memory is organized in a hierarchy: the L1 cache is the closest memory to the processor and the fastest, followed by L2 and L3, up to main memory
     • Main memory is the most distant from the processor and the slowest
     • The data cache stores data; the instruction cache stores instructions
     • Data cache misses can stall load instructions in the pipeline, with a cascading effect on all instructions that depend on them
     • Counters: PM_LD_MISS_L1, PM_CMPLU_STALL_DCACHE_MISS, PM_ST_MISS_L1, PM_CMPLU_STALL_DMISS_L2L3, PM_CMPLU_STALL_DMISS_LMEM, etc.
     • Approximate latencies: L1 cache 3 cycles, L2 cache 15.5 cycles, L3 cache 35.5 cycles, memory 74.5 ns
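One common source of the data cache misses described above is traversal order. A sketch (function names and the 64x64 size are illustrative) contrasting a stride-1, cache-friendly walk with a strided one; both return the same sum, but on large matrices the second incurs far more L1 misses:

```c
#define N 64

/* Row-major traversal touches consecutive addresses, so every cache
   line fetched into L1 is fully used before eviction. */
double sum_row_major(double m[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];          /* stride-1: cache friendly */
    return s;
}

/* Column-major traversal strides by N*sizeof(double) per access and
   uses only one element of each fetched line on large matrices. */
double sum_col_major(double m[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];          /* stride-N: cache hostile */
    return s;
}
```

The PM_LD_MISS_L1 counter from the slide is a direct way to see the difference between the two versions on a large enough matrix.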
  17. Techniques to Optimize Memory Performance
     • Reduce memory footprint wherever possible
       – If your program declares enums, -qenum=small allocates just one byte per enum vs. the 4 bytes allocated by default
       – Replace bytemaps (1 byte to store a '0' or a '1') with bitmaps wherever possible
     • Hardware prefetching
       – Controlled by DSCR settings: ppc64_cpu --dscr=<n>
       – Common DSCR configurations:
         0 (all default values)
         0x1D7 (achieve the most aggressive depth, most quickly; enable stride-N prefetch)
         1 (no prefetch)
       – The POWER8 tuning guide has a detailed description of DSCR settings
     • Software prefetching
       – Programmer-inserted prefetch instructions: __dcbt, __dcbtst
       – Prefetch parameters can be tuned: -qprefetch=aggressive:dscr=<value>
       – Available GCC prefetch options: -fprefetch-loop-arrays / -fno-prefetch-loop-arrays
       – To control prefetching explicitly via software, turn off hardware prefetching with the ppc64_cpu --dscr command (requires root privileges)
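The software-prefetch route can also use GCC's portable __builtin_prefetch. A hedged sketch: the function name is illustrative, and the prefetch distance of 8 elements is a guess that would need tuning per platform and stride:

```c
#include <stddef.h>

/* Software prefetch sketch: while processing element i, hint the
   element 8 slots ahead so it is (ideally) in cache by the time the
   loop reaches it. The distance 8 is an assumption, not a tuned value. */
long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], 0, 3);  /* read access, keep in cache */
        s += a[i];
    }
    return s;
}
```

For regular stride-1 loops like this one the POWER9 hardware prefetcher usually wins; explicit prefetch pays off mainly for irregular or pointer-chasing access patterns.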
  18. Compiler flags at a glance: XL vs. GCC/LLVM equivalents, source-level alternatives, benefits, and drawbacks
     • Unrolling: XL -qunroll; GCC/LLVM -funroll-loops; in source: #pragma unroll(N). Unrolls loops and increases scheduling opportunities for the compiler; increases register pressure.
     • Inlining: XL -qinline=auto:level=N; GCC/LLVM -finline-functions; in source: inline-always attribute or manual inlining. Increases scheduling opportunities and reduces branches and loads/stores; increases register pressure and code size.
     • Enum small: XL -qenum=small; GCC -fshort-enums; in source: manual typedef. Reduces memory footprint; can cause alignment issues.
     • isel instructions: GCC -misel; in source: the ?: operator generates an isel instruction instead of a branch, reducing pressure on the branch predictor unit. The latency of isel is a bit higher; use it when branches are not easily predictable.
     • General tuning: XL -qarch=pwr9, -qtune=pwr9; GCC/LLVM -mcpu=power9, -mtune=power9. Turns on platform-specific tuning such as ISA selection and scheduling.
     • 64-bit compilation: XL -q64; GCC/LLVM -m64.
     • Prefetching: XL -qprefetch[=aggressive]; GCC -fprefetch-loop-arrays; in source: __dcbt/__dcbtst, __builtin_prefetch. Reduces cache misses; can increase memory traffic, particularly if prefetched values are not used.
     • Link time optimization: XL -qipo; GCC/LLVM -flto, -flto=thin. Enables interprocedural optimizations; can increase overall compilation time.
     • Profile directed feedback: XL -qpdf1, -qpdf2; GCC/LLVM -fprofile-generate and -fprofile-use (LLVM has an intermediate llvm-profdata step). Enables hot path optimizations; requires a training run.
  19. Hands-On Reference
  20. Summary
     • Today we talked about:
       – Various performance issues that can occur in an application on POWER9 Linux
       – How to identify them
       – What we can do to improve performance during compilation
       – What we can do to improve performance while coding the application itself
     • POWER9 has a comprehensive set of hardware counters that enables analysts to understand application performance and get to the bottlenecks quickly
     • IBM XL compilers, and equivalently open source compilers such as GCC and LLVM, offer a diverse set of options tailored to different needs to get the required performance
  21. References
     • POWER9 User Manual: https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual
     • IBM XL Compiler reference: http://www-01.ibm.com/support/docview.wss?uid=swg27036675
     • POWER9 raw event codes (install libpfm): https://github.com/torvalds/linux/blob/master/arch/powerpc/perf/power9-events-list.h
     • GCC 9.2 manual: https://devdocs.io/gcc~9/
     • LLVM manual: https://llvm.org/docs/CommandGuide/

Editor's notes

  • Memory enhancements, advances in graphic processing units (GPU), interconnects, and bandwidth all provide building blocks for a better performing AI architecture. In fact, the POWER9 AC922 marks what will become an industry requirement: welcome to the “off-chip” era (where advanced accelerators like GPUs and FPGAs are engineered to drive modern workloads) and the sunset of the “totally on-chip” era where processing is integrated on a single chip.

    POWER9 is the first commercial architecture loaded with NVIDIA’s next generation NVLink (AC922’s optimization isn’t just GPU to GPU like other commercial platforms, it also included GPU to CPU where it’s needed the most), OpenCAPI, and PCI-Express 4.0. Think of these technologies as a giant hose to transfer data.

    This slide shows a bit of a deeper look into what we are talking about when we say “Cutting Edge” and built for Enterprise AI.

    The AC922 combined with NVIDIA Next Generation NVLink technology provides 5.6x more data throughput when compared to PCIe Gen3. And since this server comes with PCIe Gen4, it should be noted that Gen4 delivers 2x the throughput when compared to PCIe Gen3’s bandwidth.

    Finally, the server delivers simplified execution for Enterprise AI with up to 2 TB of coherent memory for use in complex model building.