The document discusses productive parallel programming for Intel Xeon Phi coprocessors. It begins by noting the continued need for increased computing power and discusses how Moore's law and hardware/software innovation can help achieve exascale computing. It then provides examples showing the performance benefits of Intel Xeon Phi coprocessors for applications like weather prediction and finite element analysis, with speedups of up to 2x. Reliability challenges at extreme scales are also discussed.
1. Productive Parallel Programming for Intel® Xeon Phi™ Coprocessors
Bill Magro
Director and Chief Technologist, Technical Computing Software
Intel Software & Services Group
2. Still an Insatiable Need For Computing
[Chart: projected peak performance of top systems, from 100 MFlops (1993) through 1 ZFlops (forecast to ~2029); application drivers annotated along the curve include Weather Prediction, Genomics Research, and Medical Imaging]
PetaFlop systems of today are the client and handheld systems 10 years later
Source: www.top500.org
4. Some believe…
• Virtually none of today’s hardware or software technologies can be improved or modified to reach exascale
• A complete revolution is needed
We believe…
• Evolution of today’s technologies + hardware and software innovation can get us there
• A systems approach – with co-design – is critical
5. Moore’s Law: Alive and Well
[Chart: Intel process technology cadence, 2003–2011: 90 nm (SiGe strained silicon invented), 65 nm (2nd-gen SiGe strained silicon), 45 nm (high-k metal gate invented, gate-last), 32 nm (2nd-gen gate-last high-k metal gate), 22 nm (first to implement tri-gate, a revolutionary leap in process technology)]
22 nm tri-gate delivers a 37% performance gain at low voltage* and a >50% active power reduction at constant performance*
The foundation for all computing… including Exa-Scale
Source: Intel
*Compared to Intel 32nm Technology
7. Myth: explicitly managed locality is in and caches are out!
Reality: Caches remain the path to high performance and efficiency
[Bar chart: relative bandwidth and relative bandwidth/watt (scale 0–50) compared across Memory BW, L2 Cache BW, and L1 Cache BW]
8. #1 Green 500 Cluster
WORLD RECORD! “Beacon” at NICS: Intel® Xeon® Processor + Intel® Xeon Phi™ Coprocessor Cluster
Most power efficient on the list: 2.449 GigaFLOPS/Watt at 70.1% efficiency
Other brands and names are the property of their respective owners.
Source: www.green500.org as of Nov 2012
9. Reaching Exascale Power Goals Requires
Architectural & Systems Focus
• Memory (2x-5x)
– New memory interfaces (optimized memory control and xfer)
– Extend DRAM with non-volatile memory
• Processor (10x-20x)
– Reducing data movement (functional reorganization, > 20x)
– Domain/Core power gating and aggressive voltage scaling
• Interconnect (2x-5x)
– More interconnect on package
– Replace long haul copper with integrated optics
• Data Center Energy Efficiencies (10%-20%)
– Higher operating temperature tolerance
– 480V to the rack and free air/water cooling efficiencies
10. Reliability of these machines requires a systems approach
Extreme Parallelism
[Chart: top system concurrency trend, growing from ~1E+02 to ~1E+07 concurrent threads over ’93–’09]
MTTI Measured in Minutes
[Chart: MTTI (hours, 1000 down to 0.1) versus chip count (to 1E+08), DRAM count (to 1E+07), and socket count (to 1E+05), 2004–2016, assuming 0.1 failures per socket per year; MTTI crosses below the time to save a global checkpoint]
• Transparent process migration
• Holistic fault detection and recovery
• Reliable end-to-end communications
• Integrated memory in storage layer for fast checkpoint and workflow
• N+1 scale reliable architectures consistent with stacked memory constraints
• System-wide power management and dynamic optimization
• Must design for system-level debug capability
Reliability is the primary force driving next generation designs
Source: Exascale Computing Study: Technology Challenges in Achieving Exascale Systems (2008)
12. Architecture for Discovery
Intel® Xeon® processor
• Ground-breaking real-world application performance
• Industry-leading energy efficiency
• Meets broad set of HPC challenges
Intel® Xeon Phi™ product family
• Based on Intel® Many Integrated Core (MIC) architecture
• Leading performance for highly parallel workloads
• Common Intel Xeon programming model
• Productive solution for highly-parallel computing
13. Intel® Xeon® E5-2600 processors
• Up to 73% performance boost vs. prior gen1 on HPC suite of applications
• Over 2X improvement on key industry benchmarks
• Up to 4 channels of DDR3 1600 memory
• Up to 8 cores, up to 20 MB cache
• Instruction sets with Intel® Advanced Vector Extensions
• Integrated PCI Express*: integrated I/O cuts latency while adding capacity & bandwidth
• Significantly reduced compute time on large, complex data
1 Over previous generation Intel® processors. Intel internal estimate. For more legal information on performance forecasts go to http://www.intel.com/performance
14. Introducing Intel® Xeon Phi™ Coprocessors
Highly-parallel Processing for Unparalleled Discovery
Groundbreaking differences:
• Up to 61 IA cores / 1.1 GHz / 244 threads
• Up to 8 GB memory with up to 352 GB/s bandwidth
• 512-bit SIMD instructions
• Linux operating system, IP addressable
• Standard programming languages and tools
Leading to groundbreaking results:
• Up to 1 TeraFlop/s double precision peak performance1
• Up to 2.2x higher memory bandwidth than on an Intel® Xeon® processor E5 family-based server2
• Up to 4x more performance per watt than with an Intel® Xeon® processor E5 family-based server3
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
For more information go to http://www.intel.com/performance. Notes 1, 2 & 3: see backup for system configuration details.
15. BIG GAINS FOR SELECT APPLICATIONS
[Chart: theoretical performance (0–8x) versus fraction parallel and % vector (0–100%); the gains come in three steps labeled Parallelize, Vectorize, and Scale to many-core]
* Theoretical acceleration using a highly-parallel Intel® Xeon Phi™ coprocessor versus a standard multi-core Intel® Xeon® processor
18. PARALLELIZING FOR HIGH PERFORMANCE
Example: SAXPY
STARTING POINT
Typical serial code running on multi-core Intel® Xeon® processors: 67.097 seconds (current performance)
STEP 1. OPTIMIZE CODE
Parallelize and vectorize the code and continue to run on multi-core Intel Xeon processors: 0.46 seconds (145X faster)
STEP 2. USE COPROCESSORS
Run all or part of the optimized code on Intel® Xeon Phi™ coprocessors: 0.197 seconds (2.3X faster than Step 1; 340X faster than the starting point)
20. PROVEN PERFORMANCE BENEFITS
Intel® Xeon Phi™ Coprocessor
• Finite Element Analysis: Sandia National Labs MiniFE2, up to 2X
• Seismic: Acceleware 8th Order Isotropic Variable Velocity1, up to 2.23X
• China Oil & Gas: Geoeast Pre-stack Time Migration3, up to 2.05X
1. 2S Xeon* vs. 1 Xeon Phi* (preproduction HW/SW; application running 100% on coprocessor unless otherwise noted)
2. 8-node cluster, each node with 2S Xeon* (comparison is cluster performance with and without 1 Xeon Phi* per node) (Hetero)
3. 2S Xeon* vs. 2S Xeon* + 2 Xeon Phi* (offload)
21. PROVEN PERFORMANCE BENEFITS
Intel® Xeon Phi™ Coprocessor
• Ray Tracing: Intel Labs Embree Ray Tracing3, 1.8X speed-up
• Physics: Jefferson Lab Lattice QCD, up to 2.7X
• Finance: Black-Scholes SP2, up to 7X; Monte Carlo SP2, up to 10.75X
1. 2S Xeon* vs. 1 Xeon Phi* (preproduction HW/SW; application running 100% on coprocessor unless otherwise noted)
2. Includes additional FLOPS from transcendental function unit
3. Intel measured, Oct. 2012
32. Intel® Xeon Phi™ Coprocessor: Game Changer for HPC
Build your applications on a known compute platform… and watch them take off sooner.
[Diagram: with restrictive special-purpose hardware, porting means complex new learning and code porting; with the Intel® Xeon Phi™ coprocessor, familiar tools & runtimes deliver unparalleled productivity]
“We ported millions of lines of code in only days and completed accurate runs. Unparalleled productivity… most of this software does not run on a GPU and never will.”
— Robert Harrison, National Institute for Computational Sciences, Oak Ridge National Laboratory
7 Harrison, Robert. “Opportunities and Challenges Posed by Exascale Computing—ORNL's Plans and Perspectives.” National Institute of Computational Sciences (NICS), 2011.
38. Programming Intel® Xeon Phi™ based Systems (MPI+Offload)
• MPI ranks on Intel® Xeon® processors (only)
• All messages into/out of Xeon processors
• Offload models used to accelerate MPI ranks
• TBB, OpenMP*, Cilk Plus, Pthreads within coprocessor
• Homogenous network of hybrid nodes
[Diagram: each node pairs a CPU with a MIC coprocessor; MPI data moves over the network between CPUs, and each CPU offloads work and data to its local MIC]
41. Programming Intel® Xeon Phi™ based Systems (MIC Native)
• MPI ranks on Intel MIC (only)
• All messages into/out of coprocessor
• TBB, OpenMP*, Cilk Plus, Pthreads used directly within MPI processes
• Programmed as homogenous network of many-core CPUs
[Diagram: MPI ranks run natively on each node's MIC; data moves over the network directly between coprocessors]
43. Programming Intel® MIC-based Systems (Symmetric)
• MPI ranks on coprocessor and Intel® Xeon® processors
• Messages to/from any core
• TBB, OpenMP*, Cilk Plus, Pthreads used directly within MPI processes
• Programmed as heterogeneous network of homogeneous nodes
[Diagram: MPI ranks run on both the CPU and the MIC in every node; data moves over the network among all ranks]
47. Keys to Productive Parallel Performance
Determine the best platform target for your application
• Intel® Xeon® processors or Intel® Xeon Phi™ coprocessors, or both
Choose the right Xeon-centric or MIC-centric model for your application
Vectorize your application
Parallelize your application
• With MPI (or other multi-process model)
• With threads (via Pthreads, TBB, Cilk Plus, OpenMP, etc.)
• Go asynchronous: overlap computation and communication
Maintain unified source code for CPU and coprocessor