© 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S.
and/or other countries. *Other names and brands may be claimed as the property of others.
Introduction to the Intel® Xeon Phi™ Coprocessor
Leo Borges (leonardo.borges@intel.com)
Intel - Software and Services Group
iStep-Brazil, August 2013
Introduction
High-level overview of the Intel® Xeon Phi™ platform:
Hardware and Software
Intel Xeon Phi Case Studies
Intel Xeon Phi Ecosystem
Conclusions & References
Intel in High-Performance Computing
A long-term commitment to the HPC market segment
• Large-scale clusters for test & optimization
• Tera-scale research
• Exa-scale labs
• Leading performance, energy efficient
• Platform building blocks
• Dedicated, renowned applications expertise
• Broad software tools portfolio
• Defined HPC application platform
• Many Integrated Core architecture
• Manufacturing process technologies
HPC Processor Solutions
Common Intel environment: portable code, common tools
Multi-core: Intel® Xeon® processors
• General-purpose architecture
• Leadership per-core performance
• FP/core via AVX
• Multi-core performance
• Xeon EN: general-purpose perf/watt
• Xeon EP: max perf/watt with higher memory bandwidth, frequency and QPI; ideal for HPC
• Xeon EP 4S: additional compute density
• Xeon EX: additional sockets & big memory
Many-core: Intel® Xeon Phi™ coprocessor
• Trades a “big” IA core for multiple lower-performance IA cores, resulting in higher performance for a subset of highly parallel applications
Big Gains for Selected Applications
• Most applications will still run best on multi-core Intel® Xeon® processors
• Optimizing code often delivers significant performance gains
• Highly parallel and vectorized applications, or applications that need higher memory bandwidth, will run even faster on Intel® Xeon Phi™ coprocessors
Chart labels: running existing serial software; running optimized software
Application areas:
• Medical imaging and biophysics
• Computer-aided design & manufacturing
• Climate modeling & weather prediction
• Financial analyses, trading
• Energy & oil exploration
• Digital content creation
Evaluating Your Applications for Intel® Xeon Phi™
Decision flowchart (each question has a YES and a NO branch):
• Can your workload scale to over 100 threads?
• Can your workload benefit from large vectors?
• Can your workload benefit from more memory bandwidth?
Use Intel® Xeon Phi™ coprocessors for applications that scale with:
• Threads • Vectors • Memory Bandwidth
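The three questions can be read as a small decision helper. The sketch below is only an illustration: the function name `suggest_target` is hypothetical, and the branching rule (thread scaling is required, plus a benefit from either wide vectors or memory bandwidth) is an assumption, since the slide text lists the questions but not the exact branches.

```python
def suggest_target(scales_past_100_threads: bool,
                   benefits_from_large_vectors: bool,
                   benefits_from_memory_bandwidth: bool) -> str:
    """Hypothetical helper mirroring the three questions on the slide.

    Assumption (not stated in the deck): the workload must scale to over
    100 threads AND benefit from large vectors or from memory bandwidth.
    """
    if not scales_past_100_threads:
        return "Intel Xeon processor"
    if benefits_from_large_vectors or benefits_from_memory_bandwidth:
        return "Intel Xeon Phi coprocessor"
    return "Intel Xeon processor"


# A threaded, vectorizable workload versus one that does not scale in threads.
print(suggest_target(True, True, False))
print(suggest_target(False, True, True))
```

Under a different reading of the flowchart the rule could be stricter or looser; the point is only that all three scaling dimensions are checked before a port is attempted.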
High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software
Intel® Xeon Phi™ Product Family
Based on the Intel MIC Architecture
Intel Many Integrated Core (MIC, pronounced “Mike”)
Product Family/Architecture for Highly Parallel Applications
• Based on a large number of smaller, low-power Intel Architecture cores
• 512-bit wide vector engine
• Complements the Intel Xeon processor product line
• Provides breakthrough performance for highly parallel apps
– Familiar x86 programming model
– Same source code supports both Intel Xeon processor & Intel Xeon Phi coprocessor
– Initially a coprocessor with PCI Express form factor
First products announced at SC12: code-named Knights Corner (KNC)
• Up to 61 cores, 4 threads per core
• Up to 16 GB GDDR5 memory (up to 352 GB/s)
• 225-300 W (cooling: both passive & active SKUs)
• x16 PCIe form factor (requires IA host)
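The headline figures on this slide translate directly into the parallelism an application has to expose. The short calculation below is a sketch using only the numbers quoted above; the variable names are illustrative.

```python
CORES = 61             # "up to 61 cores"
THREADS_PER_CORE = 4   # "4 threads per core"
VECTOR_BITS = 512      # "512-bit wide vector engine"

# Hardware threads the coprocessor can run at once.
hardware_threads = CORES * THREADS_PER_CORE

# Elements processed per vector instruction.
dp_lanes = VECTOR_BITS // 64   # 64-bit double-precision values
sp_lanes = VECTOR_BITS // 32   # 32-bit single-precision values

print(hardware_threads)    # 244
print(dp_lanes, sp_lanes)  # 8 16
```

This is why the evaluation questions ask about scaling to over 100 threads and about large vectors: code that uses one thread and scalar arithmetic touches only a small fraction of the device.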
Each Xeon Phi can be addressed as an individual node in the cluster
• 6 to 16 GB GDDR5 memory
Intel® Xeon Phi™ Coprocessors
3 Family: Outstanding Parallel Computing Solution
• Performance/$ leadership
• Models: 3120P, 3120A
• 6 GB GDDR5, 240 GB/s, >1 TFlops DP
5 Family: Optimized for High Density Environments
• Performance/watt leadership
• Models: 5110P, 5120D
• 8 GB GDDR5, >300 GB/s, >1 TFlops DP
7 Family: Highest Level of Features
• Performance leadership
• Models: 7120P, 7120X
• 16 GB GDDR5, 352 GB/s, >1.2 TFlops DP, Turbo
Performance Considerations
Application Characterization
Based on memory access and flops required
• Temporal/spatial locality of data
• Bandwidth requirement
Chart: workloads placed along an axis from bandwidth limited to core limited (legend: Y: math kernel; B: applications; W: segment). Workloads and segments shown:
• Stream-triad
• BLAS1 & BLAS2 (all)
• Linpack
• DGEMM (mfg & scientific)
• Sparse matrix-vector (scientific)
• SPECfp2000 (all)
• Reservoir simulation
• FDTD (oil & gas)
• Kirchhoff migration (oil & gas)
• Fluid dynamics, ocean models (scientific)
• FFT (oil & gas, mil HPC)
• Option pricing (FSI)
• Molecular dynamics (scientific)
• RTM (oil & gas)
Synthetic Benchmarks
Intel® Xeon Phi™ Coprocessor and Intel® MKL
Results for 2S Intel® Xeon® vs. Intel Xeon Phi with ECC on (higher is better):
• STREAM Triad (GB/s): 75 vs. 171, up to 2.2X
• SMP Linpack (GF/s): 330 vs. 802, up to 2.4X
• DGEMM (GF/s): 347 vs. 887, up to 2.5X
• SGEMM (GF/s): 728 vs. 1,796, up to 2.4X
Efficiency labels on the chart: 84% efficient, 83% efficient, 75% efficient
Coprocessor results: benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native)
Notes
1. Intel® Xeon® Processor E5-2680 used for all: SGEMM matrix = 12800 x 12800, DGEMM matrix = 10752 x 10752, SMP Linpack matrix = 26000 x 26000
2. Intel® Xeon Phi™ coprocessor SE10P (ECC on) with “Gold” SW stack: SGEMM matrix = 12800 x 12800, DGEMM matrix = 12800 x 12800, SMP Linpack matrix = 26872 x 28672
3. Average single-node results from measurements across a set of nodes from the TACC+ Stampede* Cluster
+ Texas Advanced Computing Center (TACC) at the University of Texas at Austin.
++ Measured on the TACC+ Stampede Cluster
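The “up to” factors and the efficiency labels can be checked against the measured values. The sketch below recomputes them; the peak figures are an assumption, not taken from the deck: they suppose the SE10P runs 61 cores at 1.1 GHz with 16 double-precision (32 single-precision) flops per core per cycle.

```python
# (2S Xeon E5-2680, Xeon Phi SE10P) results read from the chart.
results = {
    "STREAM Triad (GB/s)": (75, 171),
    "SMP Linpack (GF/s)": (330, 802),
    "DGEMM (GF/s)": (347, 887),
    "SGEMM (GF/s)": (728, 1796),
}

# Speedup of the coprocessor over the two-socket host.
speedups = {name: phi / xeon for name, (xeon, phi) in results.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")

# ASSUMPTION: SE10P peak = 61 cores x 1.1 GHz x 16 DP flops/cycle.
dp_peak_gf = 61 * 1.1 * 16
sp_peak_gf = 2 * dp_peak_gf

efficiency = {
    "SMP Linpack": 802 / dp_peak_gf,
    "DGEMM": 887 / dp_peak_gf,
    "SGEMM": 1796 / sp_peak_gf,
}
for name, value in efficiency.items():
    print(f"{name}: {value:.0%} of assumed peak")
```

Under that assumption the three efficiency labels line up as SGEMM 84%, DGEMM 83% and SMP Linpack 75%, and each “up to” factor is the measured ratio truncated to one decimal.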
Performance per Watt
Intel® Xeon Phi™ Coprocessor vs. 2S Intel® Xeon® processor (Intel MKL)
1 Intel® Xeon Phi™ coprocessor 5110P vs. 2-socket Intel® Xeon® processor
Relative performance per watt, normalized to a 1.0 baseline of a 2-socket Intel® Xeon® processor E5-2670 (higher is better):
• 2S Intel® Xeon® processor: 1.00
• SMP Linpack: 3.91
• DGEMM: 4.63
• SGEMM: 4.81
Coprocessor results: benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native)
Notes:
1. 2 x Intel® Xeon® Processor E5-2670 (2.6GHz, 8C, 115W)
2. Intel® Xeon Phi™ coprocessor 5110P (ECC on) with Gold RC SW stack (coprocessor power only)
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Source: Intel measured results as of October 26, 2012. Configuration details: please reference slide speaker notes.
For more information go to http://www.intel.com/performance
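The normalization used on this chart can be written as a one-line formula. The sketch below shows the calculation with made-up inputs; the function name and the sample numbers are hypothetical and are not the measurements behind the slide, which does not list the raw performance or power values.

```python
def relative_perf_per_watt(perf, watts, base_perf, base_watts):
    """Performance per watt of a device, normalized to a baseline device."""
    return (perf / watts) / (base_perf / base_watts)


# Hypothetical example: a device delivering 3x the baseline throughput
# while drawing three quarters of the baseline power.
print(relative_perf_per_watt(perf=300, watts=150, base_perf=100, base_watts=200))  # 4.0
```

A baseline compared with itself always yields 1.0, which is why the 2S Intel Xeon bar on the chart sits at 1.00.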
Native, Offload and Variations
Wide Spectrum of Execution Models
Range of models to meet application needs, from multicore centric (Intel® Xeon® processors) to many-core centric (Intel® Many Integrated Core coprocessors):
• Multi-core-hosted: general-purpose serial and parallel computing. Main(), Foo() and MPI_*() run on the multicore host
• Offload: codes with highly parallel phases. Main(), Foo() and MPI_*() run on the host; Foo() is also offloaded to the many-core coprocessor
• Symmetric: codes with balanced needs. Main(), Foo() and MPI_*() run on both the host and the coprocessor
• Many-core-hosted: highly parallel codes. Main(), Foo() and MPI_*() run on the coprocessor
The Intel Manycore Platform Software Stack (MPSS) provides Linux on the coprocessor
Host processor
• User-level code
• System-level code: Linux* OS; Intel® Xeon Phi™ coprocessor support libraries, tools, and drivers
• PCI-E bus
Intel® Xeon Phi™ coprocessor
• User-level code
• System-level code: Linux* OS; Intel® Xeon Phi™ coprocessor communication and application-launch support
• PCI-E bus
Runs either as an accelerator for offloaded host computation…
Host-side offload application
• User code
• Offload libraries, user-level driver, user-accessible APIs and libraries
Target-side offload application
• User code
• Offload libraries, user-accessible APIs and libraries
Advantages
• More memory available
• Better file access
• Host is better on serial code
• Better use of resources
…Or runs as a native or MPI* compute node via IP or OFED
Target-side “native” application
• User code
• Standard OS libraries plus any 3rd-party or Intel libraries
• Virtual terminal session: ssh or telnet connection to the coprocessor IP address
• IB fabric
Advantages
• Simpler model
• No directives
• Easier port
• Good kernel test
Use if
• Not serial
• Modest memory
• Complex code
Flexible: Enables Multiple Programming Models
• Coprocessor only: homogeneous network of many-core CPUs. MPI ranks and data live on the MIC coprocessors
• Host+Offload: homogeneous network of heterogeneous nodes. MPI ranks and data live on the CPUs, which offload work to the MIC coprocessors
• Symmetric: heterogeneous network of homogeneous CPUs. MPI ranks and data live on both the CPUs and the MIC coprocessors
Comprehensive set of SW tools for Xeon and Xeon Phi Programming
Code Analysis
• Advisor XE
• VTune Amplifier XE
• Inspector XE
• Trace Analyzer
Programming Models
• Intel Cilk Plus
• Threading Building Blocks
• OpenMP
• OpenCL
• MPI
• Offload/Native/MYO
Libraries & Compilers
• Math Kernel Library
• Integrated Performance Primitives
• Intel Compilers
Options for Thread Parallelism
Ordered from ease of use / code maintainability to programmer control:
• Intel® Math Kernel Library
• OpenMP*
• Intel® Threading Building Blocks
• Intel® Cilk™ Plus
• OpenCL*
• Pthreads* and other threading libraries
Choice of unified programming to target Intel® Xeon® and Intel® Xeon Phi™ Architecture!
Intel Xeon Phi Case Studies
Parallelizing for High Performance
Example: A Two-Step Process with SAXPY
• Starting point: unoptimized serial code running on multi-core Intel® Xeon® processors. Current performance: 67.097 seconds
• Step 1. Optimize code: parallelize and vectorize code and continue to run on multi-core Intel Xeon processors. 0.46 seconds, 145X faster
• Step 2. Use coprocessors: run all or part of the optimized code on Intel® Xeon Phi™ coprocessors. 0.197 seconds, 2.3X faster
• Overall: 340X faster
The following performance results are based on already optimized code
SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2012
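For readers unfamiliar with the kernel, SAXPY computes y = a·x + y element by element. The sketch below gives a plain reference version and recomputes the three speedup factors from the timings quoted on the slide; it is not the benchmarked code, which the deck does not show, and the function name `saxpy` is simply the conventional BLAS name.

```python
def saxpy(a, x, y):
    """Reference SAXPY: return a*x + y element-wise (no parallelism)."""
    return [a * xi + yi for xi, yi in zip(x, y)]


print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]

# Timings quoted on the slide, in seconds.
serial_s, optimized_s, coprocessor_s = 67.097, 0.46, 0.197

step1 = serial_s / optimized_s         # optimize on the host
step2 = optimized_s / coprocessor_s    # move to the coprocessor
overall = serial_s / coprocessor_s     # product of the two steps

print(f"step 1: {step1:.1f}x, step 2: {step2:.1f}x, overall: {overall:.1f}x")
```

The arithmetic makes the slide's message explicit: almost all of the overall gain comes from parallelizing and vectorizing on the host, and the coprocessor multiplies that result by a further factor of roughly two.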
Performance Proof-Point: Government and Academic Research
BQCD
• Application: Hybrid Monte-Carlo program that simulates lattice QCD with dynamical Wilson fermions. It is one of the main production programs of the QCDSF collaboration (DEISA) and beyond, used for quark simulation.
• Status: Many optimizations already in the released version; more optimizations and an alternative offload-model version in development
• Demonstrated Results:
- No source code changes
- Recompiled, selected run-time parameters to get maximum performance
“The performance improvement for BQCD using the Intel Xeon Phi coprocessor was reached in record time, requiring only recompilation. We are confident that larger speed-ups can be obtained with modest modifications of the code.”
Prof. Dr. Tilo Wettig
Principal Investigator of the QPACE project
Chart: BQCD scalability in Gflops/sec (higher is better) at 1, 2, 4 and 8 on the x-axis, comparing:
• 2S Intel® Xeon® Processor E5-2670
• Intel® Xeon Phi™ coprocessor, native (pre-production HW/SW)
• 2S Intel Xeon E5-2670 + Intel® Xeon Phi™ coprocessor, symmetric (pre-production HW/SW)
SOURCE: INTEL MEASURED MARCH’13
Performance Proof-Point: Energy Industry
CGG: WAVE EQUATION MIGRATION (WEM)
• Application: Seismic imaging technique used to obtain a subsurface depth image from input seismic data
• Status: See presentation at the Rice O&G HPC workshop, http://rice2013.og-hpc.org/technical-program
• Execution Model: Fully hybrid MPI+OpenMP using symmetric mode
– Highly scalable on cluster
• Code Optimization:
– Minimal source code changes for dynamic load balancing
Speedup (higher is better):
• 2S Intel® Xeon® processor E5-2670, 4 MPI / 4 OMP: 1
• Intel® Xeon Phi™ coprocessor (pre-production HW/SW), 12 MPI / 20 OMP: 2.57
• 2S Intel Xeon processor E5-2670 (4/4) + Intel® Xeon Phi™ coprocessor (12/20) (pre-production HW/SW): 3.57
• 2S Intel Xeon processor E5-2670 (4/4) + 2x Intel® Xeon Phi™ coprocessor (12/20 + 12/20) (pre-production HW/SW): 6.14
SOURCE: ARSLAN ET AL., CGG 2013, MARCH’13
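One way to read the WEM chart is that, in symmetric mode, the throughput of each device simply adds up. The sketch below checks that reading against the four reported speedups and works out the thread counts implied by the MPI/OpenMP settings; the additive model is an observation about these numbers, not a claim made on the slide.

```python
xeon_only = 1.0    # 2S Xeon E5-2670 baseline
phi_only = 2.57    # one coprocessor, relative to the baseline

# Additive model: each device contributes its own throughput.
xeon_plus_one_phi = round(xeon_only + phi_only, 2)
xeon_plus_two_phi = round(xeon_only + 2 * phi_only, 2)
print(xeon_plus_one_phi, xeon_plus_two_phi)  # 3.57 6.14

# Threads per device: MPI ranks x OpenMP threads per rank.
host_threads = 4 * 4     # "4 MPI / 4 OMP"
card_threads = 12 * 20   # "12 MPI / 20 OMP"
print(host_threads, card_threads)  # 16 240
```

Near-additive scaling of this kind is what dynamic load balancing is meant to achieve: neither the host nor the coprocessors sit idle waiting for the other.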
Performance Proof-Point: Financial Services
MONTE CARLO EUROPEAN OPTIONS
• Application: Monte Carlo algorithms are used to evaluate complex instruments, portfolios, and investments. Performance depends on raw computational power and the performance of exp2()
• Status: Case Study available
• Highlights: Dramatic performance scaling for both single-precision and double-precision calculations
• Demonstrated Results:
- Intel® Xeon Phi™ coprocessor fast exp2() and FMA instructions deliver high performance, high accuracy for single-precision computations
- Compiler-based loop unrolling delivers high performance
- Cache blocking further optimizes cache utilization, reduces cache misses, and makes outer loop vectorization possible
• Read the Case Study: software.intel.com/en-us/articles/case-study-achieving-high-performance-on-monte-carlo-european-option-on-intel-xeon-phi
Speedup (higher is better):
• 2S Intel® Xeon® processor E5-2670: 1 (single precision), 1 (double precision)
• 2S Intel Xeon processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW): 10.36 (single precision), 3.34 (double precision)
SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2013
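To show why this workload is dominated by exponentials and independent arithmetic, here is a minimal Monte Carlo pricer for a European call option. It is a plain-Python sketch under textbook assumptions (geometric Brownian motion, constant rate and volatility); the function names and parameter values are illustrative, and it does not reproduce the exp2(), FMA or cache-blocking optimizations from the case study.

```python
import math
import random


def mc_european_call(s0, strike, rate, sigma, maturity, n_paths, seed=0):
    """Estimate a European call price by averaging discounted payoffs."""
    rng = random.Random(seed)
    drift = (rate - 0.5 * sigma * sigma) * maturity
    vol = sigma * math.sqrt(maturity)
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)                # one normal draw per path
        s_t = s0 * math.exp(drift + vol * z)   # terminal asset price
        total += max(s_t - strike, 0.0)        # call payoff
    return math.exp(-rate * maturity) * total / n_paths


def black_scholes_call(s0, strike, rate, sigma, maturity):
    """Closed-form price, used only to sanity-check the estimate."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(s0 / strike) + (rate + 0.5 * sigma ** 2) * maturity) / (sigma * sqrt_t)
    d2 = d1 - sigma * sqrt_t

    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return s0 * cdf(d1) - strike * math.exp(-rate * maturity) * cdf(d2)


estimate = mc_european_call(100.0, 100.0, 0.05, 0.2, 1.0, n_paths=100_000)
reference = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
print(f"Monte Carlo: {estimate:.3f}  closed form: {reference:.3f}")
```

Every path is independent and costs one exponential plus a few multiply-adds, which is the pattern that wide vector units with fast exponentials and FMA are built for.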
Performance Proof-Point: Government and Academic Research
WEATHER RESEARCH AND FORECASTING (WRF)
• Application: Weather Research and Forecasting (WRF)
• Status: WRF V3.5 was released 4/18/13
• Code Optimization:
– Approximately two dozen files with less than 2,000 lines of code were modified (out of approximately 700,000 lines of code in about 800 files, all Fortran standard compliant)
– Most modifications improved performance for both the host and the coprocessors
• Performance Measurements: Pre-release of WRF 3.5 (V3.5Pre) and the NCAR-supported CONUS2.5KM benchmark (a high-resolution weather forecast)
• Acknowledgments: There were many contributors to these results, including the National Renewable Energy Laboratory and The Weather Channel Companies
Speedup (higher is better):
• 2S Intel® Xeon® processor E5-2670 with eight-node cluster configuration: 1
• 2S Intel® Xeon® processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW) with eight-node cluster configuration: 1.4
SOURCE: INTEL MEASURED RESULTS AS OF JULY, 2013
Performance Proof-Point: Government and Academic Research
SANDIA MANTEVO miniFE
• Application: Sandia National Laboratories' best approximation to an unstructured implicit finite element or finite volume application in fewer than 8000 lines of code
• Status: available at http://software.sandia.gov/trac/mantevo/browser/trunk/packages
• Demonstrated Results:
- Porting was easy using OpenMP
- Substituting an Intel MKL routine for the sparse matrix-vector product accelerated performance and will simplify future optimization
- The Intel MPI Library enables rapid performance improvement when adding an Intel® Xeon Phi™ coprocessor
• Read the Case Study: software.intel.com/en-us/articles/running-minife-on-intel-xeon-phi-coprocessors
“The programming models available for the Intel MIC Architecture are open-standard and portable between traditional processors and Intel Xeon Phi coprocessors. This should allow us to leverage code development across multiple platforms.”
James A. Ang, Ph.D.
Extreme-scale Computing, Sandia National Laboratories
Speedup (higher is better):
• 2S Intel® Xeon® processor E5-2670: 1
• 2S Intel Xeon processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW): 2.2
SOURCE: INTEL MEASURED RESULTS AS OF MARCH, 2012
DEMONSTRATED PERFORMANCE BENEFITS
Intel® Xeon Phi™ Coprocessor
• Seismic: Acceleware 8th Order Isotropic Variable Velocity (note 2): up to 2.23X
• Finite Element Analysis: Sandia National Labs MiniFE (note 1): up to 2X
• China Oil & Gas Geoeast Pre-stack Time Migration (note 3): up to 3.54X
Notes:
1. 8-node cluster, each node with 2S Xeon* (comparison is cluster performance with and without 1 Xeon Phi* per node) (Hetero)
2. 2S Xeon* vs. 1 Xeon Phi* (preproduction HW/SW & application running 100% on coprocessor unless otherwise noted)
3. 2S Xeon* vs. 2S Xeon* + 2 Xeon Phi* (offload)
SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2012
DEMONSTRATED PERFORMANCE BENEFITS
Intel® Xeon Phi™ Coprocessor
• Finance: Monte Carlo SP (note 3): up to 10.75X
• Physics: Jefferson Lab Lattice QCD: up to 2.7X
• Black-Scholes SP (note 3): up to 7X
• Embree Ray Tracing: Intel Labs Ray Tracing (note 2): speed-up 2.11X
Notes:
1. 2S Xeon* vs. 1 Xeon Phi* (preproduction HW/SW & application running 100% on coprocessor unless otherwise noted)
2. Intel measured Oct. 2012
3. Includes additional FLOPS from transcendental function unit
SOURCE: INTEL MEASURED RESULTS AS OF NOVEMBER, 2012
Intel Xeon Phi Ecosystem
Implementation Proof-Point: Government and Academic Research
Texas Advanced Computing Center (TACC)
• System: TACC Stampede is a 10 petaflop supercomputer, one of the largest computing systems in the world for open science research. It became operational on January 7, 2013
• Status: In service
• Workloads: Runs hundreds of applications for thousands of users around the world
• Performance:
– More than 7 petaflops using Intel® Xeon Phi™ coprocessors (note 1)
– More than 2 petaflops using the Intel® Xeon® processor E5 family (note 1)
• More Information:
– SC12 interview: insidehpc.com/2012/12/06/video-intel-xeon-phi-powers-7-tacc-stampede-super/
– TACC HPC systems overview: www.tacc.utexas.edu/resources/hpc
Note 1: http://www.tacc.utexas.edu/resources/hpc/stampede
Tianhe-2 System: #1 June 2013 Top500 List
System: Located in Southwest China, it contains 16,000 nodes composing the world's largest (public) installation of Intel Ivy Bridge and Xeon Phi processors. Each cluster node is formed with
• 2 CPUs: 12-core Intel® Xeon® Ivy Bridge @ 2.2GHz
• 3 Intel® Xeon Phi™ cards, each with 57 cores @ 1.1GHz
Performance: Theoretical peak of 54.9 Pflop/s
• 6.8 Pflop/s from 32,000 Xeon Ivy Bridge sockets
• 48.1 Pflop/s from 48,000 Xeon Phi cards
• for a total of 3,120,000 cores.
30.65 Pflop/s sustained Linpack.
More Information: "Visit to the National University for Defense Technology Changsha, China." Jack Dongarra, University of Tennessee, and Oak Ridge National Laboratory. June 2013.
www.netlib.org/utk/people/JackDongarra/PAPERS/tianhe-2-dongarra-report.pdf
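The Tianhe-2 peak figures can be rebuilt from the per-node configuration. The sketch below does that arithmetic under stated assumptions that are not on the slide: 12 cores per Xeon socket (the value consistent with the quoted total of 3,120,000 cores), 8 double-precision flops per cycle per Ivy Bridge core, and 16 per Xeon Phi core (8 vector lanes with fused multiply-add).

```python
NODES = 16_000
XEON_SOCKETS_PER_NODE = 2
PHI_CARDS_PER_NODE = 3

# ASSUMPTIONS: cores per Xeon socket and flops per cycle are not on the slide.
XEON_CORES, XEON_GHZ, XEON_FLOPS_PER_CYCLE = 12, 2.2, 8
PHI_CORES, PHI_GHZ, PHI_FLOPS_PER_CYCLE = 57, 1.1, 16

sockets = NODES * XEON_SOCKETS_PER_NODE
cards = NODES * PHI_CARDS_PER_NODE
total_cores = sockets * XEON_CORES + cards * PHI_CORES

# Peak per device in Gflop/s, then system totals in Pflop/s.
xeon_gf = XEON_CORES * XEON_GHZ * XEON_FLOPS_PER_CYCLE
phi_gf = PHI_CORES * PHI_GHZ * PHI_FLOPS_PER_CYCLE
xeon_pflops = sockets * xeon_gf / 1e6
phi_pflops = cards * phi_gf / 1e6
total_pflops = xeon_pflops + phi_pflops

print(sockets, cards, total_cores)
print(f"{xeon_pflops:.2f} + {phi_pflops:.2f} = {total_pflops:.2f} Pflop/s")
```

With these inputs the totals come out at 32,000 sockets, 48,000 cards, 3,120,000 cores and roughly 6.76 + 48.15 = 54.91 Pflop/s, in line with the figures quoted on the slide.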
A Growing Software Ecosystem:
Developing today on Intel® Xeon Phi™ coprocessors
Shown at SC’12, November 2012
Conclusions & References
Conclusions
Intel® Xeon Phi™ coprocessor advantages:
• Comparable performance potential to other accelerators
• Faster time to solution due to reduced development effort
• Better investment protection with a single code base for
processors and coprocessors
A flexible, wide range of programming models: from pure native to offloaded, and all variants in between
All with the familiar Intel development environment
Intel® Xeon Phi™ Coprocessor Developer Site: http://software.intel.com/mic-developer
One Stop Shop for:
• Tools & Software Downloads
• Getting Started Development Guides
• Video Workshops, Tutorials, & Events
• Code Samples & Case Studies
• Articles, Forums, & Blogs
• Associated Product Links
Thank you.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY
ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS
DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR
IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES
RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY
PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on
Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using
specific computer systems, components, software, operations and functions. Any change to any of
those factors may cause the results to vary. You should consult other information and performance
tests to assist you in fully evaluating your contemplated purchases, including the performance of that
product when combined with other products.
Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core,
VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that
are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and
other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended
for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for
Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information
regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice

Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Introduction to the Intel® Xeon Phi™ Coprocessor - Intel Software Conference 2013

  • 1. Introduction to the Intel® Xeon Phi™ Coprocessor. Leo Borges (leonardo.borges@intel.com), Intel - Software and Services Group. iStep-Brazil, August 2013. © 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
  • 2. Introduction; High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software; Intel Xeon Phi Case Studies; Intel Xeon Phi Ecosystem; Conclusions & References
  • 3. Intel in High-Performance Computing: a long-term commitment to the HPC market segment. Large Scale Clusters for Test & Optimization; Tera-Scale Research; Leading Performance, Energy Efficient; Platform Building Blocks; Dedicated, Renowned Applications Expertise; Broad Software Tools Portfolio; Defined HPC Application Platform; Many Integrated Core Architecture; Manufacturing Process Technologies; Exa-Scale Labs.
  • 4. HPC Processor Solutions. Common Intel environment: portable code, common tools. Multi-Core: Xeon®, a general purpose architecture with leadership per-core performance, FP/core via AVX, and multi-core performance. Xeon variants: EN (general purpose perf/watt); EP (max perf/watt with higher memory BW/freq and QPI, ideal for HPC); EX (additional sockets & big memory); EP 4S (additional compute density). Many-Core: Intel® Xeon Phi™ Coprocessor, which trades a “big” IA core for multiple lower-performance IA cores, resulting in higher performance for a subset of highly parallel applications.
  • 5. Big Gains for Selected Applications. Highly parallel and vectorized applications, or those that need higher memory bandwidth, will run even faster on Intel® Xeon Phi™ Coprocessors. Most applications will still run best on multi-core Intel® Xeon® processors. Optimizing code often delivers significant performance gains. [Figure: running existing serial software vs. running optimized software.] Selected application areas: medical imaging and biophysics; computer aided design & manufacturing; climate modeling & weather prediction; financial analyses, trading; energy & oil exploration; digital content creation.
  • 6. Evaluating Your Applications for Intel® Xeon Phi™. [Flowchart with YES/NO branches on three questions.] Can your workload scale to over 100 threads? Can your workload benefit from large vectors? Can your workload benefit from more memory bandwidth? Use Intel® Xeon Phi™ coprocessors for applications that scale with: threads; vectors; memory bandwidth.
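The three questions on the evaluation slide can be read as a simple screening rule. The sketch below is only one possible reading: the exact YES/NO wiring of the flowchart is not recoverable from the transcript, so the rule used here (thread scaling is required, plus at least one of vector or bandwidth benefit) is an assumption, and the names `Workload` and `suits_xeon_phi` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of an application, for illustration only."""
    max_useful_threads: int               # threads the code can keep busy
    benefits_from_wide_vectors: bool
    benefits_from_memory_bandwidth: bool


def suits_xeon_phi(w: Workload) -> bool:
    """Assumed reading of the flowchart, not an official Intel rule."""
    scales_with_threads = w.max_useful_threads > 100
    return scales_with_threads and (
        w.benefits_from_wide_vectors or w.benefits_from_memory_bandwidth
    )


stencil_code = Workload(240, True, True)
legacy_serial_code = Workload(1, False, False)
print(suits_xeon_phi(stencil_code))
print(suits_xeon_phi(legacy_serial_code))
```

A workload that fails the thread-scaling question is treated as a better fit for the multi-core host, which matches the earlier statement that most applications still run best on Xeon processors.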
  • 7. Introduction; High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software; Intel Xeon Phi Case Studies; Intel Xeon Phi Ecosystem; Conclusions & References
  • 8. Intel® Xeon Phi™ Product Family, Based on the Intel MIC Architecture. Intel Many Integrated Core (MIC, pronounced “Mike”) product family/architecture for highly parallel applications: based on a large number of smaller, low-power Intel Architecture cores; 512-bit wide vector engine; complements the Intel Xeon processor product line; provides breakthrough performance for highly parallel apps (familiar x86 programming model; the same source code supports both the Intel Xeon processor and the Intel Xeon Phi coprocessor; initially a coprocessor with a PCI Express form factor). First products announced at SC12, code-named Knights Corner (KNC): up to 61 cores, 4 threads per core; up to 16GB GDDR5 memory (up to 352 GB/s); 225-300W (cooling: both passive and active SKUs); x16 PCIe form factor (requires IA host).
  • 9. Each Xeon Phi can be addressed as an individual node in the cluster. 6 to 16 GB GDDR5 memory.
  • 10. Intel® Xeon Phi™ Coprocessors. 3 Family (3120P, 3120A): outstanding parallel computing solution, performance/$ leadership; 6GB GDDR5, 240 GB/s, >1 TFlops DP. 5 Family (5120P, 5120D): optimized for high density environments, performance/watt leadership; 8GB GDDR5, >300 GB/s, >1 TFlops DP. 7 Family (7120P, 7120X): highest level of features, performance leadership; 16GB GDDR5, 352 GB/s, >1.2 TFlops DP, Turbo.
  • 11. Introduction; High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software, Performance Considerations; Intel Xeon Phi Case Studies; Intel Xeon Phi Ecosystem; Conclusions & References
  • 12. Application Characterization, based on memory access and flops required: temporal/spatial locality of data; bandwidth requirement. [Figure: math kernels, applications, and market segments placed on a spectrum from bandwidth limited to core limited; color legend Y: math kernel, B: application, W: segment. Labels in transcript order: 6 GB/s; Stream-triad; BLAS1 & BLAS2 (All); Linpack; DGEMM (Mfg & Scientific); Sparse Matrix-Vector (Scientific); SPECfp2000 (All); Reservoir Simulation; FTDT (Oil & Gas); Kirchhoff Migration (Oil & Gas); Fluid Dynamics; Ocean Models (Scientific); FFT (Oil & Gas, Mil HPC); Option pricing (FSI); Molecular Dynamic (Scientific); RTM (Oil & Gas).]
  • 13. Synthetic Benchmarks: Intel® Xeon Phi™ Coprocessor and Intel® MKL (higher is better; 2S Intel® Xeon® vs. Intel Xeon Phi). STREAM Triad (GB/s, ECC on): 75 vs. 171 (up to 2.2X). SMP Linpack (GF/s): 330 vs. 802 (up to 2.4X). DGEMM (GF/s): 347 vs. 887 (up to 2.5X). SGEMM (GF/s): 728 vs. 1,796 (up to 2.4X). Efficiency labels on the chart: 84%, 83%, 75%. Coprocessor results: benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native). Notes: 1. Intel® Xeon® Processor E5-2680 used for all; SGEMM matrix = 12800 x 12800, DGEMM matrix 10752 x 10752, SMP Linpack matrix 26000 x 26000. 2. Intel® Xeon Phi™ coprocessor SE10P (ECC on) with “Gold” SW stack; SGEMM matrix = 12800 x 12800, DGEMM matrix 12800 x 12800, SMP Linpack matrix 26872 x 28672. 3. Average single-node results from measurements across a set of nodes from the TACC+ Stampede* Cluster. + Texas Advanced Computing Center (TACC) at the University of Texas at Austin. ++ Measured on the TACC+ Stampede Cluster.
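The "up to" factors on the synthetic-benchmark slide can be checked directly from the bar values. The sketch below recomputes them; it assumes the lower number in each pair is the 2-socket Xeon result and the higher one the Xeon Phi result, which the transcript implies but does not state explicitly. The helper name `speedup` is hypothetical.

```python
# (2S Xeon, Xeon Phi) values as transcribed from the benchmark slide.
# Assumption: the lower number of each pair is the 2-socket Xeon result.
results = {
    "STREAM Triad (GB/s)": (75, 171),
    "SMP Linpack (GF/s)": (330, 802),
    "DGEMM (GF/s)": (347, 887),
    "SGEMM (GF/s)": (728, 1796),
}


def speedup(host: float, coprocessor: float) -> float:
    """Ratio of coprocessor result to host result."""
    return coprocessor / host


for name, (host, phi) in results.items():
    print(f"{name}: {speedup(host, phi):.2f}x")
```

The computed ratios fall between 2.2 and 2.6, consistent with the slide's rounded-down labels.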
  • 14. Performance per Watt: Intel® Xeon Phi™ Coprocessor vs. 2S Intel® Xeon® processor (Intel MKL); 1 Intel® Xeon Phi™ coprocessor 5110P vs. a 2-socket Intel® Xeon® processor. Relative performance per watt (normalized to a 1.0 baseline of a 2-socket Intel® Xeon® processor E5-2670; higher is better): 2S Intel® Xeon® processor 1.00; SMP Linpack 3.91; DGEMM 4.63; SGEMM 4.81. Coprocessor results: benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native). Notes: 1. 2 x Intel® Xeon® Processor E5-2670 (2.6GHz, 8C, 115W). 2. Intel® Xeon Phi™ coprocessor 5110P (ECC on) with Gold RC SW stack (coprocessor power only). Source: Intel measured results as of October 26, 2012. Configuration details: please reference slide speaker notes. For more information go to http://www.intel.com/performance. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
  • 15. Introduction; High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software, Native, Offload and Variations; Intel Xeon Phi Case Studies; Intel Xeon Phi Ecosystem; Conclusions & References
  • 16. Wide Spectrum of Execution Models: a range of models to meet application needs, from multicore centric (Intel® Xeon® processors) to many-core centric (Intel® Many Integrated Core coprocessors). Multi-core-hosted: general purpose serial and parallel computing. Offload: codes with highly-parallel phases. Symmetric: codes with balanced needs. Many-core-hosted: highly-parallel codes. [Figure: for each model, where Main(), Foo(), and MPI_*() run, on the multicore side, the many-core side, or both.]
  • 17. The Intel Manycore Platform Software Stack (MPSS) provides Linux on the coprocessor. [Diagram: host processor and Intel® Xeon Phi™ coprocessor, each running Linux* OS with system-level and user-level code, connected over the PCI-E bus. Host side: Intel® Xeon Phi™ coprocessor support libraries, tools, and drivers. Coprocessor side: Intel® Xeon Phi™ coprocessor communication and application-launch support.]
  • 18. Runs either as an accelerator for offloaded host computation… [Diagram: same host/coprocessor stack as the previous slide. Host-side offload application: user code; offload libraries, user-level driver, user-accessible APIs and libraries. Target-side offload application: user code; offload libraries, user-accessible APIs and libraries.] Advantages: more memory available; better file access; host is better on serial code; makes better use of resources.
  • 19. …Or runs as a native or MPI* compute node via IP or OFED. [Diagram: same host/coprocessor stack as the previous slides, with an ssh or telnet connection to the coprocessor IP address (virtual terminal session) and an IB fabric. Target-side “native” application: user code; standard OS libraries plus any 3rd-party or Intel libraries.] Advantages: simpler model; no directives; easier port; good kernel test. Use if: not serial; modest memory; complex code.
  • 20. Flexible: Enables Multiple Programming Models. Coprocessor only: homogeneous network of many-core CPUs (MPI between MICs over the network). Host+Offload: homogeneous network of heterogeneous nodes (MPI between CPUs, each offloading to its MIC). Symmetric: heterogeneous network of homogeneous CPUs (MPI between CPUs and MICs alike).
  • 21. Comprehensive set of SW tools for Xeon and Xeon Phi programming. Code Analysis: Advisor XE; VTune Amplifier XE; Inspector XE; Trace Analyzer. Programming Models: Intel Cilk Plus; Threading Building Blocks; OpenMP; OpenCL; MPI; Offload/Native/MYO. Libraries & Compilers: Math Kernel Library; Integrated Performance Primitives; Intel Compilers.
  • 22. Options for Thread Parallelism: Intel® Math Kernel Library; OpenMP*; Intel® Threading Building Blocks; Intel® Cilk™ Plus; OpenCL*; Pthreads* and other threading libraries. [Figure axis: from ease of use / code maintainability to programmer control.] Choice of unified programming to target Intel® Xeon® and Intel® Xeon Phi™ Architecture!
  • 23. Introduction; High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software; Intel Xeon Phi Case Studies; Intel Xeon Phi Ecosystem; Conclusions & References
  • 24. Parallelizing for High Performance. Example: A Two-Step Process with SAXPY. Starting point: unoptimized serial code running on multi-core Intel® Xeon® processors, 67.097 seconds (current performance). Step 1, optimize code: parallelize and vectorize the code and continue to run on multi-core Intel Xeon processors, 0.46 seconds (145X faster). Step 2, use coprocessors: run all or part of the optimized code on Intel® Xeon Phi™ coprocessors, 0.197 seconds (2.3X faster). Overall: 340X faster. The following performance results are based on already optimized code. Source: Intel measured results as of November, 2012.
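The factors on the SAXPY slide follow from the three timings, and the overall gain is the product of the two steps. The sketch below reproduces that arithmetic and, for reference, spells out the SAXPY operation itself (y = a*x + y) in plain Python; the function and variable names are illustrative, and this is not the code that was measured.

```python
def saxpy(a, x, y):
    """Return a*x + y element-wise (the SAXPY operation)."""
    return [a * xi + yi for xi, yi in zip(x, y)]


# Timings in seconds, as transcribed from the slide.
serial_s = 67.097       # unoptimized serial code on Xeon
optimized_s = 0.46      # parallelized and vectorized on Xeon
coprocessor_s = 0.197   # optimized code on Xeon Phi

step1 = serial_s / optimized_s        # gain from optimizing the code
step2 = optimized_s / coprocessor_s   # additional gain from the coprocessor
overall = serial_s / coprocessor_s    # equals step1 * step2

print(f"step 1: {step1:.1f}x, step 2: {step2:.2f}x, overall: {overall:.1f}x")
print(saxpy(2.0, [1.0, 2.0], [10.0, 20.0]))
```

Most of the gain comes from step 1, which is why the slide stresses that the later results start from already optimized code.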
  • 25. Performance Proof-Point: Government and Academic Research. BQCD. Application: hybrid Monte-Carlo program that simulates lattice QCD with dynamical Wilson fermions; one of the main production programs for quark simulation, used by the QCDSF collaboration (DEISA) and beyond. Status: many optimizations already in the released version; more optimizations and an alternative offload model version in development. Demonstrated results: no source code changes; recompiled, with run-time parameters selected to get maximum performance. “The performance improvement for BQCD using the Intel Xeon Phi coprocessor was reached in record time, requiring only recompilation. We are confident that larger speed-ups can be obtained with modest modifications of the code.” Prof. Dr. Tilo Wettig, Principal Investigator of the QPACE project. [Figure: BQCD scalability in Gflops/sec (higher is better) at x-axis points 1, 2, 4, 8, for three configurations: 2S Intel® Xeon® Processor E5-2670; Intel® Xeon Phi™ coprocessor, native (pre-production HW/SW); 2S Intel Xeon E5-2670 + Intel® Xeon Phi™ coprocessor, symmetric (pre-production HW/SW).] Source: Intel measured March ’13.
  • 26. Performance Proof-Point: Energy Industry. CGG: Wave Equation Migration (WEM). Application: seismic imaging technique used to obtain a subsurface depth image from input seismic data. Status: see the presentation at the Rice O&G HPC workshop, http://rice2013.og-hpc.org/technical-program. Execution model: fully hybrid MPI+OpenMP using symmetric mode; highly scalable on a cluster. Code optimization: minimal source code changes for dynamic load balancing. Speedup (higher is better): 2S Intel® Xeon® processor E5-2670 (4 MPI / 4 OMP): 1; Intel® Xeon Phi™ coprocessor (pre-production HW/SW, 12 MPI / 20 OMP): 2.57; 2S Intel Xeon processor E5-2670 (4/4) + Intel® Xeon Phi™ coprocessor (12/20) (pre-production HW/SW): 3.57; 2S Intel Xeon processor E5-2670 (4/4) + 2x Intel® Xeon Phi™ coprocessor (12/20 + 12/20) (pre-production HW/SW): 6.14. Source: Arslan et al., CGG 2013, March ’13.
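The WEM speedups are almost exactly additive: host plus one coprocessor gives 1 + 2.57 = 3.57, and host plus two gives 1 + 2 × 2.57 = 6.14, which is what well-balanced symmetric-mode execution would look like. The sketch below illustrates the idea with a static split of work items in proportion to each device's relative throughput. This is a simplification chosen for illustration, not CGG's dynamic load-balancing scheme, and `split_work` and the device names are hypothetical.

```python
def split_work(total_items, throughputs):
    """Split work items across devices in proportion to relative throughput.

    Static proportional split (illustrative only); any rounding remainder
    goes to the fastest device so that every item is assigned.
    """
    total = sum(throughputs.values())
    shares = {
        name: int(total_items * rate / total)
        for name, rate in throughputs.items()
    }
    fastest = max(throughputs, key=throughputs.get)
    shares[fastest] += total_items - sum(shares.values())
    return shares


# Relative throughputs taken from the slide's speedup figures.
devices = {"host": 1.0, "phi0": 2.57, "phi1": 2.57}
shares = split_work(1000, devices)
print(shares, sum(shares.values()))

# With perfect balance, the combined speedup is the sum of the parts.
print(round(sum(devices.values()), 2))
```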
  • 27. Performance Proof-Point: Financial Services. Monte Carlo European Options. Application: Monte Carlo algorithms are used to evaluate complex instruments, portfolios, and investments; performance depends on raw computational power and the performance of exp2(). Status: case study available. Highlights: dramatic performance scaling for both single-precision and double-precision calculations. Demonstrated results: the Intel® Xeon Phi™ coprocessor's fast exp2() and FMA instructions deliver high performance and high accuracy for single-precision computations; compiler-based loop unrolling delivers high performance; cache blocking further optimizes cache utilization, reduces cache misses, and makes outer-loop vectorization possible. Read the case study: software.intel.com/en-us/articles/case-study-achieving-high-performance-on-monte-carlo-european-option-on-intel-xeon-phi. Speedup (higher is better), 2S Intel® Xeon® processor E5-2670 vs. 2S Intel Xeon processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW): single precision 1 vs. 10.36; double precision 1 vs. 3.34. Source: Intel measured results as of July, 2013.
  • 28. Performance Proof-Point: Government and Academic Research. Weather Research and Forecasting (WRF). Application: Weather Research and Forecasting (WRF). Status: WRF V3.5 was released 4/18/13. Code optimization: approximately two dozen files with less than 2,000 lines of code were modified (out of approximately 700,000 lines of code in about 800 files, all Fortran standard compliant); most modifications improved performance for both the host and the coprocessors. Performance measurements: pre-release of WRF 3.5 (V3.5Pre) and the NCAR-supported CONUS2.5KM benchmark (a high-resolution weather forecast). Acknowledgments: there were many contributors to these results, including the National Renewable Energy Laboratory and The Weather Channel Companies. Speedup (higher is better), eight-node cluster configuration: 2S Intel® Xeon® processor E5-2670: 1; 2S Intel® Xeon® processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW): 1.4. Source: Intel measured results as of July, 2013.
  • 29. Performance Proof-Point: Government and Academic Research. Sandia Mantevo miniFE. Application: Sandia National Laboratories' best approximation to an unstructured implicit finite element or finite volume application in fewer than 8000 lines of code. Status: available at http://software.sandia.gov/trac/mantevo/browser/trunk/packages. Demonstrated results: porting was easy using OpenMP; substituting an Intel MKL routine for the sparse matrix-vector product accelerated performance and will simplify future optimization; the Intel MPI Library enables rapid performance improvement when adding an Intel® Xeon Phi™ coprocessor. Read the case study: software.intel.com/en-us/articles/running-minife-on-intel-xeon-phi-coprocessors. Speedup (higher is better): 2S Intel® Xeon® processor E5-2670: 1; 2S Intel Xeon processor E5-2670 + Intel® Xeon Phi™ coprocessor (pre-production HW/SW): 2.2. Source: Intel measured results as of March, 2012. “The programming models available for the Intel MIC Architecture are open-standard and portable between traditional processors and Intel Xeon Phi coprocessors. This should allow us to leverage code development across multiple platforms.” James A. Ang, Ph.D., Extreme-scale Computing, Sandia National Laboratories.
  • 30. Demonstrated Performance Benefits: Intel® Xeon Phi™ Coprocessor. Seismic: up to 2.23X, Acceleware 8th Order Isotropic Variable Velocity (note 2). Finite element analysis: up to 2X, Sandia National Labs MiniFE (note 1). Up to 3.54X, China Oil & Gas Geoeast Pre-stack Time Migration (note 3). Notes: 1. 8-node cluster, each node with 2S Xeon* (comparison is cluster performance with and without 1 Xeon Phi* per node) (hetero). 2. 2S Xeon* vs. 1 Xeon Phi* (pre-production HW/SW and application running 100% on coprocessor unless otherwise noted). 3. 2S Xeon* vs. 2S Xeon* + 2 Xeon Phi* (offload). Source: Intel measured results as of November, 2012.
  • 31. Demonstrated Performance Benefits: Intel® Xeon Phi™ Coprocessor. Finance: up to 10.75X, Monte Carlo SP (note 3); up to 7X, Black-Scholes SP (note 3). Physics: up to 2.7X, Jefferson Lab Lattice QCD. Ray tracing: speed-up 2.11X, Intel Labs Embree Ray Tracing (note 2). Notes: 1. 2S Xeon* vs. 1 Xeon Phi* (pre-production HW/SW and application running 100% on coprocessor unless otherwise noted). 2. Intel measured Oct. 2012. 3. Includes additional FLOPS from transcendental function unit. Source: Intel measured results as of November, 2012.
  • 32. Introduction; High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software; Intel Xeon Phi Case Studies; Intel Xeon Phi Ecosystem; Conclusions & References
  • 33. Implementation Proof-Point: Government and Academic Research. Texas Advanced Computing Center (TACC). System: TACC Stampede is a 10 petaflop supercomputer, one of the largest computing systems in the world for open science research; it became operational on January 7, 2013. Status: in service. Workloads: runs hundreds of applications for thousands of users around the world. Performance: more than 7 petaflops using Intel® Xeon Phi™ coprocessors (note 1); more than 2 petaflops using the Intel® Xeon® processor E5 family (note 1). More information: SC12 interview, insidehpc.com/2012/12/06/video-intel-xeon-phi-powers-7-tacc-stampede-super/; TACC HPC systems overview, www.tacc.utexas.edu/resources/hpc. Note 1: http://www.tacc.utexas.edu/resources/hpc/stampede
  • 34. Tianhe-2 System: #1 on the June 2013 Top500 List. System: located in Southwest China, it contains 16,000 nodes composing the world's largest (public) installation of Intel Ivy Bridge and Xeon Phi processors. Each cluster node is formed with 2 Intel® Xeon® Ivy Bridge CPUs (12 cores each, as implied by the core total below) @ 2.2GHz and 3 Intel® Xeon Phi™ cards, each with 57 cores @ 1.1GHz. Performance: theoretical peak of 54.9 Pflop/s (6.8 Pflop/s from 32,000 Xeon Ivy Bridge sockets and 48.1 Pflop/s from 48,000 Xeon Phi cards), for a total of 3,120,000 cores; 30.65 Pflop/s sustained Linpack. More information: "Visit to the National University for Defense Technology Changsha, China." Jack Dongarra, University of Tennessee, and Oak Ridge National Laboratory. June 2013. www.netlib.org/utk/people/JackDongarra/PAPERS/tianhe-2-dongarra-report.pdf
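The Tianhe-2 peak figures can be reproduced from the node counts and clock rates on the slide. The sketch below does the arithmetic under two assumptions that the slide does not state: 8 double-precision flops per cycle per Xeon core (AVX) and 16 per Xeon Phi core (512-bit FMA). It also uses 12 cores per Xeon socket, the value that is consistent with the slide's total of 3,120,000 cores.

```python
NODES = 16_000
XEON_SOCKETS = NODES * 2   # 32,000 sockets
PHI_CARDS = NODES * 3      # 48,000 cards

# Cores, clock in GHz, and DP flops per cycle per core.
# The flops-per-cycle values are assumptions, not taken from the slide.
XEON_CORES, XEON_GHZ, XEON_FLOPS_PER_CYCLE = 12, 2.2, 8
PHI_CORES, PHI_GHZ, PHI_FLOPS_PER_CYCLE = 57, 1.1, 16

# cores * GHz * flops/cycle gives Gflop/s; divide by 1e6 for Pflop/s.
xeon_pflops = XEON_SOCKETS * XEON_CORES * XEON_GHZ * XEON_FLOPS_PER_CYCLE / 1e6
phi_pflops = PHI_CARDS * PHI_CORES * PHI_GHZ * PHI_FLOPS_PER_CYCLE / 1e6
peak_pflops = xeon_pflops + phi_pflops
total_cores = XEON_SOCKETS * XEON_CORES + PHI_CARDS * PHI_CORES

linpack_pflops = 30.65   # sustained Linpack figure from the slide
efficiency = linpack_pflops / peak_pflops

print(f"Xeon: {xeon_pflops:.1f} Pflop/s, Xeon Phi: {phi_pflops:.1f} Pflop/s")
print(f"Peak: {peak_pflops:.1f} Pflop/s, cores: {total_cores:,}")
print(f"Linpack efficiency: {efficiency:.1%}")
```

Under these assumptions the coprocessors supply close to 88% of the machine's theoretical peak, and the sustained Linpack result is a little over half of that peak.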
  • 35. A Growing Software Ecosystem: developing today on Intel® Xeon Phi™ coprocessors. Shown at SC’12, November 2012.
  • 36. Introduction; High-level overview of the Intel® Xeon Phi™ platform: Hardware and Software; Intel Xeon Phi Case Studies; Intel Xeon Phi Ecosystem; Conclusions & References
  • 37. Conclusions. Intel® Xeon Phi™ coprocessor advantages: comparable performance potential to other accelerators; faster time to solution due to reduced development effort; better investment protection with a single code base for processors and coprocessors. A flexible, wide range of programming models, from pure native to offloaded and all variants in between, all with the familiar Intel development environment.
  • 38. Intel® Xeon Phi™ Coprocessor Developer Site: http://software.intel.com/mic-developer. One stop shop for: tools & software downloads; getting started development guides; video workshops, tutorials, & events; code samples & case studies; articles, forums, & blogs; associated product links.
  • 40. Legal Disclaimer & Optimization Notice. INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries. Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804