Elijah Charles from Intel presented this deck at the 2016 HPC Advisory Council Switzerland Conference.
"The Exascale computing challenge is the current Holy Grail for high performance computing. It envisages building HPC systems capable of 10^18 floating point operations under a power input in the range of 20-40 MW. To achieve this feat, several barriers need to be overcome. These barriers or “walls” are not completely independent of each other, but present a lens through which HPC system design can be viewed as a whole, and its composing sub-systems optimized to overcome the persistent bottlenecks."
Watch the video presentation: http://wp.me/p3RLHQ-f7X
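As a rough worked example of the target in the abstract above, the sketch below converts the quoted power envelope into the energy efficiency a system would have to sustain. It reads "10^18 floating point operations" as a per-second rate, which is an assumption about the abstract's intent; the function and variable names are illustrative only.

```python
# Energy efficiency implied by the exascale target quoted in the abstract.
EXAFLOP_PER_S = 1e18          # assumed reading: 10^18 floating point operations per second
POWER_BUDGETS_MW = (20, 40)   # power envelope quoted in the abstract, in megawatts


def required_gflops_per_watt(flops_per_s: float, power_mw: float) -> float:
    """Return the sustained GFLOP/s that each watt must deliver."""
    watts = power_mw * 1e6
    return flops_per_s / watts / 1e9


efficiency = {mw: required_gflops_per_watt(EXAFLOP_PER_S, mw) for mw in POWER_BUDGETS_MW}
for mw, gflops_per_watt in efficiency.items():
    print(f"{mw} MW budget -> {gflops_per_watt:.0f} GFLOP/s per watt")
```

Under these assumptions the tighter 20 MW budget needs twice the efficiency of the 40 MW one, which is the pressure that motivates accelerators in the rest of the deck.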
3. Risk Factors
The above statements and any others in this document that refer to plans and expectations for the second quarter, the year and the future are forward-
looking statements that involve a number of risks and uncertainties. Words such as "anticipates," "expects," "intends," "plans," "believes," "seeks,"
"estimates," "may," "will," "should" and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain
events or assumptions also identify forward-looking statements. Many factors could affect Intel's actual results, and variances from Intel's current
expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel
presently considers the following to be important factors that could cause actual results to differ materially from the company's expectations. Demand for
Intel's products is highly variable and could differ from expectations due to factors including changes in business and economic conditions; consumer
confidence or income levels; the introduction, availability and market acceptance of Intel's products, products used together with Intel products and
competitors' products; competitive and pricing pressures, including actions taken by competitors; supply constraints and other disruptions affecting
customers; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers. Intel's gross margin
percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the
timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated
costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; and product manufacturing
quality/yields. Variations in gross margin may also be caused by the timing of Intel product introductions and related expenses, including marketing
expenses, and Intel's ability to respond quickly to technological developments and to introduce new products or incorporate new features into existing
products, which may result in restructuring and asset impairment charges. Intel's results could be affected by adverse economic, social, political and
physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural
disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Results may also be affected by the formal or informal
imposition by countries of new or revised export and/or import and doing-business regulations, which could be changed without prior notice. Intel
operates in highly competitive industries and its operations have high costs that are either fixed or difficult to reduce in the short term. The amount, timing
and execution of Intel's stock repurchase program could be affected by changes in Intel's priorities for the use of cash, such as operational spending,
capital spending, acquisitions, and as a result of changes to Intel's cash flows or changes in tax laws. Product defects or errata (deviations from published
specifications) may adversely impact our expenses, revenues and reputation. Intel's results could be affected by litigation or regulatory matters involving
intellectual property, stockholder, consumer, antitrust, disclosure and other issues. An unfavorable ruling could include monetary damages or an
injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel's ability to design
its products, or requiring other remedies such as compulsory licensing of intellectual property. Intel's results may be affected by the timing of closing of
acquisitions, divestitures and other significant transactions. A detailed discussion of these and other factors that could affect Intel's results is included in
Intel's SEC filings, including the company's most recent reports on Form 10-Q, Form 10-K and earnings release.
Rev. 4/14/15
4. Agenda
• Accelerators: Motivation and Use Cases
• Using Field Programmable Gate Array (FPGA) as an Accelerator
• Intel® Xeon® Processor + FPGA Accelerator Platform
• Hardware and Software Programming Interfaces
• Example Applications
5. Digital Services Economy…
• 50¹ billion devices
• Build out of the cloud: $120B³
• New services: $450B²
1: Sources: AMS Research, Gartner, IDC, McKinsey Global Institute, and various other industry analysts and commentators
2: Source: IDC, 2013. 2016 calculated based on reported CAGR ‘13-’17
3: Source: iDATA / Digiworld, 2013
7. Cloud Economics
Amazon’s TCO Analysis¹
Workload Performance Metrics:
• VMs per System
• Web Transactions / Sec
• Storage Capacity
• Hadoop Queries
1: Source: James Hamilton, Amazon* http://perspectives.mvdirona.com/2010/09/overall-data-center-costs/
Performance / TCO is the key metric
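To make the metric concrete, here is a minimal sketch of a performance-per-TCO comparison. Every number in it is a made-up placeholder chosen for easy arithmetic, not a figure from the deck or from the cited TCO analysis, and the `Config` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Config:
    name: str
    throughput: float  # any workload metric, e.g. web transactions per second
    tco: float         # total cost of ownership over the same period, any currency

    def perf_per_tco(self) -> float:
        """Performance delivered per unit of total cost of ownership."""
        return self.throughput / self.tco


# Placeholder inputs for illustration only; they do not come from the deck.
baseline = Config("CPU only", throughput=1000.0, tco=100.0)
accelerated = Config("CPU + accelerator", throughput=1800.0, tco=120.0)

gain = accelerated.perf_per_tco() / baseline.perf_per_tco()
print(f"Performance/TCO gain: {gain:.2f}x")
```

The point of the ratio is that an accelerator can raise TCO and still win, as long as throughput rises faster than cost for the targeted workload.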
8. Diverse Data Center Demands
Accelerators can increase Performance at lower TCO for targeted workloads
Intel estimates; bubble size is relative CPU intensity
11. Benefits of Reconfigurable Accelerators: Savings in Area / Power
• Can be configured to implement different functions efficiently
- Meeting performance goals for segment
- Saving area and power compared to multiple Fixed Functions
[Figure: cost versus performance. Labels: Software, Programmable Accelerator, Fixed Functions]
12. Benefits of Reconfigurable Accelerators: Meeting Customer Needs for Differentiation
• Workload Optimized Silicon
• Pervasive Analytics & Insights
• Intelligent Resource Orchestration
• Dynamic Resource Pooling
Driving the Digital Service Economy
13. What is a Field Programmable Gate Array (FPGA)?
FPGAs (Field Programmable Gate Arrays) are semiconductor devices that can be programmed
• Desired functionality of the FPGA can be (re-)programmed by downloading a configuration into the device
FPGAs offer several advantages over potential alternatives:
• Lower one-time development cost and faster time to market compared to custom-designed chips (ASICs)
• Ability to implement customer-specific functionality beyond what is available from standard products (ASSPs)
• Customizable and reprogrammable after the device has been deployed to the field, unlike both ASICs and ASSPs
[Figure: FPGA structure, showing logic blocks, interconnect resources, and I/O cells]
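The idea of (re-)programming a device by downloading a configuration can be illustrated with a toy model. The sketch below treats a single logic block as a 2-input lookup table whose truth table is the configuration; this is a deliberate simplification (real logic blocks are larger and are wired together through the interconnect), and the `LookupTable` class is hypothetical.

```python
class LookupTable:
    """Toy model of one FPGA logic block: a 2-input, 4-entry truth table."""

    def __init__(self, config_bits):
        self.program(config_bits)

    def program(self, config_bits):
        """Load a new configuration; this stands in for downloading a bitstream."""
        if len(config_bits) != 4 or any(bit not in (0, 1) for bit in config_bits):
            raise ValueError("expected four bits, one per input combination")
        self.bits = list(config_bits)

    def evaluate(self, a: int, b: int) -> int:
        """Look up the output for inputs a and b (each 0 or 1)."""
        return self.bits[(a << 1) | b]


inputs = [(a, b) for a in (0, 1) for b in (0, 1)]

lut = LookupTable([0, 0, 0, 1])                   # configured as AND
and_outputs = [lut.evaluate(a, b) for a, b in inputs]

lut.program([0, 1, 1, 0])                         # same block, reconfigured as XOR
xor_outputs = [lut.evaluate(a, b) for a, b in inputs]

print(and_outputs, xor_outputs)
```

The same "hardware" object computes a different function after each `program` call, which is the property that separates an FPGA from a fixed-function ASIC or ASSP.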
14. A Complete Solutions Portfolio
Powering your innovation:
• CPLDs: Lowest Cost, Lowest Power
• FPGAs: Cost/Power Balance; SoC & Transceivers
• Mid-range FPGAs; SoC & Transceivers
• FPGAs: Optimized for High Bandwidth
• PowerSoCs: High-efficiency Power Management
Resources:
• Design Software
• Development Kits
• Embedded Soft and Hard Processors
• Intellectual Property (IP): Industrial, Computing, Enterprise
16. OpenCL and FPGAs Address These Challenges
Power efficient acceleration
– Typically 1/5 the power of a GPU and orders of magnitude more performance per watt than a CPU
FPGA lifecycle over 15 years
– GPU lifespan is short
Require re-optimization testing between generations
– FPGA OpenCL code retargeted to future devices without modification
Our OpenCL flow abstracts away the FPGA hardware flow
– Puts the FPGA into software engineers' hands
Our OpenCL SDK allows for streaming I/O channels and kernel channels
– Data movement without host involvement
– Low latency data transmissions to accelerator
Shared virtual memory
– IBM CAPI and Intel QPI
17. More SW Engineering Resources than HW?
1000:1 software engineers to FPGA designers
Software engineers are not used to long compile times
OpenCL Solves This!
Our OpenCL flow abstracts away the FPGA hardware flow, bringing the FPGA to low-level software programmers
Software developers write, optimize and debug in their familiar software environment
Quartus is run behind the scenes
Emulator and profiler are software development tools
Pushing long compile times to the end
OpenCL optimization doesn’t require a board
Allowing SW to drive board requirements (.xml file)
20. Intel® Xeon® E5 + Field Programmable Gate Array Software Development Platform (SDP) Shipping Today
[Figure: platform block diagram. Labels: Intel Xeon Processor E5 Product Family, FPGA, Intel QPI, DDR3 memory channels, PCIe 3.0 x8 links, DMI2]
Processor: Intel Xeon Processor E5
FPGA Module: Altera* Stratix* V
QPI Speed: 6.4 GT/s full width (target 8.0 GT/s at full width)
Memory to FPGA Module: 2 channels of DDR3 (up to 64 GB)
Expansion connector to FPGA Module: PCI Express® (PCIe) 3.0 x8 lanes; may be used for direct I/O, e.g. Ethernet
Features: Configuration Agent, Caching Agent, (optional) Memory Controller
Software: Accelerator Abstraction Layer (AAL) runtime, drivers, sample applications
Software Development for Accelerating Workloads using Intel® Xeon® processors and coherently attached FPGA in-socket
Intel® QuickPath Interconnect (Intel® QPI)
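For a sense of scale, the transfer rates on this slide can be turned into approximate peak bandwidths. The sketch below is a back-of-the-envelope calculation, not a measured figure: it assumes a full-width QPI link carries 2 bytes of payload per transfer in each direction and that PCIe 3.0 runs at 8 GT/s per lane with 128b/130b encoding. Protocol overheads are ignored, and the function names are illustrative.

```python
QPI_PAYLOAD_BYTES_PER_TRANSFER = 2   # assumption: 16 data bits per transfer, full width
PCIE3_GT_PER_S_PER_LANE = 8.0        # PCIe 3.0 signaling rate per lane
PCIE3_ENCODING_EFFICIENCY = 128 / 130


def qpi_gbytes_per_s(gt_per_s: float) -> float:
    """Peak payload bandwidth per direction of a full-width QPI link, in GB/s."""
    return gt_per_s * QPI_PAYLOAD_BYTES_PER_TRANSFER


def pcie3_gbytes_per_s(lanes: int) -> float:
    """Peak bandwidth per direction of a PCIe 3.0 link after line encoding, in GB/s."""
    gbits = PCIE3_GT_PER_S_PER_LANE * PCIE3_ENCODING_EFFICIENCY * lanes
    return gbits / 8


qpi_shipping = qpi_gbytes_per_s(6.4)   # speed listed for the SDP
qpi_target = qpi_gbytes_per_s(8.0)     # target speed listed on the slide
pcie_x8 = pcie3_gbytes_per_s(8)        # expansion connector to the FPGA module

print(f"QPI @ 6.4 GT/s: {qpi_shipping:.1f} GB/s per direction")
print(f"QPI @ 8.0 GT/s: {qpi_target:.1f} GB/s per direction")
print(f"PCIe 3.0 x8:    {pcie_x8:.2f} GB/s per direction")
```

These are upper bounds on raw link capacity; what an accelerator actually sees depends on the coherence protocol, packet framing, and the cache behavior described on the following slides.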
21. System Logical View
• AFUs can access coherent cache on FPGA
• AFUs cannot implement a second-level cache
• Intel® Quick Path Interconnect (Intel® QPI) IP participates in cache coherency with Processors
[Figure: system logical view. Labels: Processor (Cores, LLC), DRAM, QPI, FPGA (Intel QPI IP, Cache, CCI, AFUs), DDR DRAM, Multi-processor Coherence Domain, Cache Access Domain]
22. Intel® Xeon® + Field Programmable Gate Array SDP: Intel® Quick Path Interconnect 1.1 RTL Microarchitecture
• PHY – Implements the Intel QPI PHY 1.1 (Analog/Digital)
• Intel QPI Link Layer – Provides flow control and reliable communication
• Intel QPI Protocol – Implements Intel QPI Cache Agent + Configuration Agent
• Cache Controller – Performs cache hit/miss determination and generates Intel QPI protocol requests
• Cache Tag – Tracks state of cacheline (MESI + internal states for tracking outstanding requests)
• Coherency Table – Programmable table that implements coherency protocol rules
• System Protocol Layer (SPL2) – Implements address translation functionality. Can provide up to 2GB device virtual address space to AFU. SPL2 cannot handle page faults.
• AFU – User designed Accelerator Function Unit
[Figure: Intel QPI FPGA IP block diagram. Labels: QPI interface to pins, QPI PHY, Rx Align, Tx Align, Rx Control, Tx Control, QPI Link/Protocol Control, Cache Controller, Cache Data, Cache Tag, Cache Table, Rx, Tx, CCI-S, SPL2 (address translation), CCI-E, 640 bits (marked twice), User: Accelerator Function Unit (AFU)]
Intel® QuickPath Interconnect (Intel® QPI)
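The SPL2 bullet above describes address translation over a bounded device virtual address space with no page-fault handling. A minimal sketch of what that implies for software is shown below. Only the 2 GB limit and the absence of page-fault handling come from the slide; the 2 MB page size, the flat table layout, and the `TranslationTable` class are assumptions made purely for illustration and do not describe the actual SPL2 design.

```python
PAGE_SIZE = 2 * 1024 * 1024        # assumed page size (hypothetical, not from the slide)
VA_SPACE = 2 * 1024 ** 3           # 2 GB device virtual address space, from the slide
NUM_PAGES = VA_SPACE // PAGE_SIZE  # 1024 entries under these assumptions


class TranslationTable:
    """Flat table mapping device virtual pages to physical page base addresses."""

    def __init__(self):
        self.entries = [None] * NUM_PAGES

    def map_page(self, virtual_page: int, physical_base: int) -> None:
        """Install a mapping; the host must do this before the AFU touches the page."""
        if physical_base % PAGE_SIZE:
            raise ValueError("physical base must be page aligned")
        self.entries[virtual_page] = physical_base

    def translate(self, device_va: int) -> int:
        """Translate a device virtual address to a physical address."""
        if not 0 <= device_va < VA_SPACE:
            raise ValueError("address outside the device virtual address space")
        page, offset = divmod(device_va, PAGE_SIZE)
        base = self.entries[page]
        if base is None:
            # No page-fault handling: an unmapped access is an error, not a
            # request to the operating system to bring the page in.
            raise LookupError("unmapped page")
        return base + offset


table = TranslationTable()
table.map_page(3, 0x4000_0000)
physical = table.translate(3 * PAGE_SIZE + 0x10)
print(hex(physical))
```

The practical consequence is that a buffer shared with the AFU has to be allocated and mapped up front, since nothing can resolve a missing translation on demand.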
25. Programming Interfaces
Standard Programming Interfaces: AAL and CCI
[Figure: programming interfaces. Labels: CPU, Host Application, Accelerator Abstraction Layer, Service API, Virtual Memory API, Physical Memory API, Intel QPI, Field Programmable Gate Array, Intel QPI/KTI Link, Protocol, & PHY, CCI¹ standard, Addr Translation, CCI¹ extended, Accelerator Function Units (AFU)]
Programming interfaces will be forward compatible from SDP² to future MCP³ solutions
Simulation Environment available for development of SW and RTL⁴
1. Coherent Cache Interface 2. Software Development Platform 3. Multi-chip package 4. Register Transfer Level
Intel® QuickPath Interconnect (Intel® QPI)
26. Programming Interfaces: OpenCL™
[Figure: OpenCL programming interfaces. Labels: CPU, OpenCL Application, OpenCL™ Host Code, OpenCL RunTime, Accelerator Abstraction Layer, Service API, Virtual Memory API, Physical Memory API, System Memory, Intel QPI/PCI Express®, Field Programmable Gate Array, CCI Standard, CCI Extended, VirtMem, CFG, OpenCL Kernels, OpenCL Kernel Code]
Unified application code abstracted from the hardware environment
Portable across generations and families of CPUs and FPGAs
Intel® QuickPath Interconnect (Intel® QPI)
28. Example Usage: Deep Learning Framework for Visual Understanding
cluster, node, device, primitives
[Figure: CNN accelerator block diagram. Labels: DMA, Weights, Inputs, Outputs, Processing Tile 0, Processing Tile 1, Processing Tile ‘n’, PE, Read/Write Reg Access, Control State Machine, IP Registers, CCI Interface, SRAM Controller]
CNN (Convolutional Neural Network) function accelerated on FPGA:
Power-performance of CNN classification boosted up to 2.2X†
†Source: Intel Measured (Intel® Xeon® processor E5-2699v3 results); Altera Estimated (4x Arria-10 results).
2S Intel® Xeon® E5-2699v3 + 4x GX1150 PCI Express® cards. Most computations executed on Arria-10 FPGAs; 2S Intel Xeon E5-2699v3 host assumed to be near idle, doing misc. networking/housekeeping functions.
Arria-10 results estimated by Altera with Altera custom classification network. 2x Intel Xeon E5-2699v3 power estimated @ 139W while doing "housekeeping" for GX1150 cards based on Intel measured microbenchmark. In order to sustain ~2400 img/s we need an I/O bandwidth of ~500 MB/s, which can be supported by a 10GigE link and software stack.
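The footnote's I/O claim (about 2400 img/s needing about 500 MB/s, within reach of a 10GigE link) can be sanity-checked with simple arithmetic. The sketch below assumes decimal megabytes and compares against the raw 10 Gbit/s line rate, ignoring Ethernet and software-stack overhead; the variable names are illustrative.

```python
IMAGES_PER_S = 2400                 # sustained classification rate from the footnote
IO_MB_PER_S = 500                   # required I/O bandwidth from the footnote
TEN_GIGE_MB_PER_S = 10_000 / 8      # raw 10 Gbit/s line rate expressed in MB/s

# Average data moved per image implied by the two footnote figures.
kb_per_image = IO_MB_PER_S * 1000 / IMAGES_PER_S

# Fraction of the raw 10GigE line rate that the required bandwidth consumes.
link_utilization = IO_MB_PER_S / TEN_GIGE_MB_PER_S

print(f"~{kb_per_image:.0f} KB per image, {link_utilization:.0%} of a 10GigE link")
```

Under these assumptions the workload moves roughly 208 KB per image and uses well under half of the raw link rate, which is consistent with the footnote's statement that a 10GigE link suffices.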
29. Example Usage: Genomics Analysis Toolkit
• HaplotypeCaller (PairHMM)
• BWA mem (Smith-Waterman)
PairHMM function accelerated on FPGA:
Power-performance of pHMM boosted up to 3.8X†
†pHMM algorithm performance is measured in millions of cell updates per second (MCUP/s).
Performance projections: CPU performance: 1 core of an Intel® Xeon® processor E5-2680v2 @ 2.8 GHz delivers 2101.1 MCUP/s measured; estimated value assumes linear scaling to 10 cores on Xeon E5-2680v2 @ 2.8 GHz & 115W TDP. FPGA performance: 1 FPGA PE (Processing Engine) delivers 408.9 MCUP/s @ 200 MHz measured; estimated value assumes linear scaling to 32 PEs and 90% frequency scaling on Stratix-V A7 400 MHz based on RTL synthesis results (35W TDP). Intel estimated based on 1S Xeon E5-2680v2 + 1 Stratix-V A7 with QPI 1.1 @ 6.4 GT/s full width using Intel® QuickAssist FPGA System Release 3.3, ICC (CPU is essentially idle when workload is offloaded to the FPGA).
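The projection in the footnote can be retraced step by step. The sketch below is one possible reading of it, not the authors' own calculation: it takes "90% frequency scaling" to mean that doubling the clock from 200 MHz to 400 MHz yields 90% of the ideal 2x gain, and it compares the two devices at their stated TDPs. Under that reading the ratio comes out near 3.7X, close to but not exactly the quoted "up to 3.8X", so the original accounting presumably differs in some detail the footnote does not spell out.

```python
# CPU side: measured single-core rate, linearly scaled to 10 cores.
CPU_MCUPS_PER_CORE = 2101.1
CPU_CORES = 10
CPU_TDP_W = 115

# FPGA side: measured single-PE rate at 200 MHz, scaled to 32 PEs at 400 MHz.
FPGA_MCUPS_PER_PE = 408.9
FPGA_PES = 32
CLOCK_RATIO = 400 / 200
FREQ_SCALING_EFFICIENCY = 0.9   # assumed meaning of "90% frequency scaling"
FPGA_TDP_W = 35

cpu_mcups = CPU_MCUPS_PER_CORE * CPU_CORES
fpga_mcups = FPGA_MCUPS_PER_PE * FPGA_PES * CLOCK_RATIO * FREQ_SCALING_EFFICIENCY

cpu_mcups_per_watt = cpu_mcups / CPU_TDP_W
fpga_mcups_per_watt = fpga_mcups / FPGA_TDP_W
power_performance_ratio = fpga_mcups_per_watt / cpu_mcups_per_watt

print(f"CPU:  {cpu_mcups:.0f} MCUP/s, {cpu_mcups_per_watt:.1f} MCUP/s per watt")
print(f"FPGA: {fpga_mcups:.0f} MCUP/s, {fpga_mcups_per_watt:.1f} MCUP/s per watt")
print(f"Power-performance ratio: {power_performance_ratio:.2f}x")
```

Note that raw throughput is similar on both sides in this reading; nearly all of the advantage comes from the lower FPGA power figure.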
30. Intel® Xeon® + FPGA¹ in the Cloud: Vision
[Figure: cloud deployment flow. Labels: Cloud Users, Launch workload, Workload, Orchestration Software, Place workload, Static/dynamic FPGA programming, Workload accelerators, Intel® Xeon® + FPGA, Software Defined Infrastructure, Resource Pool (Storage, Network, Compute), IP Library (Intel Developed IP, 3rd party Developed IP, FPGA Vendor Developed IP, End User Developed IP)]
1: Field Programmable Gate Array (FPGA)
34. Intel Architecture Vision for Software: Code Once – Run Anywhere
[Figure: Labels: Software Library, Processor Instruction, Discrete Accelerator, Integrated Accelerator]
Consistent programming model for all accelerators
35. Additional Sources of Information
• A PDF of this presentation is available from our Technical Session Catalog: www.intel.com/idfsessionsSF.
• Intel® Xeon Phi™ coprocessor resources: software.intel.com/mic-developer
• Network Compression resources: intel.com/quickassist
• Media Transcoding resources: software.intel.com/intel-media-server-studio
• Storage Cryptography resources: software.intel.com/storage
• FPGA: Please see demo in Altera* booth in the demo showcase