HPC DAY 2017 - http://www.hpcday.eu/
Accelerating tomorrow's HPC and AI workflows with Intel Architecture
Atanas Atanasov | HPC solution architect, EMEA region at Intel
4. Agenda
• Challenges in HPC/AI and SSF
• Compute: Xeon Scalable Family
• Fabric: Omni-Path
• Storage: Optane
• AI: Nervana
5. HPC is Foundational to Insight
Aerospace | Biology | Brain Modeling | Chemistry/Chemical Engineering | Climate | Computer Aided Engineering | Cosmology | Cybersecurity | Defense
Pharmacology | Particle Physics | Metallurgy | Manufacturing/Design | Life Sciences | Government Lab | Geosciences/Oil & Gas | Genomics | Fluid Dynamics
Digital Content Creation | EDA | Economics/Financial Services | Fraud Detection
Social Sciences; Literature, Linguistics, Marketing | University Academic | Weather
Business Innovation
High ROI: $515 average return per $1 of HPC investment¹
A New Science Paradigm
Data-Driven Analytics joins Theory, Experimentation, and Computational Science
Fundamental Discovery
Advancing science and our understanding of the universe
1 Source: IDC HPC and ROI Study Update (September 2015)
2 Source: IDC 2015 Q1 Worldwide x86 Server Tracker vs. IDC 2015 Q1 Worldwide HPC Server Tracker
6. Growing Challenges in HPC
System Bottlenecks ("The Walls"): Memory | I/O | Storage | Energy Efficient Performance | Space | Resiliency | Unoptimized Software
Divergent Infrastructure: Resources split among Modeling and Simulation | Big Data Analytics | Machine Learning | Visualization
Barriers to Extending Usage: Democratization at Every Scale | Cloud Access | Exploration of New Parallel Programming Models
[Diagram labels: HPC Optimized | Big Data | HPC | Machine Learning | Visualization]
7. What Makes a Great HPC Solution?
Parallel File System | Switch Fabric
Login and Management Nodes
Actual configurations depend on specific OEM offerings and implementation.
Intel® Omni-Path Fabric
1GbE for
administration
IBA
10/40 GbE
Networking
Gateways
Intel® Software Tools
Intel® Parallel Studio
Intel® Node Manager
Intel® Trace Analyzer
I/O Nodes
Intel® Networking
Intel® Omni-Path Fabric
Intel® Silicon Photonics
Burst Buffer
Intel® Xeon® Processors
Intel® Omni-Path Fabric
Intel® Optane™
Technology
Compute Nodes
Intel® Compute
Intel® Xeon Phi™ Processors
Intel® Xeon® Processors
Intel® Optane™ Technology
Intel® Omni-Path Fabric
Intel® Solutions for Lustre*
Intel® Enterprise Edition for Lustre*
Intel® Foundation Edition for Lustre*
Intel® Cloud Edition for Lustre*
Reference Architecture
Intel® Cluster Ready
Intel® Scalable
System Framework
8. A Holistic Architectural Approach is Required
Compute
Memory
Fabric
Storage
PERFORMANCE | CAPABILITY
TIME
System
Software
Innovative Technologies Tighter Integration
Application
Modernized Code
Community
ISV
Proprietary
System
Memory
Cores
Graphics
Fabric
FPGA
I/O
9. Intel® Scalable System Framework
A Holistic Design Solution for All HPC Needs
Small Clusters Through Supercomputers
Compute and Data-Centric Computing
Standards-Based Programmability
On-Premise and Cloud-Based
Intel® Xeon® Processors
Intel® Xeon Phi™ Processors
Intel® Xeon Phi™ Coprocessors
Intel® Server Boards and Platforms
Intel® Solutions for Lustre*
Intel® Optane™ Technology
3D XPoint™ Technology
Intel® SSDs
Intel® Omni-Path Architecture
Intel® True Scale Fabric
Intel® Ethernet
Intel® Silicon Photonics
HPC System Software Stack
Intel® Software Tools
Intel® Cluster Ready Program
Intel Supported SDVis
Compute Memory/Storage
Fabric Software
Intel Silicon
Photonics
11. Intel® Xeon® Scalable platform
The foundation of Data Center Innovation: Agile & Trusted Infrastructure
Delivers 1.65x average performance boost over prior generation¹
1 Up to 1.65x Geomean based on Normalized Generational Performance going from Intel® Xeon® processor E5-26xx v4 to Intel® Xeon® Scalable processor (estimated based on Intel internal testing of OLTP
Brokerage, SAP SD 2-Tier, HammerDB, Server-side Java, SPEC*int_rate_base2006, SPEC*fp_rate_base2006, Server Virtualization, STREAM* triad, LAMMPS, DPDK L3 Packet Forwarding, Black-Scholes, Intel
Distribution for LINPACK).
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer
systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating
your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance Intel does not control or audit the design or
implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are
reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase.
Performance
Pervasive through compute,
storage, and network
Agility
Rapid service delivery
Security
Pervasive data security with near
zero performance overhead
12. Typical 2-socket configuration
Intel Xeon E5 v4 (2016):
• Four DDR4 memory channels, up to 24 DIMMs
• Up to 80 PCIe lanes
• Two QPI links (up to 9.6 GT/s)
Intel Xeon Scalable, Purley platform (2017):
• Six DDR4 memory channels, up to 24 DIMMs
• Up to 96 PCIe lanes
• Two UPI links (up to 10.4 GT/s); up to 3 UPI links in 4S and 8S configurations
• Integrated Intel® Omni-Path Architecture (Fabric)
[Diagram, Intel Xeon E5 v4: two CPUs linked by Intel® QPI, with DDR4 DIMMs, PCIe* (x8, x4) and DMI 2]
[Diagram, Intel Xeon Scalable: two CPUs linked by Intel® UPI, each with DDR4 DIMMs, 3x16 PCIe* and 1x100G Intel® OP Fabric; LBG attached via DMI and an x4 uplink**]
** PCIe* uplink connection for Intel® QuickAssist Technology and Intel® Ethernet
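The move from four to six DDR4 channels can be turned into a rough peak-bandwidth estimate. The sketch below is a back-of-the-envelope calculation, not an Intel figure. It assumes 64-bit (8-byte) channels, DDR4-2400 for the Xeon E5 v4 (an assumption, the slide gives no memory speed) and DDR4-2666 for Xeon Scalable (the speed listed in the slide 19 test configuration).

```python
# Peak theoretical DDR4 bandwidth per socket: channels x transfer rate x bus width.
# Assumptions (not stated on this slide): 8-byte channels, DDR4-2400 on Xeon E5 v4.

BYTES_PER_TRANSFER = 8  # one 64-bit DDR4 channel moves 8 bytes per transfer


def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int) -> float:
    """Return peak bandwidth for one socket in GB/s (10^9 bytes per second)."""
    return channels * mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER / 1e9


e5_v4 = peak_bandwidth_gbs(channels=4, mega_transfers_per_s=2400)
scalable = peak_bandwidth_gbs(channels=6, mega_transfers_per_s=2666)
channel_ratio = 6 / 4  # effect of the channel count alone

print(f"Xeon E5 v4:    {e5_v4:.1f} GB/s per socket")
print(f"Xeon Scalable: {scalable:.1f} GB/s per socket")
print(f"Channel-count ratio alone: {channel_ratio:.2f}x")
print(f"Ratio including the assumed speed difference: {scalable / e5_v4:.2f}x")
```

The channel count alone accounts for a 1.5x ratio; any additional gain depends on the DIMM speeds actually installed.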
14. Designed for next-generation Data Centers
Ring Architecture (2009-2017+) | Mesh Architecture (new in 2017)
• Maximizes performance
• Enables consistent, low latencies
• Optimized for data sharing and memory access between all CPU cores/threads for ideal memory bandwidth and capacity
• Data flows scale efficiently for 2, 4 & 8+ socket configurations
• Designed for modern virtualized and hybrid cloud implementations
15. Re-Architected L2 & L3 Cache Hierarchy
Previous Architectures: private L2 of 256KB per core; shared L3 of 2.5MB/core (inclusive)
Intel® Xeon® Scalable Processor Architecture: private L2 of 1MB per core; shared L3 of 1.375MB/core (non-inclusive)
• On-chip cache balance shifted from shared-distributed (prior architectures) to private-local (Skylake architecture):
  • Shared-distributed: shared-distributed L3 is primary cache
  • Private-local: private L2 becomes primary cache with shared L3 used as overflow cache
• Shared L3 changed from inclusive to non-inclusive:
  • Inclusive (prior architectures): L3 has copies of all lines in L2
  • Non-inclusive (Skylake architecture): lines in L2 may not exist in L3
Skylake-SP cache hierarchy architected specifically for Datacenter use case
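The inclusive versus non-inclusive distinction affects how much distinct data the caches can hold per core. The sketch below is an idealized illustration using the per-core sizes from this slide. It assumes that an inclusive L3 duplicates every L2 line and that a non-inclusive L3 duplicates none, which is an upper bound rather than a measured behavior. The function name is hypothetical.

```python
# Idealized per-core cache capacity (MB) for distinct cache lines.
# Assumption: inclusive L3 duplicates all of L2; non-inclusive L3 duplicates none.


def unique_capacity_mb(l2_mb: float, l3_mb: float, inclusive: bool) -> float:
    """Upper bound on distinct data held in L2 + L3 for one core's share."""
    if inclusive:
        # Every L2 line also lives in L3, so L2 adds no unique capacity.
        return l3_mb
    # L2 and L3 may hold different lines, so capacities add up (best case).
    return l2_mb + l3_mb


prior = unique_capacity_mb(l2_mb=0.25, l3_mb=2.5, inclusive=True)
skylake = unique_capacity_mb(l2_mb=1.0, l3_mb=1.375, inclusive=False)

print(f"Prior architectures:  {prior} MB unique per core, 0.25 MB of it private")
print(f"Skylake architecture: {skylake} MB unique per core, 1.0 MB of it private")
```

Under these assumptions the total per-core capacity stays in the same range, while the private share grows from 256KB to 1MB, which matches the slide's point that L2 becomes the primary cache.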
16. Intel® Xeon® Scalable Processors for Technical Computing (HPC)
Powerful and balanced performance for diverse HPC workloads
Powerful performance
• Up to 28 cores vs. 24 cores/22 cores (on Intel® Xeon® processor E7 v4 / Intel Xeon processor E5-2600 v4 families)
• Intel® AVX-512 delivers up to 2X FLOPs/clock-cycle peak performance capability optimized for HPC, data analytics, and cryptography workloads¹
• New Intel® Mesh architecture with 3 Intel® Ultra Path Interconnect links provides greater inter-CPU bandwidth for the most data-hungry, latency-sensitive applications
Significantly increased memory and I/O bandwidth
• Up to 1.5x gen-to-gen memory bandwidth increase per CPU (6 memory channels) for extremely large compute- and data-intensive workloads
• More I/O bandwidth with 48 PCIe 3.0 lanes vs. 40 lanes on Intel Xeon processor E5-2600 v4
• Intel® Optane™ and Intel® 3D NAND solid state drives deliver industry-leading combination of high throughput, low latency, high quality of service (QoS), and ultra high endurance⁶ to break data access bottlenecks
Integrated interconnect for compelling efficiency
• Integrated Intel® Omni-Path Architecture designed for today's HPC systems
• Provides 100Gbps high-bandwidth and low-latency fabric for HPC clusters
• Reduces number of required switches and lowers fabric costs⁷, freeing up budget for up to 24% more compute nodes⁸
• Denser 48-port switch chip delivers a 33 percent increase⁹ in port count over traditional 36-port InfiniBand switch chips, resulting in power, space and maintenance savings
Converged parallel programming environment for Intel® Xeon® Scalable processors & Intel® Xeon Phi™ processors
• Highly integrated portfolio of superior technologies and optimized software tools ensures code portability across IA solutions
• Intel AVX-512 enables converged programming environment for Intel Xeon Scalable Processor and Intel® Xeon Phi™ Processor compute nodes
• Intel® Modern Code Developer Program enables the next decade of discovery
• Intel® Parallel Studio XE 2017 upgrades developer toolkit for HPC and technical computing
• Intel® HPC Orchestrator simplifies installation and ongoing maintenance of HPC system software stack
For footnotes and configurations, see slides 29-30.
17. Intel® Advanced Vector Extensions 512 (AVX-512)
End Customer Value: Workload-optimized performance, throughput increases, and H/W-enhanced security improvements for familiar analytics, HPC, video transcode, cryptography, and compression software.
Problems Solved:
1. Achieve more work per cycle (doubles width of data registers)
2. Minimize latency & overhead (doubles the number of registers) with ultra-wide (512-bit) vector processing capabilities (note that 2x FMA processing engines are available on Intel® Xeon® Platinum and Intel® Xeon® Gold Processors)
Value pillars: performance | security
Proof points:
• Up to 2x FLOPS/clock cycle¹
• Up to 4x greater throughput²
Accelerates performance for your most demanding computational tasks
Segments: Cloud Service Providers | Comms Service Providers | Enterprise
* FLOPs = Floating Point Operations
1 Peak performance vs. Intel® AVX2. As measured by Intel® Xeon® Processor Scalable Family with Intel® AVX-512 compared to an Intel® Xeon® E5 v4 with Intel® AVX2
2 Vectorized floating-point throughput. As measured by Intel® Xeon® Processor Scalable Family with Intel® AVX-512 compared to an Intel® Xeon® E5 v4 with Intel® AVX2
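The "up to 2x FLOPS/clock cycle" claim follows from the register width. The sketch below counts peak double-precision FLOPs per core per cycle. It assumes two FMA units per core for both AVX2 and AVX-512 (the slide confirms two FMA engines only for Platinum and Gold parts) and counts each fused multiply-add as two floating-point operations. The function name is hypothetical.

```python
# Peak double-precision FLOPs per core per clock cycle for a SIMD width.
# Assumptions: 64-bit doubles, one FMA = 2 FLOPs, FMA unit count passed in.


def peak_dp_flops_per_cycle(vector_bits: int, fma_units: int) -> int:
    """Lanes per vector x 2 FLOPs per FMA x number of FMA units."""
    lanes = vector_bits // 64  # doubles that fit in one vector register
    return lanes * 2 * fma_units


avx2 = peak_dp_flops_per_cycle(vector_bits=256, fma_units=2)
avx512 = peak_dp_flops_per_cycle(vector_bits=512, fma_units=2)
avx512_single_fma = peak_dp_flops_per_cycle(vector_bits=512, fma_units=1)

print(f"AVX2, 2 FMA units:    {avx2} DP FLOPs/cycle")
print(f"AVX-512, 2 FMA units: {avx512} DP FLOPs/cycle")
print(f"AVX-512, 1 FMA unit:  {avx512_single_fma} DP FLOPs/cycle")
print(f"AVX-512 vs AVX2 peak ratio: {avx512 / avx2:.0f}x")
```

With a single FMA unit the AVX-512 peak equals the two-unit AVX2 peak, which is why the number of FMA engines matters for the 2x figure.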
19. Performance and Efficiency with Intel® AVX-512
[Figure: LINPACK performance by instruction set, with GFLOPs/Watt and GFLOPs/GHz normalized to SSE4.2]
Instruction set | GFLOPs | Power (W) | Frequency (GHz) | GFLOPs/Watt (normalized) | GFLOPs/GHz (normalized)
SSE4.2 | 669 | 760 | 3.1 | 1.00 | 1.00
AVX | 1178 | 768 | 2.8 | 1.74 | 1.95
AVX2 | 2034 | 791 | 2.5 | 2.92 | 3.77
AVX512 | 3259 | 767 | 2.1 | 4.83 | 7.19
Intel® AVX-512 delivers significant performance and efficiency gains
Source as of June 2017: Intel internal measurements on platform with Xeon Platinum 8180, Turbo enabled, UPI=10.4, SNC1, 6x32GB DDR4-2666 per CPU, 1 DPC.
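The two normalized efficiency series on this slide can be recomputed from the raw LINPACK series. The sketch below uses only the GFLOPs, power, and frequency values shown in the chart; the variable and function names are illustrative.

```python
# Recompute the normalized efficiency figures from the raw chart values.
# Each entry: instruction set -> (GFLOPs, system power in W, core frequency in GHz).
LINPACK = {
    "SSE4.2": (669, 760, 3.1),
    "AVX": (1178, 768, 2.8),
    "AVX2": (2034, 791, 2.5),
    "AVX512": (3259, 767, 2.1),
}


def normalized(metric, baseline="SSE4.2"):
    """Apply metric to every entry and divide by the baseline entry's value."""
    base = metric(*LINPACK[baseline])
    return {name: metric(*values) / base for name, values in LINPACK.items()}


per_watt = normalized(lambda gflops, watts, ghz: gflops / watts)
per_ghz = normalized(lambda gflops, watts, ghz: gflops / ghz)

for name in LINPACK:
    print(f"{name:7s} GFLOPs/Watt {per_watt[name]:.2f}  GFLOPs/GHz {per_ghz[name]:.2f}")
```

System power stays within a narrow band (760 to 791 W) while frequency drops from 3.1 to 2.1 GHz, so most of the efficiency gain comes from more work per cycle rather than from lower power draw.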
21. Intel® Omni-Path Architecture in 30 secs
The Interconnect Landscape: Why Intel® OPA?
Performance: I/O struggling to keep up with CPU innovation
Increasing Scale: From 10K nodes... to 200K+. Previous solutions reaching limits of scalability, manageability and reliability
Fabric: Cluster Budget¹: Fabric an increasing % of HPC hardware costs. Today: 20%-30%. Tomorrow: 30 to 40%
[Diagram: cluster built from scalable units SU01 to SU18]
Goal: Keep cluster costs in check; maximize COMPUTE power per dollar
1 Source: Internal analysis based on 256-node to 2048-node clusters configured with Mellanox FDR and EDR InfiniBand products. Mellanox component pricing from www.kernelsoftware.com. Prices as of November 3, 2015. Compute node pricing based on Dell PowerEdge R730 server from www.dell.com. Prices as of May 26, 2015. Intel® OPA (x8) utilizes a 2-1 over-subscribed Fabric. Intel® OPA pricing based on estimated reseller pricing using projected Intel MSRP pricing on day of launch.
22. Intel® Omni-Path Architecture
The Future of High Performance Fabrics: HPC's Next-Generation Fabric
Better Scaling vs EDR
• 48 Radix Chip Ports
• Up to 26% More Servers than InfiniBand* EDR within the Same Budget¹
• Up to 60% Lower Power and Cooling Costs²
Configurable / Resilient
• Job Prioritization (Traffic Flow Optimization)
• No-Compromise Resiliency (Packet Integrity Protection and Dynamic Lane Scaling)
Market Adoption
• >100 OEM and HPC Storage Vendor Offerings Expected for Platforms, Switches, and Adapters³
1. Assumes a 750-node cluster, and number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configuration. Intel® OPA uses one fully-populated 768-port director switch, and Mellanox EDR solution uses a combination of 648-port director switches and 36-port edge switches. Mellanox component pricing from www.kernelsoftware.com, with prices as of November 3, 2015. Compute node pricing based on Dell PowerEdge R730 server from www.dell.com, with prices as of May 26, 2015. Intel® OPA pricing based on estimated reseller pricing based on Intel MSRP pricing on ark.intel.com.
2. Assumes a 750-node cluster, and number of switch chips required is based on a full bisectional bandwidth (FBB) Fat-Tree configuration. Intel® OPA uses one fully-populated 768-port director switch, and Mellanox EDR solution uses a combination of director switches and edge switches. Mellanox power data based on Mellanox CS7500 Director Switch, Mellanox SB7700/SB7790 Edge switch, and Mellanox ConnectX-4 VPI adapter card installation documentation posted on www.mellanox.com as of November 1, 2015. Intel OPA power data based on product briefs posted on www.intel.com as of November 16, 2015. Intel® OPA pricing based on estimated reseller pricing based on Intel MSRP pricing on ark.intel.com.
3. Intel internal information. Design win count based on OEM and HPC storage vendors who are planning to offer either Intel-branded or custom switch products, along with the total number of OEM platforms that are currently planned to support custom and/or standard Intel® OPA adapters. Design win count as of November 1, 2015 and subject to change without notice based on vendor product plans.
*Other names and brands may be claimed as property of others.
23. Intel® Omni-Path Architecture: HPC's Next-Generation Fabric
[Figure: switch chips required (0 to 600) vs. number of nodes, Intel® OPA 48-port switch compared with InfiniBand* 36-port switch; annotation: fewer switches required]
• 26% More Servers than EDR¹
• 60% Lower Cooling Costs²
• 2.3X Greater Fabric Scalability³
1, 2. Same assumptions as footnotes 1 and 2 of slide 22.
3. Number of switch chips required, switch density, and fabric scalability are based on a full bisectional bandwidth (FBB) Fat-Tree configuration, using a 48-port switch for Intel® Omni-Path Architecture and 36-port switch ASIC for either Mellanox or Intel® True Scale Fabric. 2.3X fabric scalability based on a 27,648-node cluster configured with the Intel® Omni-Path Architecture using 48-port switch ASICs, as compared with a 36-port switch chip that can support up to 11,664 nodes.
*Other names and brands may be claimed as the property of others.
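The node counts in the scalability footnote (27,648 and 11,664) match the standard capacity of a three-tier full-bisection fat tree built from k-port switches, which is k³/4 end nodes. The footnote does not state the tier count, so the three-tier assumption below is an inference from those numbers, and the function name is hypothetical.

```python
# Maximum end nodes in a three-tier full-bisection-bandwidth fat tree
# built from identical k-port switch chips: k^3 / 4.
# Assumption: three tiers (not stated in the slide, inferred from its numbers).


def max_nodes_three_tier_fat_tree(radix: int) -> int:
    """Each edge switch uses half its ports for nodes, half for uplinks."""
    return radix ** 3 // 4


opa_nodes = max_nodes_three_tier_fat_tree(48)  # Intel OPA 48-port switch chip
ib_nodes = max_nodes_three_tier_fat_tree(36)   # 36-port InfiniBand switch chip

print(f"48-port chips: up to {opa_nodes} nodes")
print(f"36-port chips: up to {ib_nodes} nodes")
print(f"Scalability ratio: {opa_nodes / ib_nodes:.2f}x")
```

Because capacity grows with the cube of the radix, a 33% increase in ports per chip yields more than a 2x increase in maximum fabric size.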
24. Intel® Omni-Path Architecture: Maximizing Support for Heterogeneous Clusters
[Diagram: node types attached to the fabric through an HFI or a WFR HFI PCI card: Intel Xeon Processor (HSW, BDW & SKL), Xeon Phi™ Processor (KNL), Xeon Phi™ Processor-F (KNL-F), Intel Xeon Processor-F (SKL-F), and Intel Xeon Processor (SKL) with GPUs and GPU memory on the PCI bus]
GPU Direct v3 provided in Intel® OPA 10.3 release
Greater flexibility for creating compute islands depending on user requirements
25. Intel® Omni-Path Architecture
Next Up for Intel® OPA: Artificial Intelligence
Intel offers a complete AI Portfolio
• From CPUs to software to computer vision to libraries and tools
Intel® OPA offers breakthrough performance on scale-out apps
• Low latency
• High bandwidth
• High message rate
• GPU Direct RDMA support
• Xeon Phi Integration
[Diagram labels: Things & devices | Cloud | Data Center | Accelerant Technologies]
World-class interconnect solution for shorter time to train
26. Intel® Omni-Path Architecture: NVMe* over OPA
Intel® OPA + Intel® SSD and Optane™ Technology
• High Endurance
• Low latency
• High Efficiency
• Complete NVMe over Fabric Solution
NVMe-over-OPA status
• Supported in 10.4.3 IFS release
• Compliant with NVMeF spec 1.0
[Diagram: Host and Target stacks built from NVMe Host Driver, NVMe Target Driver, RDMA Transport, PCIe Transport, Intel® OPA HFI, and NVMe Storage]
~1.5M 4k Random IOPS
99% Bandwidth Efficiency
Only Intel is delivering a total NVMe over Fabric solution!
Target and Host system configuration: 2 x Intel® Xeon® CPU E5-2699 v3 @ 2.30GHz, Intel® Server Board S2600WT, 128GB DDR4, CentOS 7.3.1611, kernel 4.10.12, IFS 10.4.1, NULL-BLK, FIO 2.19, options hfi1 krcvqs=8 sge_copy_mode=2 wss_threshold=70
*Other names and brands may be claimed as the property of others.
28. Tighter System-Level Integration
Innovative Memory-Storage Hierarchy
Today:
• Compute node: Processor Caches; Local Memory (on the memory bus)
• I/O node: Local Storage
• Remote storage: Parallel File System (Hard Drive Storage)
Future:
• Compute: Caches; On-Package High Bandwidth Memory*; Local Memory; Intel® DIMMs based on 3D XPoint™ Technology; SSD Storage (Intel® Optane™ Technology SSDs)
• Burst Buffer Node with Intel® Optane™ Technology SSDs
• Parallel File System (Hard Drive Storage)
What changes:
• Local memory is now faster & in processor package
• Much larger memory capacities keep data in local memory
• I/O Node storage moves to compute node
• Some remote data moves onto I/O node
Higher Bandwidth. Lower Latency and Capacity
*cache, memory or hybrid mode
29. Bridging the Memory-Storage Gap
Intel® Optane™ Technology Based on 3D XPoint™
SSD
• Intel® Optane™ SSDs: 5-7x Current Flagship NAND-Based SSDs (IOPS)¹
Intel® DIMMs Based on 3D XPoint™
• DRAM-like performance
• 1,000x Faster than NAND¹
• 1,000x the Endurance of NAND²
• Hard drive capacities
• 10x More Dense than Conventional Memory³
1 Performance difference based on comparison between 3D XPoint™ Technology and other industry NAND
2 Endurance difference based on comparison between 3D XPoint™ Technology and other industry NAND
3 Density difference based on comparison between 3D XPoint™ Technology and other industry DRAM
30. Intel® Optane™ SSDs for Data Center
Breakthrough Performance (IOPS) | Predictably Fast Service (QoS) | Responsive Under Load (Low Latency) | Ultra-high Endurance
Technology claims are based on comparisons of latency, density and write cycling metrics amongst memory technologies recorded on published specifications of in-market memory products against internal Intel specifications. Intel® Optane™ SSD prototype compared to the Intel® SSD DC P3700 Series (NAND)
31. Intel® Optane™ SSD Use Cases
Fast Storage and Cache: Intel® Xeon® with DRAM; Intel® Optane™ SSD and Intel® 3D NAND SSDs attached over PCIe*
Extend Memory: Intel® Xeon® with a 'memory pool' made of DRAM (DDR) and Intel® Optane™ SSD (PCIe); Intel® 3D NAND SSDs attached over PCIe
*Other names and brands may be claimed as the property of others
32. Breakthrough Performance
5-8x faster at low Queue Depths¹
Vast majority of applications generate low QD storage workloads
1. Common Configuration: Intel 2U Server System, OS CentOS 7.2, kernel 3.10.0-327.el7.x86_64, CPU 2 x Intel® Xeon® E5-2699 v4 @ 2.20GHz (22 cores), RAM 396GB DDR @ 2133MHz. Configuration: Intel® Optane™ SSD DC P4800X 375GB and Intel® SSD DC P3700 1600GB. Performance: measured under 4K 70-30 workload at QD1-16 using fio-2.15.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance.
33. Predictably Fast Service
Up to 60x better at 99% QoS¹
Ideal for critical applications with aggressive latency requirements
1. Common Configuration: Intel 2U Server System, OS CentOS 7.2, kernel 3.10.0-327.el7.x86_64, CPU 2 x Intel® Xeon® E5-2699 v4 @ 2.20GHz (22 cores), RAM 396GB DDR @ 2133MHz. Configuration: Intel® Optane™ SSD DC P4800X 375GB and Intel® SSD DC P3700 1600GB. QoS: measures 99% QoS under 4K 70-30 workload at QD1 using fio-2.15.
34. Ultra Endurance
[Chart: Endurance (DWPD). MLC/TLC 2D/3D NAND SSDs: 0.5 and 3. Intel® Optane™ SSD: 30]
Up to 10x more Total Bytes Written at similar capacity¹
Architected for endurance scaling
• 'Write in place' technology
• Non-destructive write process
1. Comparing projected Intel® Optane™ SSD 750GB specifications to actual Intel® SSD DC P4600 1.6TB specifications.
Total Bytes Written (TBW) calculated by multiplying specified or projected DWPD x specified or projected warranty duration x 365 days/year.
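The TBW formula in the footnote can be sketched as below. DWPD counts full drive writes per day, so the drive capacity also enters the product. The DWPD values (3 and 30) are read from the chart. The 5-year warranty and the 1 TB capacity used for both drives are assumptions for illustration, not figures from the slide, and the function name is hypothetical.

```python
# Total Bytes Written (TBW) = capacity x DWPD x warranty years x 365 days/year.
# Assumptions: 5-year warranty and 1 TB capacity for both drives (illustrative).


def total_bytes_written_tb(capacity_tb: float, dwpd: float, warranty_years: float) -> float:
    """Return lifetime write volume in TB for a drive."""
    return capacity_tb * dwpd * warranty_years * 365


ASSUMED_WARRANTY_YEARS = 5   # assumption, not from the slide
ASSUMED_CAPACITY_TB = 1.0    # same capacity for both, per "at similar capacity"

nand_tbw = total_bytes_written_tb(ASSUMED_CAPACITY_TB, dwpd=3, warranty_years=ASSUMED_WARRANTY_YEARS)
optane_tbw = total_bytes_written_tb(ASSUMED_CAPACITY_TB, dwpd=30, warranty_years=ASSUMED_WARRANTY_YEARS)

print(f"NAND SSD at 3 DWPD:    {nand_tbw:,.0f} TB written")
print(f"Optane SSD at 30 DWPD: {optane_tbw:,.0f} TB written")
print(f"Ratio at equal capacity and warranty: {optane_tbw / nand_tbw:.0f}x")
```

At equal capacity and warranty the ratio reduces to the DWPD ratio, which is where the "up to 10x" figure comes from under these assumptions.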
36. The coming flood of data
By 2020...
• The average internet user will generate ~1.5 GB of traffic per day
• Smart hospitals will generate over 3,000 GB per day
• Self driving cars will be generating over 4,000 GB per day... each
• A connected plane will generate over 40,000 GB per day
• A connected factory will generate over 1,000,000 GB per day
Self driving car sensor data rates:
• radar ~10-100 KB per second
• sonar ~10-100 KB per second
• gps ~50 KB per second
• lidar ~10-70 MB per second
• cameras ~20-40 MB per second
All numbers are approximated
http://www.cisco.com/c/en/us/solutions/service-provider/vni-network-traffic-forecast/infographic.html
http://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/Cloud_Index_White_Paper.html
https://datafloq.com/read/self-driving-cars-create-2-petabytes-data-annually/172
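The per-sensor rates on this slide can be summed to see how they relate to the 4,000 GB per day figure. The sketch below is a rough illustration only: it assumes decimal units (1 MB = 1,000 KB, 1 GB = 1,000 MB), all sensors running at once, and no compression. The names are illustrative.

```python
# Aggregate the per-sensor data rates (low, high) in MB per second.
# Assumptions: decimal units, all sensors active simultaneously, no compression.
SENSOR_RATES_MB_PER_S = {
    "radar": (0.01, 0.1),     # ~10-100 KB/s
    "sonar": (0.01, 0.1),     # ~10-100 KB/s
    "gps": (0.05, 0.05),      # ~50 KB/s
    "lidar": (10.0, 70.0),    # ~10-70 MB/s
    "cameras": (20.0, 40.0),  # ~20-40 MB/s
}

low_rate = sum(low for low, _ in SENSOR_RATES_MB_PER_S.values())
high_rate = sum(high for _, high in SENSOR_RATES_MB_PER_S.values())


def gb_per_hour(rate_mb_per_s: float) -> float:
    """Convert a sustained MB/s rate into GB generated per hour."""
    return rate_mb_per_s * 3600 / 1000


def hours_to_reach(target_gb: float, rate_mb_per_s: float) -> float:
    """Hours of continuous operation needed to generate target_gb."""
    return target_gb / gb_per_hour(rate_mb_per_s)


print(f"Combined rate: {low_rate:.2f} to {high_rate:.2f} MB/s")
print(f"Per hour: {gb_per_hour(low_rate):.0f} to {gb_per_hour(high_rate):.0f} GB")
print(f"Hours to reach 4,000 GB at the high rate: {hours_to_reach(4000, high_rate):.1f}")
```

Lidar and cameras dominate the total; radar, sonar, and GPS together contribute well under 1 MB/s.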
37. Analytics needs AI
Today | Emerging
Operational Analytics | Advanced Analytics
Descriptive Analytics | Diagnostic Analytics | Predictive Analytics | Prescriptive Analytics | Cognitive Analytics
Hindsight: What Happened
Insight: What Happened and Why
Foresight: What Will Happen, When, and Why
Simulation-Driven Analysis and Decision-Making
Self-Learning and Completely Automated Enterprise
Mature Data Lake
Computerized Human Thought Simulation and Actions
Towards Autonomic Enterprise
AI is a large category all on its own, and a vital tool for reaching higher maturity & scale
Data analytics
41. AI Datacenter
All purpose: Intel® Xeon® Processor Family
• Training & Inference
• Scalable performance for widest variety of AI & other datacenter workloads, including deep learning training & inference
Highly-parallel: Intel® Xeon Phi™ Processor (Knights Mill✝)
• Faster DL Training
• Scalable performance optimized for even faster deep learning training and select highly-parallel datacenter workloads*
Flexible acceleration: Intel® FPGA
• Enhanced DL Inference
• Scalable acceleration for deep learning inference in real-time with higher efficiency, and wide range of workloads & configurations
Deep Learning: Crest Family✝
• Deep learning by design
• Scalable acceleration with best performance for intensive deep learning training & inference
✝Codename for product that is coming soon
All performance positioning claims are relative to other processor technologies in Intel's AI datacenter portfolio
*Knights Mill (KNM); select = single-precision highly-parallel workloads generally scale to >100 threads and benefit from more vectorization, and may also benefit from greater memory bandwidth e.g. energy (reverse time migration), deep learning training, etc.
All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
42. Intel® Xeon® Scalable processors for AI
Most agile AI platform
Scalable performance for widest variety of AI & other datacenter workloads, including deep learning
Built-in ROI
• Begin your AI journey today using existing, familiar infrastructure
Potent performance
• Train in hours instead of days, with up to 113X² perf vs. Intel Xeon E5 v3 (2.2x excluding optimized SW¹)
Production-ready
• Robust support for full range of AI deployments
1,2 Source: Intel measured as of November 2016. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804. See slide 15 for configuration details.
43. Intel® Xeon® Inference & training performance
INFERENCE THROUGHPUT: Up to 2.4x higher Neon ResNet 18 inference throughput on Intel® Xeon® Platinum 8180 Processor compared to Intel® Xeon® Processor E5-2699 v4
TRAINING THROUGHPUT: Up to 2.2x higher Neon ResNet 18 training throughput on Intel® Xeon® Platinum 8180 Processor compared to Intel® Xeon® Processor E5-2699 v4
Advance previous generation AI workload performance with Intel® Xeon® Scalable Processors
Inference throughput batch size: 1. Training throughput batch size: 256. Configuration Details on Slide: 18, 20. Source: Intel measured as of June 2017.
Inference and training throughput measured with FP32 instructions. Inference with INT8 will be higher.
44. 4444
Intel®Xeon®PlatformPerformance
INFERENCE THROUGHPUT
Up to
138x
Intel® Xeon® Platinum 8180 Processor
higher Intel optimized Caffe GoogleNet v1 with Intel® MKL
inference throughput compared to
Intel® Xeon® Processor E5-2699 v3 with BVLC-Caffe
INFERENCE using FP32 Batch Size Caffe GoogleNet v1 256 AlexNet 256 Configuration Details on Slide: 18, 25
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause
the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance Source: Intel measured as of
June 2017 Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability,
functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the
applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
TRAINING THROUGHPUT
Intel® Xeon® Platinum 8180 Processor: up to 113x higher training throughput with Intel Optimized Caffe AlexNet and Intel® MKL, compared to Intel® Xeon® Processor E5-2699 v3 with BVLC-Caffe
Hardware plus optimized software: deliver significant AI performance with hardware and software optimizations on Intel® Xeon® Scalable Processors
Optimized frameworks | Optimized Intel® MKL libraries
Inference and training throughput measured with FP32 instructions. Inference with INT8 will be higher.
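The "up to 138x" and "up to 113x" figures are ratios of two throughput measurements. A minimal sketch of how such a ratio could be computed, assuming throughput is counted as images processed per second at a fixed batch size; the function names and the timings below are hypothetical placeholders, not Intel's benchmark harness or measured data:

```python
def images_per_second(batch_size: int, iterations: int, elapsed_s: float) -> float:
    """Throughput as images processed per second of wall-clock time."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return batch_size * iterations / elapsed_s


def speedup(optimized_ips: float, baseline_ips: float) -> float:
    """Ratio of optimized throughput to baseline throughput."""
    if baseline_ips <= 0:
        raise ValueError("baseline_ips must be positive")
    return optimized_ips / baseline_ips


# Hypothetical timings for illustration only (not measured values).
# Batch size 256 matches the batch size stated on the slide.
baseline_ips = images_per_second(batch_size=256, iterations=10, elapsed_s=128.0)
optimized_ips = images_per_second(batch_size=256, iterations=10, elapsed_s=2.0)
print(f"baseline:  {baseline_ips:.1f} images/s")
print(f"optimized: {optimized_ips:.1f} images/s")
print(f"speedup:   {speedup(optimized_ips, baseline_ips):.1f}x")
```

Because the ratio compares a new processor running optimized software against an older processor running unoptimized software, it reflects the hardware and software changes combined.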
45. 45
Intel® Xeon Phi™ processor (Knights Mill)
Scalable performance optimized for even faster deep learning training and select highly-parallel datacenter workloads*
Faster time-to-train | Efficient scaling | Future ready
• Delivers up to 4X deep learning performance over Knights Landing✝
• New instruction sets deliver enhanced lower-precision performance
• Time-to-train reduction is the primary benchmark to judge deep learning training performance
• Direct access of up to 400 GB of memory with no PCIe performance lag (vs. GPU: 16 GB)
• Efficient scaling further reduces time-to-train when utilizing scaled Knights Mill systems
• Up to 400X deep learning performance on existing HW via Intel SW optimization
• Share deep learning software investments across Intel platforms via Intel deep learning software tools
• Binary-compatible with Intel® Xeon® processor
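Since time-to-train is named as the primary benchmark, it can help to translate a performance multiplier into training time. A minimal sketch, assuming the ideal case where the full speedup carries over to end-to-end training time (real runs are also bounded by I/O, communication, and other overheads); the 100-hour baseline is a hypothetical placeholder, not a measured value:

```python
def time_to_train(baseline_hours: float, speedup: float) -> float:
    """Training time after applying a speedup, in the ideal case."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_hours / speedup


def reduction_percent(speedup: float) -> float:
    """Percentage of the baseline training time that is saved."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 - 1.0 / speedup) * 100.0


# Hypothetical 100-hour baseline job with the slide's "up to 4X" figure.
baseline_hours = 100.0
print(f"time-to-train: {time_to_train(baseline_hours, 4.0):.1f} h")
print(f"reduction:     {reduction_percent(4.0):.1f} %")
```

The reduction grows slowly once the multiplier is large: under the same assumption, going from 4X to 8X saves only a further 12.5 percentage points of the baseline time.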
✝Knights Landing is the former codename for the Intel® Xeon Phi™ processor family that was released in 2016
Configuration details on final slides
*Knights Mill (KNM); select = single-precision highly-parallel workloads generally scale to >100 threads and benefit from more vectorization, and may also benefit from greater memory bandwidth e.g. energy (reverse time migration), deep learning training, etc.
Source: Intel measured as of November 2016
Faster DL training | Highly-parallel
46. 46
Deep learning by design
Scalable acceleration with best performance for intensive deep learning training & inference, period
Crest family
Custom hardware
• Unprecedented compute density
• Large reduction in time-to-train
Blazing data access
• 32 GB of in-package memory via HBM2 technology
• 8 terabits/s of memory access speed
High-speed scalability
• 12 bi-directional high-bandwidth links
• Seamless data transfer via interconnects
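The memory figures are quoted in bits per second and gigabytes, so a unit conversion makes them easier to compare. A minimal sketch, assuming decimal units (1 terabit = 10^12 bits, 1 GB = 10^9 bytes) and that the quoted 8 terabits/s is a peak rate; the helper names are hypothetical:

```python
BITS_PER_BYTE = 8


def terabits_to_terabytes_per_s(terabits_per_s: float) -> float:
    """Convert a bandwidth in terabits/s to terabytes/s."""
    return terabits_per_s / BITS_PER_BYTE


def full_sweep_ms(capacity_gb: float, terabits_per_s: float) -> float:
    """Milliseconds to read the whole memory once at the given peak rate."""
    bytes_per_s = terabits_per_s * 1e12 / BITS_PER_BYTE
    return capacity_gb * 1e9 / bytes_per_s * 1e3


# Figures from the slide: 32 GB of HBM2 at 8 terabits/s.
print(f"bandwidth:  {terabits_to_terabytes_per_s(8.0):.1f} TB/s")
print(f"full sweep: about {full_sweep_ms(32.0, 8.0):.0f} ms")
```

Under these assumptions, 8 terabits/s corresponds to 1 TB/s, and one pass over all 32 GB would take roughly 32 ms at peak rate.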
1Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance
Source: Intel measured as of November 2016
2017
47. 47
AI frameworks optimized for Intel architecture
BigDL | MLlib
and more frameworks enabled via Intel® Nervana™ Graph (future)
See roadmap for availability
Other names and brands may be claimed as the property of others.
Intel®'s reference deep learning framework, committed to best performance on all hardware
intelnervana.com/neon