Semiconductors are the driving force behind the evolution of AI, enabling its adoption across application areas ranging from connected and automated driving to smart healthcare and wearables. As a result, electronics research, design, and manufacturing communities around the world are increasingly investing in specialized AI chips that deliver lower latency, greater processing power, higher bandwidth, and faster performance. AI is also drawing new technology players to build their own specialized AI chips, reshaping the electronics manufacturing landscape and pushing the technology further into machine learning, deep learning, and neural networks.
4. 4
CUDA DEVELOPMENT ECOSYSTEM
CUDA: Programming Model, GPU Architecture, System Architecture
[Diagram: the CUDA development ecosystem, spanning from ease of use to specialized performance]
Applications, Frameworks, and Libraries
Directives and Standard Languages
Extended Standard Languages
CUDA-C++ and CUDA Fortran
GPU users range from domain specialists and problem specialists to new algorithm developers and optimization experts.
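To make the ease-of-use end of this spectrum concrete, here is a minimal sketch (not from the slides) of the "Extended Standard Languages" tier: writing a GPU kernel in Python with Numba instead of dropping down to CUDA-C++. The SAXPY kernel and launch configuration are illustrative only.

```python
# A minimal Numba CUDA sketch: a SAXPY kernel written in Python.
import numpy as np
from numba import cuda

@cuda.jit
def saxpy(a, x, y, out):
    i = cuda.grid(1)              # global thread index
    if i < out.size:              # guard against the final partial block
        out[i] = a * x[i] + y[i]

n = 1 << 20
x = np.ones(n, dtype=np.float32)
y = np.full(n, 2.0, dtype=np.float32)
out = np.empty_like(x)

threads = 256
blocks = (n + threads - 1) // threads
saxpy[blocks, threads](3.0, x, y, out)   # Numba copies the host arrays to the GPU and back
print(out[:4])                           # [5. 5. 5. 5.]
```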
7. 7
TESLA V100 32GB
THE MOST ADVANCED DATA CENTER GPU EVER BUILT
5,120 CUDA cores
640 NEW Tensor cores
7.5 FP64 TFLOPS | 15 FP32 TFLOPS
120 Tensor TFLOPS
20MB SM RF | 16MB Cache | 32GB HBM2 @ 900 GB/s
300 GB/s NVLink
https://devblogs.nvidia.com/parallelforall/inside-volta/
8. 8
Tensor Cores: More resources
Further information
Mixed-precision blog: https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/
Mixed-precision best practices: https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html
Mixed-precision arXiv paper: https://arxiv.org/abs/1710.03740
GTC 2018 Sessions: Training with Mixed Precision: Theory and Practice and Training with Mixed Precision: Real Examples
Call to action:
1) Share these resources
2) Provide feedback to help us prioritize new examples and improve examples
DGX/NGC deep learning framework containers: Latest versions of software stack and Tensor Core optimized examples - DGX/NGC Registry
Tools
TensorFlow OpenSeq2Seq: https://nvidia.github.io/OpenSeq2Seq/html/mixed-precision.html & arXiv paper
PyTorch Apex: https://nvidia.github.io/apex/fp16_utils.html & NVIDIA developer news article
Examples
New mixed-precision model examples: https://developer.nvidia.com/deep-learning-examples
GitHub: https://github.com/NVIDIA/DeepLearningExamples
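To make the technique behind these resources concrete, here is a simplified sketch of mixed-precision training with FP32 master weights and static loss scaling in PyTorch. It illustrates the idea from the blog and arXiv paper above rather than the Apex API itself; the model, batch size, and loss-scale value are placeholders.

```python
# A simplified mixed-precision training step: FP16 model, FP32 master weights, static loss scaling.
import torch

model = torch.nn.Linear(1024, 1024).cuda().half()                         # FP16 model for Tensor Core math
master_params = [p.detach().clone().float() for p in model.parameters()]  # FP32 master copy of the weights
for mp in master_params:
    mp.requires_grad = True
optimizer = torch.optim.SGD(master_params, lr=1e-3)
loss_scale = 128.0                                   # keeps small FP16 gradients from flushing to zero

x = torch.randn(32, 1024, device="cuda", dtype=torch.float16)
target = torch.randn(32, 1024, device="cuda", dtype=torch.float16)

out = model(x)                                       # FP16 forward pass
loss = torch.nn.functional.mse_loss(out.float(), target.float())
(loss * loss_scale).backward()                       # scaled backward pass produces FP16 gradients

for p, mp in zip(model.parameters(), master_params):
    mp.grad = p.grad.float() / loss_scale            # unscale into FP32 master gradients
optimizer.step()                                     # FP32 weight update

for p, mp in zip(model.parameters(), master_params): # copy updated master weights back to the FP16 model
    p.data.copy_(mp.data.half())
model.zero_grad()
```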
9. 9
Li et al. (University of Maryland & US Naval Academy), "Visualizing the Loss Landscape of Neural Nets"
https://arxiv.org/pdf/1712.09913.pdf
12. 12
NVIDIA TensorRT
Optimize and deploy neural networks in production environments
Maximize throughput for latency-critical apps with optimizer and runtime
Deploy responsive and memory-efficient apps with INT8 & FP16 optimizations
Accelerate every framework with TensorFlow integration and ONNX support
Run multiple models on a node with the containerized inference server
Platform for High-Performance Deep Learning Inference
developer.nvidia.com/tensorrt
[Diagram: a trained neural network is compiled by the TensorRT Optimizer into an engine, which the TensorRT Runtime deploys on embedded (Jetson), automotive (DRIVE PX), and data center (Tesla) platforms]
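As a rough illustration of that optimizer-to-runtime flow, here is a minimal sketch of building an engine from an ONNX model with the TensorRT Python API (as exposed in the TensorRT 5/6-era bindings); the file names, batch size, and workspace size are placeholders.

```python
# A minimal TensorRT build sketch: ONNX model in, optimized engine out.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("model.onnx", "rb") as f:               # placeholder filename
    if not parser.parse(f.read()):                # populate the network from the ONNX graph
        raise RuntimeError("ONNX parse failed")

builder.max_batch_size = 8
builder.max_workspace_size = 1 << 30              # 1 GB scratch space for layer tactics
builder.fp16_mode = True                          # allow reduced-precision kernels on Tensor Cores

engine = builder.build_cuda_engine(network)       # TensorRT Optimizer output
with open("model.plan", "wb") as f:
    f.write(engine.serialize())                   # deployed later by the TensorRT Runtime
```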
13. 13
320 Turing Tensor Cores
2,560 CUDA Cores
65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
16GB | 320GB/s
70 W
Now Available Early Access
TESLA T4 ON GCP
The Cloud GPU
14. 14
WORLD’S MOST ADVANCED
INFERENCE GPU
INTEGRATED INTO TENSORFLOW &
ONNX SUPPORT
TENSORRT HYPERSCALE INFERENCE PLATFORM
TENSORRT
INFERENCE SERVER
15. 15
NVIDIA DEEP LEARNING SDK UPDATE
cuDNN 7.4 | GPU-accelerated DL primitives: support for Turing, optimizations for RNNs, extended support for NHWC
NCCL 2.3 | Multi-GPU & multi-node: multi-node distributed training (multiple machines), leading frameworks support
TensorRT 5 | High-performance inference engine: TensorFlow model reader, Turing & Xavier support, INT8 RNN support
DOWNLOAD NOW
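As one concrete example of the multi-node support listed for NCCL 2.3, here is a minimal sketch of multi-GPU training with PyTorch's DistributedDataParallel on the NCCL backend. The launch details (one process per GPU, with RANK, WORLD_SIZE, LOCAL_RANK, and the master address set by a launcher) and the toy model are assumptions for illustration.

```python
# A minimal DistributedDataParallel sketch using the NCCL backend.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")             # NCCL handles the all-reduce traffic
local_rank = int(os.environ["LOCAL_RANK"])          # assumed to be set by the launcher
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 512).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])          # gradients are all-reduced with NCCL

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
x = torch.randn(64, 512, device=f"cuda:{local_rank}")
loss = model(x).sum()
loss.backward()                                      # NCCL all-reduce overlaps with backward
optimizer.step()
```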
16. 16
NEW PLATFORM SUPPORT
NVIDIA XAVIER
Deploy deep learning inference on Xavier platforms through NVIDIA DRIVE AI platforms
Available for development and host environments
CENTOS OPERATING SYSTEM
Use AI from hospitals to trading floors to factory floors
Support for C++ API and Python API
Available as an RPM file
NVDLA (WWW.NVDLA.ORG)
Deploy deep learning inference on DLA platforms
Support for FP16 precision in the current version
WINDOWS OPERATING SYSTEM
Use AI for games and medical apps
Support for C++ API
Available as a ZIP file
24. 24
ISAAC
Simulation Engine: photo-realistic graphics, physics, soft bodies, procedural generation, massive parallelism, Unreal Engine 4 / Unity 3D
Isaac Framework: codelets, behaviors, 3D poses, distributed messaging, synchronization, record & replay, configuration, visualization
World model: warehouse, office, store, home
Robot model: Carter, URDF loader
ML: TensorRT, CUDA, TensorFlow, ...
Gems: optimizers, algebra, EKFs, depth, ...
Capabilities: navigation, behaviors, interactions with humans, manipulation, perception
Drivers: lidar, camera, IMU, robot base, ...
Virtual sensors and actuators in simulation map to the same sensor processing and actuator control used with hardware sensors and actuators
Unified Message API: use the same messages for simulation, actual hardware, and across all apps (develop, simulate, deploy)
Jetson: fully integrated with TX2 and Xavier
27. 27
TensorRT SUPPORTS Xavier
Deploy deep learning inference on Xavier platforms through NVIDIA DRIVE AI platforms
Import models from any framework (including TensorFlow, Caffe and Torch) through ONNX, Universal Framework Format, or a custom C/C++ API
Optimize CNN, RNN and novel neural network layers and deploy reduced precision on Tensor Cores
Download for development or host environments today
Optimized Inference on the World's Most Powerful SoC
developer.nvidia.com/tensorrt
[Diagram: DRIVE software stack on Xavier, with camera, radar, and lidar inputs feeding surround perception (lanes, signs, lights, path), camera and lidar localization, egomotion, and path planning]
28. 28
XAVIER DLA
NOW OPEN SOURCE
[Block diagram: NVDLA architecture]
Command interface
Tensor execution micro-controller
Memory interface with input DMA (activations and weights)
Unified 512 KB input buffer (activations and weights)
Sparse weight decompression
Native Winograd input transform
MAC array: 2048 INT8, 1024 INT16, or 1024 FP16
Output accumulators
Output postprocessor (activation function, pooling, etc.)
Output DMA
WWW.NVDLA.ORG
29. 29
NVIDIA AGX
Family of Systems for
Embedded AI HPC
Self-driving cars
Robotics
Smart Cities
Healthcare
30. 30
END TO END SYSTEM FOR AV
Sources: NVIDIA, RAND Corporation
COLLECT DATA | TRAIN MODELS | SIMULATE | RE-SIMULATE | MAPPING
Collect data: up to 2 petabytes per car per year
Train models: 20 DNNs (lanes, lights, path, signs, pedestrians, cars), 1+ million images per DNN
Simulate / re-simulate: 10B+ miles to ensure safe driving; validate and verify for self-driving cars by 2020
Mapping: 5+ DNNs to create and update HD maps
31. 31
COMPUTATIONAL SCALE REQUIRED
3 million labeled images
1 DGX-1 trains 300k labeled images on 1 DNN in 1 day
10 DNNs required for self-driving
10 parallel experiments at all times
100 DGX-1 per car
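The 100-DGX-1 figure follows from the numbers above; here is the back-of-the-envelope arithmetic as a small sketch (an illustration of the slide's reasoning, not an official sizing guide).

```python
# Back-of-the-envelope arithmetic behind "100 DGX-1 per car" (illustrative only).
images_per_dnn = 3_000_000          # labeled images per DNN
images_per_dgx_day = 300_000        # one DGX-1 trains 300k images on one DNN per day
dnns = 10                           # DNNs required for self-driving
parallel_experiments = 10           # experiments kept running at all times

days_per_run = images_per_dnn / images_per_dgx_day   # 10 days for one DNN on one DGX-1
concurrent_runs = dnns * parallel_experiments        # 100 training runs in flight
dgx_1_systems = concurrent_runs                      # one DGX-1 per concurrent run
print(days_per_run, dgx_1_systems)                   # 10.0 100
```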
33. 33
TURING MULTI-PROCESS SERVICE
Improved small-batch inference throughput
Hardware-accelerated work submission
Hardware isolation
[Diagram: CPU processes A, B, and C submit work through the CUDA Multi-Process Service control and execute concurrently on the Turing GPU]
Turing MPS inherits Volta's enhanced MPS architecture:
• Reduced launch latency
• Improved launch throughput
• Improved quality of service with scheduler partitioning
• 3x more clients than Pascal
36. Source: J. Zhang, KLA-Tencor, "Physics-Based AI for Semiconductor Inspection Using a GPU-Based Optical Neural Network," GTC 2018
Physics-based AI for Semiconductor Wafer Inspection
KLA-Tencor patent pending
37. 37
AI-BUILD AI TO FABRICATE SUBATOMIC MATERIALS
To expand the benefits of deep learning for science, researchers need new tools to build high-performing neural networks that don't require specialized knowledge. Scientists at Oak Ridge National Laboratory used the MENNDL algorithm on Summit to develop a neural network that analyzes electron microscopy data at the atomic level. The team achieved a speed of 152.5 petaflops across 3,000 nodes.
39. 39
GPU ACCELERATED COMPUTING
HPC | AI | ML | VISUALIZATION
CLARA | NVIDIA AI | RAPIDS | ISAAC | DRIVE | METROPOLIS
WORKSTATIONS | SERVERS | CLOUD
CUDA & GPU COMPUTING ARCHITECTURE
NVIDIA GPU CLOUD
TESLA | DGX | TEGRA
40. 40
RE-IMAGINING DATA SCIENCE WORKFLOW
Open Source, End-to-end GPU-accelerated Workflow Built On CUDA
Data preparation / wrangling: cuDF
Optimized ML model training: cuML
Visualization: data visualization libraries for data insights
WWW.RAPIDS.AI
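To give a feel for that workflow, here is a minimal sketch of the cuDF-to-cuML hand-off on the GPU; the input file, column names, and clustering model are made up for illustration.

```python
# A minimal RAPIDS sketch: wrangle with cuDF, train with cuML, all on the GPU.
import cudf
from cuml.cluster import KMeans

# Data preparation / wrangling with cuDF (pandas-like API on the GPU)
df = cudf.read_csv("transactions.csv")       # hypothetical input file
df = df.dropna()
features = df[["amount", "latency_ms"]]      # hypothetical feature columns

# Optimized ML model training with cuML (scikit-learn-like API on the GPU)
km = KMeans(n_clusters=8)
km.fit(features)
df["cluster"] = km.predict(features)

# Hand the labeled frame to visualization libraries for data insights
print(df["cluster"].value_counts())
```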
42. 42
ENTERPRISE DATA MODEL SCALABILITY
DGX-2: 512 GB | 10,240 Tensor Cores
DGX-1: 256 GB | 5,120 Tensor Cores
DGX Station: 128 GB | 2,560 Tensor Cores
2x RTX 8000: 96 GB | 1,152 Tensor Cores
Tesla V100: 32 GB | 640 Tensor Cores
43. DGX POD ARCHITECTURE
A single data center rack containing up to 9x NVIDIA DGX-1 servers, storage, networking & NVIDIA AI software
Nine DGX-1 servers
12 storage servers
10 GbE (min) storage & management switch
Mellanox 100 Gbps intra-rack high speed network switches
44. 44
WHY CONTAINERS?
Benefits of containers:
Simplify deployment of GPU-accelerated software, eliminating time-consuming software integration work
Isolate individual deep learning frameworks and applications
Share, collaborate, and test applications across different environments
To learn more: nvidia.com/ngc
To sign up: ngc.nvidia.com
46. 46
POWERING THE WORLD’S FASTEST
SUPERCOMPUTERS
48% More Systems | 22 of Top 25 Greenest
Piz Daint (Europe's Fastest): 5,704 GPUs | 21 PF
ORNL Summit (World's Fastest): 27,648 GPUs | 144 PF
ABCI (Japan's Fastest): 4,352 GPUs | 20 PF
ENI HPC4 (Fastest Industrial): 3,200 GPUs | 12 PF
LLNL Sierra (World's 2nd Fastest): 17,280 GPUs | 95 PF
47.
48. 48
NEXT STEPS
NVIDIA Deep Learning Institute
www.nvidia.com/en-us/deep-learning-ai/education
GTC San Jose | March 26-29 2018
www.nvidia.com/en-us/gtc
NGC
www.nvidia.com/en-us/gpu-cloud