PowerAI Deep dive
1. Florin Manaila
HPC/Deep Learning Architect and Inventor
IBM Cognitive Systems Europe
florin.manaila@de.ibm.com
August 31, 2018
IBM PowerAI Deep Learning Platform
(architecture, hardware roadmap, future innovation)
2. AI Infrastructure Stack
Cognitive Systems Europe / August 31 / © 2018 IBM Corporation
Vision
Enterprise
L1-L3 Support
Base
Transform & Prep
Data (ETL)
Micro-Services / Applications
AI APIs
(e.g., Watson)
In-House APIs
Machine & Deep Learning
Libraries & Frameworks
Distributed Computing
Data Lake & Data Stores
Segment Specific:
Finance, Retail, Healthcare,
Automotive
Speech, Vision,
NLP, Sentiment
TensorFlow, Caffe,
PyTorch
SparkML, Snap.ML
Spark, MPI
Hadoop HDFS,
NoSQL DBs,
Parallel File
System
Accelerated
Infrastructure
3. AI Infrastructure Stack Challenges
Transform & Prep
Data (ETL)
Micro-Services / Applications
AI APIs
(e.g., Watson)
In-House APIs
Machine & Deep Learning
Libraries & Frameworks
Distributed Computing
Data Lake & Data Stores
Data Prep, ETL, Curation,
Data Labeling
Performance to Reduce Training Time
Multi-tenant, GPU Virtualization,
DL Framework Scaling
Feature extraction, Selecting Right
Model, Hyper-parameter tuning
Finding Right “Tagged”
Data, Model Integrity
Use Case Identification,
Access to Enough Data
4. What’s in the training of deep neural networks?
Neural network model
Billions of parameters
Gigabytes
Computation
Iterative gradient based search
Millions of iterations
Mainly matrix operations
Data
Millions of images, sentences
Terabytes
Workload characteristics: Both compute and data intensive!
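The "iterative gradient based search" named on this slide can be illustrated with a minimal sketch. It is a toy example added for illustration, not something taken from the deck: it fits a two-parameter linear model with plain gradient descent in pure Python, whereas real deep learning training runs the same kind of loop over far more parameters using matrix operations on GPUs. The function name `train_linear`, the learning rate, and the data are hypothetical.

```python
def train_linear(xs, ys, lr=0.05, iterations=2000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # One search step: move the parameters against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 3x + 1 (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 4.0, 7.0, 10.0, 13.0]
w, b = train_linear(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```

Each iteration touches every training example, which is why both the amount of data and the number of iterations drive training time.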
5. Deep Learning at work
Available options
[Figure: available options, arranged from longer training time to shorter training time]
6. Data processing stages for distributed deep learning
Training data
on storage
CPU:
Coordination
and data prep
GPU
computation
Parameter data
exchange
across systems
Network,
NVLink,
GPU Memory
POWER9
CPU
Storage
NVMe, SSD,
ESS
GPU
PCIe Gen. 4
2nd Gen NVLink
Source: Hillery Hunter, IBM, GTC 2018
7. NVIDIA GPU implementation in AC922 Deep Learning System
NVLINK 2.0
Innovative Systems with NVLink 2.0:
• Faster GPU-GPU communication
• Breaks down barriers between CPU and GPU
• New system architectures
• By contrast, acceleration on PCIe-attached GPU systems is limited by PCIe Gen3
8. IBM AC922 Deep Learning System Architecture
AC922-GTG
9. IBM AC922 Deep Learning System Architecture
AC922-GTW
10. x86 GPU System vs IBM AC922 Deep Learning System
3D Image Segmentation Use Case
When factoring out this
inter-batch overhead the
NVLink 2.0 + Volta V100
combination is still 2.4x
faster than the PCIe Gen3
+ Volta V100 combination
11. Unified Memory with ATS on IBM POWER9
IBM POWER9 CPUs With NVLink Interconnect
ALLOCATION
Automatic access to all system memory: malloc,
stack, file system
ACCESS
All data accessible concurrently from any processor,
anytime
Atomic operations resolved directly over NVLink
ATS & POWER9 FEATURES
ATS allows GPUDirect RDMA to unified memory
Managed memory is cache-coherent between CPU
and GPU
CPU has direct access to GPU memory without need
for migration
12. IBM AC922 Deep Learning System
AC922-GTG
13. IBM AC922 Deep Learning System
AC922-GTW
14. IBM AC922 System
Options and Features
Processor Features
16-core processor module, 190W – 250W (2.25 GHz - 3.12 GHz)
20-core processor module, 190W – 250W (2.25 GHz - 2.80 GHz)
18-core processor module, 190W – 250W (2.98 GHz - 3.26 GHz)
22-core processor module, 190W – 250W (2.78 GHz - 3.07 GHz)
Memory Features
8GB IS RDIMM DDR4
16GB IS RDIMM DDR4
32GB IS RDIMM DDR4
64GB IS RDIMM DDR4
128GB IS RDIMM DDR4
Storage Features
HDD 1TB 2.5” 7k RPM SATA
HDD 2TB 2.5” 7k RPM SATA
SSD 960GB 2.5” SATA
SSD 1.92TB 2.5” SATA
SSD 3.84TB 2.5” SATA
1.6TB NVMe Flash Adapter
3.2TB NVMe Flash Adapter
6.4TB NVMe Flash Adapter
16. IBM AC922 System, Options and Features
PCIe Adapter Features
4-Port Ethernet (4x 1Gb)
2-Port 40/100 GbE RoCE SFP+
2-Port Ethernet (10Gb)
4-Port Ethernet (2x 10Gb Optical + 2x 1Gb)
4-Port Ethernet Cu (2x 10Gb Cu + 2x 1Gb)
2-Port 10Gb/s NIC & RoCE SR/Cu
2-Port 25/10Gb/s NIC & RoCE SR/Cu
1-Port EDR 100Gb IB CX-5 CAPI
2-Port EDR 100Gb IB CX-5 CAPI
2-Port Fibre Channel (16Gb/s)
2-Port Fibre Channel (32Gb/s)
Accelerators Features
NVIDIA V100 SXM2 16GB HBM2
NVIDIA V100 SXM2 32GB HBM2
Xilinx ADM-PCIE-8V3 FPGA
18. IBM AC922 Deep Learning System
Front and Rear View
[Figure: rear view and front view of the system]
19. Volta SXM2 GPU Accelerator
NVIDIA GPU Details
Top side: multi-chip module, power regulation
Bottom side: 2x 400-pin connectors, 2x grounding pads
NVIDIA Volta Specifications
NVIDIA Tensor Cores: 640
NVIDIA CUDA Cores: 5120
Peak Double-Precision Performance: 7.8 TFLOPS
Single-Precision Performance: 15.7 TFLOPS
Tensor Performance: 125 TFLOPS
Memory Bandwidth: 900 GB/sec
GPU Memory Size: 16 GB or 32 GB HBM2
NVLink “Bricks” (8 lane interface): 6
NVLink Interconnect Bi-Directional: 300 GB/sec
Maximum Power: 300W
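As a rough cross-check, the peak figures in the specification list above can be derived from the unit counts. The sketch below is a back-of-the-envelope calculation, not a benchmark. The boost clock (about 1530 MHz), the 64 fused multiply-adds per Tensor Core per clock, the 2:1 ratio of FP32 to FP64 throughput, and the 50 GB/sec bi-directional bandwidth per NVLink brick are assumptions drawn from NVIDIA's public Volta material; none of them appear on the slide.

```python
# Unit counts taken from the specification list above.
TENSOR_CORES = 640
CUDA_CORES = 5120
NVLINK_BRICKS = 6

# Assumptions not stated on the slide (public Volta figures).
BOOST_CLOCK_HZ = 1.53e9        # assumed boost clock, about 1530 MHz
FMA_PER_TENSOR_CORE = 64       # fused multiply-adds per clock per Tensor Core
FLOPS_PER_FMA = 2              # one multiply plus one add
GBPS_PER_BRICK_BIDIR = 50      # 25 GB/sec in each direction per brick

tensor_tflops = (TENSOR_CORES * FMA_PER_TENSOR_CORE * FLOPS_PER_FMA
                 * BOOST_CLOCK_HZ / 1e12)
fp32_tflops = CUDA_CORES * FLOPS_PER_FMA * BOOST_CLOCK_HZ / 1e12
fp64_tflops = fp32_tflops / 2  # assumed half-rate double precision
nvlink_gbps = NVLINK_BRICKS * GBPS_PER_BRICK_BIDIR

print(f"Tensor performance:  {tensor_tflops:.1f} TFLOPS")
print(f"Single precision:    {fp32_tflops:.1f} TFLOPS")
print(f"Double precision:    {fp64_tflops:.1f} TFLOPS")
print(f"NVLink bandwidth:    {nvlink_gbps} GB/sec bi-directional")
```

Under these assumptions the results round to the 125, 15.7, and 7.8 TFLOPS and the 300 GB/sec quoted on the slide.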
20. Server-based FPGA, e.g. ADM-PCIE-8V3
Features
• Board Format: half-length, low-profile x16 PCIe form factor
• Host I/F: PCI Express Gen3 x8
• Target Device: Xilinx Virtex UltraScale: XCVU095-2-FFVC1517
• SDRAM: 2x banks of 1G x 72, DDR4-2400 (16 GiB total), upgradable to 16 GiB per bank, DDR4-1866 (dual-bank devices; 32 GiB total)
• FLASH: on-board re-programmable flash memory for embedded configuration
• Optional integrated Board Support Package (BSP) including extensive FPGA example designs, plug and play drivers, and a mature Application Programming Interface (API)
21. CAPI Advantages on AC922 Deep Learning System
22. OpenBMC is a free open source management software Linux distribution
IBM is the OpenBMC Community Leader
Facebook, Google, IBM, Intel, Microsoft, OCP
Feature List:
REST Management
IPMI
SSH based SOL
Power and Cooling Management
Event Logs
Zeroconf discoverable
Sensors
Inventory
LED Management
Host Watchdog
Simulation
Code Update Support for multiple BMC/BIOS images
POWER On Chip Controller (OCC) Support
Features In Progress:
Full IPMI 2.0 Compliance with DCMI
Verified Boot
HTML5 JavaScript Web User Interface
BMC RAS
25. IBM PowerAI at a glance
June 2018 update
26. IBM PowerAI Base @hub.docker.com
27. IBM PowerAI Base usage at a glance
PowerAI framework activation (Python2 or Python3)
Activation scripts are used to manage system and python paths
To activate PowerAI deep learning frameworks:
$ source /opt/DL/<framework-name>/bin/<framework-name>-activate
This script sets PATH and PYTHONPATH to the appropriate values for the desired deep learning framework, which resides in the /opt/DL directory.
<framework-name>-activate also calls check_dependencies.
Activation only happens if all dependencies are met.
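The path pattern above can be expanded mechanically. The following is a small sketch that only builds the command string; the helper name `activation_command` is hypothetical, and "caffe" is used purely as an example value for `<framework-name>`.

```python
import shlex

DL_ROOT = "/opt/DL"  # install root named in the text above


def activation_command(framework_name):
    """Return the shell command that activates one PowerAI framework."""
    script = f"{DL_ROOT}/{framework_name}/bin/{framework_name}-activate"
    # shlex.quote leaves plain paths unchanged and protects unusual names.
    return f"source {shlex.quote(script)}"


# "caffe" is only an example value for <framework-name>.
print(activation_command("caffe"))
```

Because the script modifies PATH and PYTHONPATH, it has to be sourced in the same shell session that later runs the framework; launching it in a separate child process would not change the caller's environment.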
28. What data science methods are used at work?
34. IBM Snap ML part of PowerAI Base
Distributed GPU-Accelerated Machine Learning Library
Snap Machine Learning (ML) Library
APIs for Popular ML Frameworks
Distributed Training
Logistic Regression
Linear Regression
Support Vector Machines (SVM)
Distributed Hyper-Parameter Optimization
More Coming Soon
(coming soon)
libGLM (C++ / CUDA Optimized Primitive Lib)
35. Snap ML: Training Time Goes From An Hour to Minutes
46x faster than previous record set by Google
Workload: Click-through rate prediction for advertising
Logistic Regression Classifier in Snap ML using GPUs vs TensorFlow using CPU-only
[Figure: bar chart of runtime in minutes. Google, CPU-only (90 x86 servers): 1.1 hours. Snap ML, Power + GPU (4 Power9 servers with GPUs): 1.53 minutes. 46x faster.]
Dataset: Criteo Terabyte Click Logs (http://labs.criteo.com/2013/12/download-terabyte-click-logs/)
4 billion training examples, 1 million features
Model: Logistic Regression: TensorFlow vs Snap ML
Test LogLoss: 0.1293 (Google using TensorFlow), 0.1292 (Snap ML)
Platform: 89 CPU-only machines in Google using TensorFlow versus 4 AC922 servers (each 2 Power9 CPUs + 4 V100 GPUs) for Snap ML
Google data from this Google blog
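The headline numbers on this slide can be cross-checked with simple arithmetic. The sketch below uses only the values printed on the chart. The rounded "1.1 Hours" label yields a ratio slightly below 46x, so the 46x claim presumably rests on an unrounded baseline of roughly 70 minutes; that reading is an inference, not something the slide states.

```python
google_hours = 1.1       # rounded chart label for the CPU-only run
snap_ml_minutes = 1.53   # chart label for the Snap ML run
claimed_speedup = 46     # speedup stated on the slide

google_minutes = google_hours * 60
speedup_from_labels = google_minutes / snap_ml_minutes
implied_baseline_minutes = claimed_speedup * snap_ml_minutes

print(f"Speedup from rounded labels: {speedup_from_labels:.1f}x")
print(f"Baseline implied by 46x: {implied_baseline_minutes:.1f} minutes")
```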
38. IBM PowerAI Enterprise V1.1
Announced in June 2018
Enterprise
Deep Learning Impact (DLI) Module: Data & Model Management, ETL, Visualize, Advise
IBM Conductor with Spark: Cluster Virtualization, Auto Hyper-Parameter Optimization
PowerAI: Open Source ML Frameworks
Large Model Support (LMS)
Distributed Deep Learning (DDL)
Auto ML
Accelerated Infrastructure
39. IBM PowerAI Enterprise V1.1
Announced in June 2018
Enterprise
Deep Learning Impact
Data Management and ETL
Training visualization and monitoring
Hyper-parameter optimization
Spectrum Conductor
Multi-tenancy support & security
User reporting & charge back
Dynamic resource allocation
External data connectors
Distributed Deep Learning (DDL)
Support Line L1-L3
Accelerated Infrastructure
40. Real time monitoring of hyper parameters in PowerAI Enterprise
41. Hyper-parameter Tuning/Search in PowerAI Enterprise
Hyper-parameters
– Learning rate
– Decay rate
– Batch size
– Optimizer: GradientDescent, Adadelta, Momentum, RMSProp, …
– Momentum (for some optimizers)
– LSTM hidden unit size (for models which use LSTM)
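To make the idea of hyper-parameter search concrete, here is a minimal random-search sketch over the hyper-parameters listed above. It is not how PowerAI Enterprise implements its search, which the slide does not describe. The value ranges, the function names, and the toy objective `toy_validation_loss` (a stand-in for an actual training run) are all assumptions made for illustration.

```python
import math
import random

# Search space mirroring the list above; ranges and choices are assumptions.
SEARCH_SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -1),
    "decay_rate": lambda rng: rng.uniform(0.8, 0.99),
    "batch_size": lambda rng: rng.choice([32, 64, 128, 256]),
    "optimizer": lambda rng: rng.choice(
        ["GradientDescent", "Adadelta", "Momentum", "RMSProp"]),
}


def sample_config(rng):
    """Draw one random configuration from the search space."""
    return {name: draw(rng) for name, draw in SEARCH_SPACE.items()}


def toy_validation_loss(config):
    """Hypothetical stand-in for training a model and measuring its loss."""
    lr_term = (math.log10(config["learning_rate"]) + 2.5) ** 2
    batch_term = abs(config["batch_size"] - 128) / 256
    return lr_term + batch_term


def random_search(n_trials, seed=0):
    """Try n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        loss = toy_validation_loss(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


best_config, best_loss = random_search(n_trials=50)
print(best_config, best_loss)
```

In a real setting each trial is a full training job, which is why running trials in parallel across a cluster matters.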
43. Who are the typical personas for computer vision solutions?
44. Steps for Deep Learning Development
Define training task
Prepare training data
Data pre-processing
DNN model selection
Configure the training hyper-parameters
DNN model training
Start
Package the new DNN model together with preprocessing into inference proc.
Application development with inference API
DL training framework preparation
Danielle
45. How to Simplify Deep Learning Adoption?
Format transformation
Support both training and evaluation sets
Support different pre-processing plugins
Provide base models for different scenarios
Predict training time
Training process visualization
Training with GPU
Scalability and HA deployment
46. IBM PowerAI Vision
Simplify Deep Learning Adoption
Users can use the deployed API for visual recognition
PowerAI Vision
Iris
Danny
Define training task
Prepare training data
Data pre-processing
DNN model selection
Configure the training hyper-parameters
DNN model training
Package the new DNN model together with preprocessing into inference proc.
DL training framework preparation
47. What are we solving?
Data
Up & Running
Data Pre-Processing
Build, Train, Optimize
Deploy & Infer
Maintain Model Accuracy
Training visualization & accuracy monitoring
Customize parameters for training
Datasets for classification
Datasets for object detections
Semi-auto labeling on videos
Pre-bundled models
dataset creation
Data augmentation
REST APIs for creating datasets
Export/Import datasets
Custom DNN models
Hyper-parameter search and tuning
REST APIs to infer with images/videos
Inference Engine for compiling accelerated models on edge
Image Analyst
Data Scientist
Simplified installation and deployment
Developer
Deploy where trained
Optimized models for few categories
Visualize progress and early warning
Customize models for pre-processing
User Interface is deployed
Validate trained models
Build audit systems based on low inference scores
Most vendors address this space
48. IBM PowerAI Vision
Lowers the barriers for creating Computer Vision related AI solutions.
49. IBM PowerAI Vision
Lowers the barriers for creating Computer Vision related AI solutions.
50. IBM PowerAI Vision
Lowers the barriers for creating Computer Vision related AI solutions.
51. Semi-Automatic Labeling from video content
Define Labels
Manually Label: Manually Label Some Images / Video Frames
Train DL Model
Use Trained DL Model: Run Trained DL Model on Entire Input Data to Generate Labels
Correct Labels on Some Data: Manually Correct Labels on Some Data
Repeat Till Labels Achieve Desired Accuracy
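The labeling loop on this slide can be sketched in a few lines. This is an illustration of the workflow only, not PowerAI Vision code. The function `semi_auto_label`, the toy threshold "model", and the brightness data are hypothetical, and the accuracy check compares against the human oracle directly, whereas in practice a person would review a sample of the generated labels.

```python
def semi_auto_label(items, manual_label, train, target_accuracy=0.95,
                    batch=2, max_rounds=20):
    """Alternate between manual labeling and model-generated labels."""
    verified = {}                      # index -> label provided by a human
    pending = list(range(len(items)))  # indices no human has looked at yet
    predicted = []
    for _ in range(max_rounds):
        # Manually label (or correct) a small batch of items.
        for index in pending[:batch]:
            verified[index] = manual_label(items[index])
        pending = pending[batch:]
        # Train on the verified labels, then label the entire input.
        model = train([(items[i], label) for i, label in verified.items()])
        predicted = [verified.get(i, model(x)) for i, x in enumerate(items)]
        # Stand-in accuracy check against the human oracle.
        correct = sum(p == manual_label(x) for p, x in zip(predicted, items))
        if correct / len(items) >= target_accuracy or not pending:
            break
    return predicted, len(verified)


def human_label(brightness):
    """Hypothetical human annotator for a toy brightness task."""
    return "bright" if brightness >= 5 else "dark"


def train_threshold(examples):
    """Toy 'model': a brightness threshold learned from labeled examples."""
    dark = [x for x, label in examples if label == "dark"]
    bright = [x for x, label in examples if label == "bright"]
    if not dark or not bright:
        only = "dark" if dark else "bright"
        return lambda x: only
    threshold = (max(dark) + min(bright)) / 2
    return lambda x: "bright" if x >= threshold else "dark"


frames = [1, 2, 9, 8, 3, 7, 4, 6, 0, 5]  # toy per-frame brightness values
labels, human_labels_used = semi_auto_label(frames, human_label, train_threshold)
print(labels, human_labels_used)
```

In this toy run the loop stops once the model-generated labels agree with the annotator, after only part of the data has been labeled by hand.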
52. Delivered Pre-Trained Models
Time and Data Matters
[Figure: a pre-trained Convolutional Neural Network (CNN) is adapted to a new task: fine-tune W, recreate. Example labels: Mergus, Larus, …, Corvus. Persona: Sourav]
53. IBM PowerAI Vision: Deep Learning Development Platform for Computer Vision
54. PowerAI Vision APIs
Inference APIs for Object Detection (example)
Developers can use these APIs from any IP device to run object detection with the model deployed in PowerAI Vision
http://IP:PORT/ (of the deployed inference instance)
/test
GET: Only to test if the monitor service is running.
/detect_url
GET: Upload image with image url and detect objects
/detect_upload
POST: Post image file and do the object detection
Inference return:
{'confidence': 0.9038739204406738, 'ymax': 145, 'label': 'badge', 'xmax': 172, 'xmin': 157, 'ymin': 123}
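A client typically turns the inference return into a label and a bounding box. The sketch below parses the sample response shown above. It assumes the response arrives exactly as printed, with single quotes, which makes it a Python-style literal rather than strict JSON. The helper name `parse_detection` and the `min_confidence` threshold are hypothetical.

```python
import ast

# Sample response copied from the text above. It uses single quotes, so it is
# parsed with ast.literal_eval instead of a strict JSON parser.
raw = ("{'confidence': 0.9038739204406738, 'ymax': 145, 'label': 'badge', "
       "'xmax': 172, 'xmin': 157, 'ymin': 123}")


def parse_detection(text, min_confidence=0.5):
    """Return (label, confidence, (x, y, width, height)), or None if the
    detection falls below the confidence threshold."""
    det = ast.literal_eval(text)
    if det["confidence"] < min_confidence:
        return None
    box = (det["xmin"], det["ymin"],
           det["xmax"] - det["xmin"], det["ymax"] - det["ymin"])
    return det["label"], det["confidence"], box


result = parse_detection(raw)
print(result)
```

If the deployed service returns strict JSON with double quotes, `json.loads` would be the more appropriate parser.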
56. PowerAI Inference Engine (PIE)
Automatically Map Trained AI Models to Cloud or Edge
Data Center: Train model & Compile to Edge
Trained DNN model
PowerAI Inference Engine
DNN model parser
DNN model analyzer
NN structure
Backend specific optimization
Estimate resources & performance
Mapping to back ends
Map to Different Platforms
Cloud or Edge
CPUs, GPUs
CPU + GPU
Neural network processor
Embedded GPU
Embedded FPGA
57. Inference at the edge
“How can I accelerate models for the edge?” -- Developer
Compile accelerated models for FPGAs, NVIDIA TX1/TX2* & Raspberry Pi*
*coming soon
58. Edge FPGA, e.g. TySOM-3 Embedded Prototyping Board
Features
TySOM-3-ZU7 is a compact prototyping board containing a Zynq UltraScale+ MPSoC device, which provides 64-bit processor scalability while combining real-time control with soft and hard engines for graphics, video, waveform, and packet processing.
The Xilinx Zynq UltraScale+ ZU7EV-FFVC1156 MPSoC contains a Video Codec Unit that supports H.264/H.265, and it has the biggest FPGA in the UltraScale+™ MPSoC family.
This chip includes a quad-core ARM Cortex-A53 as the Application Processing Unit, a dual-core ARM Cortex-R5 as the Real-Time Processing Unit, and an ARM Mali-400 MP2 as the Graphics Processing Unit.
59. Enterprise AI your way
Deep Learning Containers on AC922 with Kubernetes
60. PowerAI on IBM Cloud Private
Deployed on AC922
61. PowerAI on IBM Cloud Private
Deployed on AC922
62. H2O Driverless AI on IBM Cloud Private
Deployed on AC922
63. IBM AC922 Deep Learning Cluster Architecture Overview
Containerized environment - 40x NVIDIA Volta V100 GPUs
64. IBM AC922 Deep Learning System Cluster POD
40x NVIDIA Volta V100 GPUs
Editor's notes
This slide provides a physical view of the GPU. The top view shows the chip and regulators, and the bottom view shows the 800 pins of interconnect to the backplane. The upper right picture is a completed assembly with the heat sink added. The heat sink is required to cool the 300 watts of power in an air-cooled machine.
The IBM AC922 has a new Baseboard Management Controller (BMC) interface called OpenBMC. OpenBMC is a free open source management software Linux distribution of which IBM is a community leader, and it is gaining attention from users all over the marketplace. Quite simply, OpenBMC is the code stack used with the AC922's industry-standard BMC service processor controller.
Think of OpenBMC as analogous to the way your car is likely inspected in the shop. It used to be the case that you would bring your car into the shop when you heard a sound, or at some maintenance interval. Perhaps a mechanic would shine a light, diagnose, and investigate what was wrong with the car. Today, they simply plug a computer into the car’s port and it tells the mechanic what’s wrong (which begs the question why they are paid so much, but that’s a different conversation).
One of the major reasons why the marketplace is enthused about OpenBMC is the set of issues associated with the Intelligent Platform Management Interface (IPMI), a set of system-management-and-monitoring APIs typically implemented on server motherboards via an embedded system-on-chip (SoC) that functions completely outside of the host system's BIOS and operating system. While IPMI is intended as a convenience for those who must manage dozens or hundreds of servers in a remote facility, it has been called out for its potential as a serious hole in server security.
IPMI SoCs are known as baseboard management controllers (BMCs). The BMC is connected to most of the standard buses on the motherboard, so it can monitor temperature and fan sensors, storage devices and expansion cards, and even access the network (through its own virtual network interface that includes a separate MAC address). But BMCs almost invariably ship with a proprietary IPMI implementation, which is limited in functionality to what the vendor chooses. Furthermore, IPMI is riddled with poor security and thus leaves servers vulnerable to all sorts of attacks. Once the BMC has been compromised, the attacker has direct access to essentially every part of the server.
IBM pulled the OpenBMC project into a Design Thinking workshop and facilitated a group of external clients and contributors who helped enable the interface’s look and feel. When this was sent out for a broader set of reviews and followed up with the Net Promoter Score (NPS) questionnaire, it received a preliminary score of 100!
Learn more about OpenBMC at: https://lwn.net/Articles/683320/.
Roadmap to Containers: NVIDIA frameworks are being delivered via container strategy.
Data Science Apps and Value add tools = AI Vision, PIE, DSX, Anaconda = 28HC
ML/DL UI and Flow.... this row seems to be a double count. Parallel training is DDL. DLI is part of Spectrum CwS integration
DL Frameworks: 30HC
DDL: 11HC
Runtime Resources/WL = ~ Spark, CwC.Cfc = 6HC
Source: https://www.kaggle.com/surveys/2017
What data science methods are used at work?
Deep Learning is growing exponentially, but Machine Learning still has a strong foothold.
You can use PowerAI Vision for semi-automatic labeling.
The PowerAI Inference Engine can map trained AI models to all kinds of embedded devices and accelerators.
In this demo, the UI on the left is called PowerAI Inference Engine (PIE). It is a user interface designed for developers to compile compressed versions of trained neural networks. A large neural network needs to be compressed so that it can run with the same accuracy on less compute-intensive hardware such as an FPGA. PIE is available as a prototype for our customers to use.
The video on the right side shows inference of the model once the compressed model is imported. The card in red uses a Xilinx Zynq series chip, which is an FPGA.