5. Why Now?
Second – Perfectly timed arrival of data infrastructure, compute, network, and ML innovation:
- Improved data collection infrastructure: Hadoop with Apache Spark
- Improved computing power and network: faster GPUs, CPUs, and networks
- Advancement in ML frameworks: TensorFlow, PyTorch, Caffe, Theano
6. Beyond Hype… Solving Business Needs Across All Industry Verticals
Each AI task maps to a business question, with examples across Finance, Healthcare, Media & Entertainment, Security & Defense, Retail, and Manufacturing:

Detection – "Is 'it' present or not?": identify access anomalies (Finance); indication of anomaly in scan (Healthcare); content-based search (Media & Entertainment); identify security breaches (Security & Defense); events in store surveillance (Retail); detect manufacturing flaws (Manufacturing)

Classification – "What type of thing is 'it'?": fraud detection (Finance); diagnostics – tumor? (Healthcare); content labeling (Media & Entertainment); facial recognition (Security & Defense); returning vs. new shoppers (Retail); robots to track objects (Manufacturing)

Segmentation – "To what extent is 'it' present?": sentiment analysis (Finance); condition analysis (Healthcare); improved product placement (Media & Entertainment); crowd analytics (Security & Defense); segment customers by actions (Retail); sort components by quality (Manufacturing)

Natural Language Processing – "What is the interpretation?": chatbot advisors (Finance); expert diagnosis from notes (Healthcare); video captioning (Media & Entertainment); real-time language translation (Security & Defense); in-store personal assistants (Retail); assembly build instruction translation (Manufacturing)

Prediction – "What is the likely outcome?": credit profiling (Finance); length-of-stay forecasting (Healthcare); targeted content generation (Media & Entertainment); equipment health assessment (Security & Defense); customer churn and retention (Retail); proactive machine maintenance (Manufacturing)

Recommendations – "What will satisfy the objective?": algorithmic trading (Finance); treatment recommendations (Healthcare); content recommendations (Media & Entertainment); risk management (Security & Defense); "Magic Mirror" (Retail); assembly process improvements (Manufacturing)
7. High Level View of Workflow When Using AI/ML In Production
Running on a common data infrastructure, the workflow moves through four stages: Prepare Data, Train a Model, Evaluate the Model, and Deploy, Inference & Improve.

Prepare Data: Data is at the heart; more data trumps better algorithms, and the data and business questions determine the ML algorithm(s). Data can come from anywhere and is not usually in a state where it is ready to use for training ML models.
Train a Model: Build a model and feed it prepared training data so that it can learn to make inferences.
Evaluate the Model: Test the trained model's performance and accuracy by analyzing inference feedback.
Deploy, Inference & Improve: Deploy the model for inferencing; evaluate and improve accuracy by selecting different algorithms and/or retraining the model with more data.

Underpinning it all: Big Data for training data, an ML/DL framework, and distinct training and inference stages.
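The four stages above can be sketched end to end with a toy model. This is a minimal illustration, not part of the original material: the synthetic dataset, the scaler, and the choice of logistic regression are all assumptions made for the example.

```python
# Minimal sketch of the Prepare -> Train -> Evaluate -> Deploy loop,
# using scikit-learn on a synthetic dataset (illustrative choices throughout).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Prepare Data: raw data is rarely training-ready; here we split and scale it.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Train a Model: feed the model prepared training data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate the Model: test performance on held-out data.
acc = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {acc:.2f}")

# Deploy, Inference & Improve: serve predictions; if accuracy is too low,
# try different algorithms or retrain with more data.
new_sample = scaler.transform(X_test[:1])
print("inference:", model.predict(new_sample)[0])
```

In production, the deploy step would hand the fitted scaler and model to a serving layer, but the loop back from inference feedback to retraining is the same.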
Digital disruption and the resulting explosion in data, applications, and devices are disrupting traditional business models and making it harder than ever to compete successfully.
30M new devices connected every week
As just one indication of the security impact of all of those new devices, 4.2 billion web filtering blocks are created per day.
Each of those devices will be creating large quantities of data; in fact, 5TB+ of data will be created per person by 2020.
And that is just the tip of the iceberg: IoE devices will create 277x more data than end users.
And 40% of all data will be created from sensors in IoT devices by 2020
Insights held within these massive volumes of data have the potential to deliver a powerful competitive advantage.
The disruption caused by unprecedented growth in devices, mobile apps, data, use of cloud, and security risks is forcing organizations to redefine business processes to compete more effectively, which, in turn, is driving the need for unprecedented agility in their IT architectures and governance. To enable this digital transformation, IT must create and manage an infrastructure that extends from the enterprise data center to the cloud to the edge, delivering new levels of speed and efficiency and laying the foundation for ongoing innovation.
The business value of AI/ML isn’t a debate
Why is ML emerging now?
There are a couple of key reasons. First, it's getting really cheap to store data. From cheaper hardware to the spread of open-source technology like Hadoop to the growth of the cloud, it is now very cheap to store data, resulting in the rise of Big Data. Much of this data is excellent training material for learners and is being packaged for easy use, such as ImageNet and data.gov.
Second, training an ML model is computationally expensive. Fortunately, as seen in our second chart, better CPUs, GPUs, and cloud computing have massively reduced the price of compute.
Additionally, during this time, there have been mathematical advances that have made training NNs easier and more effective and a public-private ecosystem – among Universities, companies, and government – that has encouraged rapid progress in the field.
The combination of these trends creates an environment that is prime for the growth of machine learning.
As with many cloud and digital transformation initiatives, the line of business is driving modern AI. A successful AI application starts with a compelling business question. To be successful selling AI, you need to start outside of IT and focus on data and business professionals. Industries are creating specific data-driven roles; a good example is the Chief Medical Information Officer in healthcare.
Moving down and across this table:
“Is “it” present or not”, is asked in Finance to identify abnormal access to financial data
“What type of thing is “it”?”, is asked in healthcare to identify if a tumor is cancerous or benign
“To what extent is “it” present?”, is asked in Media and Entertainment to identify if a placed product is impactful
“What is the interpretation?”, is asked in Defense to translate a local dialect
“What is the likely outcome?” is asked in Retail to identify what customers will buy or not
“What will satisfy the objective?” is asked in Manufacturing to identify how to make robots work more quickly and build higher-quality products
Simple Learning: K-Nearest Neighbors, Ensemble Techniques, Support Vector Machines (SVMs), Random Forests, Decision Trees, Linear Regression, Bayesian Techniques
Deep Learning: Reinforcement Learning, Deep Neural Networks, Recurrent Neural Nets
Driving factors for Deep Learning: large-scale training data (Big Data), algorithmic innovation, and compute infrastructure
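To make the "Simple Learning" family concrete, here is a minimal sketch of the first method on the list, k-nearest neighbors, which classifies a point by majority vote of its k closest training examples. The toy points and the choice of k=3 are assumptions made purely for illustration.

```python
# k-nearest neighbors: classify by majority vote among the k nearest
# training points (toy 2-D data; class 0 near (0,0), class 1 near (5,5)).
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))  # -> [0 1]
```

Methods like this need little data and compute; the deep learning methods on the next line trade that simplicity for accuracy on large-scale data, which is exactly why the Big Data, algorithmic, and compute trends above matter.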
We believe there are a few key challenges when it comes to AI/ML. The first challenge is that traditional data center technology is not designed to handle the data volume, velocity, and variability of AI at production scale. IT teams are struggling to keep up with data scientists' needs, while data scientists are struggling to operationalize machine learning; new data sources and rapidly shifting software stacks require infrastructure that constantly adapts to these new data-shaped workloads. The massive amount of data required and the speed of ingestion also demand the bandwidth to move the data consumed by AI through all stages of the data lifecycle.
“Data is the new Application and IT needs new tools to deliver AI at scale.”
The second challenge is that many IT departments regard data science software as unstable, unsupported, and undebuggable. They are not wrong. The AI ecosystem is still in its early stages, resulting in fragmented, unfamiliar, and rapidly evolving machine learning software stacks that increase complexity and risk. There is also a skills shortage: there are not enough data scientists and data engineers.
“We’ve taken on the challenge to demystify AI/ML with proven, full-stack solutions developed with industry leaders.”
The third challenge is that the data center follows the data: IT must integrate new accelerated computing technology across an increasingly distributed landscape. The growing need for new platforms, and for data to be analyzed at the source, in the data center, and at the edge, can spawn islands of standalone AI/ML servers. IT leaders are not being permitted to add administrative resources as these projects scale to thousands of nodes, and roughly 50% of the data center IT budget already goes toward people and software to manage existing infrastructure. A wide variety of traditional servers, software systems, and storage has been implemented in silos, and most system management functions are complicated attempts to integrate solutions that were never designed to work together. Hence, customers need management solutions that use a common operating model and enable the same resources to manage a distributed, scale-out infrastructure.
“The data center is where the data is. As we take you through this briefing you will see how Cisco pairs a full portfolio of AI/ML computing solutions with the simplicity and reach of cloud-managed infrastructure.”
Let’s provide some additional detail on what we are announcing. As I mentioned, in addition to the current portfolio of UCS and Hyperflex systems with GPU support, we are introducing the new UCS C480 ML for model training. Again, I have a couple of detailed slides on this system.
We are working with both software partners and channel partners to help IT simplify the deployment of ML infrastructure. We are collaborating with software vendors to provide proven solutions for AI/ML projects, while collaborating with channel partners – Presidio, WWT, E+, and others – to build expertise so they can help customers close the skills gap by delivering services to accelerate ML projects.
Meet the new UCS C480 ML AI server. It is intuitive to IT but packs the punch of a supercomputer: highly integrated yet flexible, providing enterprise features like choice of OS and different storage configurations for data performance and protection.
If we want to map UCS to a deep learning project, we could start with Hadoop solutions built on a C240 cluster to collect and examine data, to see if we have the data to support answering the compelling business question. Once we know what we want to do, instead of making a huge investment in time and money, we could deploy a small HX cluster with all of the GPUs and tools required to start development and testing of AI applications. Once we start building more complex AI applications in production, we can offload training to an AI supercomputer like the C480 ML so that we can keep up with business demands. If we want to deploy trained AI applications for inference outside of the data center, at a remote location like a retail store or manufacturing floor, we can use a small but efficient all-in-one hyperconverged solution equipped with inference-optimized GPUs, like the HX220c M5.
Now, let’s take a look at the products that we are already shipping and some that are coming.
Stat: $1.2 trillion - Forrester - https://www.forrester.com/report/Predictions+2017+Artificial+Intelligence+Will+Drive+The+Insights+Revolution/-/E-RES133325
Stat: 8 out of 10 - Oracle - https://www.oracle.com/webfolder/s/delivery_production/docs/FY16h1/doc35/CXResearchVirtualExperiences.pdf
Stat: 40% - Accenture - https://www.accenture.com/us-en/insight-artificial-intelligence-future-growth
Why Cisco?
To address this emerging opportunity and the associated challenges, we are focusing on the full life cycle of an AI/ML project. First, our experience in big data has helped many customers integrate changing data sources as part of a dynamic data pipeline. We have been helping customers extend their big data environments into AI/ML by purchasing UCS servers, populating them with GPUs, and connecting them to their data lakes.
We have consistently taken a no-compromise approach to computing by developing highly available, richly configured systems, all based on a unified architecture, that can be seamlessly managed with existing infrastructure. Our existing UCS and HyperFlex systems have robust GPU support that can address all stages of an AI/ML project, so organizations can capitalize on the adaptability and programmability of the Cisco Unified Computing System and power AI workloads at scale. In a couple of weeks, we will expand the UCS portfolio with a new computing system optimized for machine learning and targeted specifically at the model training stage. We will cover that system in more detail in a moment.
Second, we are bringing our experience from big data and the work we have done there with technology partners to bridge the gap to AI/ML. Using this proven approach, we are going to help IT demystify AI/ML in the data center with proven AI/ML computing solutions that combine a broad set of technologies and applications to help extract more intelligence out of all stages of the data life cycle while ensuring a faster, more reliable, and predictable deployment.
Third, we can help prevent architectural silos, extend administrative expertise and simplify operations with a cloud managed computing system. With Cisco Intersight, we can make it easy to adopt new technologies anywhere, eliminating islands of standalone AI/ML servers, regardless of where they are located, in the data center, multi-site remote/branch, or the edge.
One final point: to move the amount of data consumed by AI from its collection point to the models and back to where inferencing is taking place, who better than Cisco, the worldwide leader in networking technology, to provide the bandwidth and security required?
With big data, insight is gained and recommendations are provided through analysis of the data. For example, Amazon recommends items to purchase based on your previous purchases and what other people purchased.
With AI, it’s about machines making decisions based on the data. As data moves from one stage of the data life cycle to the next, it requires work to prepare it for the next stage.
So, in addition to validating popular machine learning stacks, such as Kubeflow from Google, which enables the creation of symmetric development and execution environments between on-premise and Google Cloud, we are also creating solutions that extend and integrate Hadoop with machine learning, making data located in data lakes accessible for the next stage of analytics.
We feel it’s crucial for IT organizations, in order to have a faster path to success, to work with a vendor that has experience and a track record of collaborating with a broad ecosystem and can bring together the various components of AI in a more holistic solution.
Problem: UCS Big Data Customers want deep learning to refine data into value
Description: Hadoop loosely coupled with GPU nodes for deep learning with Jupyter notebooks
Solution: CVD
Problem: Enabling technically savvy data scientists to leverage Google DL capabilities on premises
Description: Composable, portable, scalable ML Stack Enabling rapid development AND deployment
Solution: CVD for UCS and HX
Problem: How to better integrate Hadoop and DL?
Description: Docker based application support
YARN scheduling CPU & GPU
HDFS with erasure coding supporting tiered storage (1.7 copies instead of 3)
Hot swap drives for HDFS
Let’s reflect back on our challenges slide, where the key challenges outlined were data volume, velocity, and variability changing application behavior and placing new demands on infrastructure; unfamiliar software; and the management of distributed and new technologies.