How to use Apache TVM to
optimize your ML models
Sameer Farooqui
Product Marketing Manager, OctoML
Faster inference in the cloud and at the edge
2
Faster Artificial Intelligence Everywhere
3
Optimizing Deep Learning Compiler
siliconANGLE
● “...cross-platform model compilers [...] are harbingers of
the new age in which it won’t matter what front-end tool
you used to build your AI algorithms and what back-end
clouds, platforms or chipsets are used to execute them.”
● “Cross-platform AI compilers will become standard
components of every AI development environment,
enabling developers to access every deep learning
framework and target platform without having to know
the technical particular of each environment.”
● “...within the next two to three years, the AI industry will
converge around one open-source cross-compilation
supported by all front-end and back-end environments”
4
Read the article
April 2018
Quotes from article:
Venture Beat
“With PyTorch and TensorFlow, you’ve seen the frameworks
sort of converge. The reason quantization comes up, and a
bunch of other lower-level efficiencies come up, is because
the next war is compilers for the frameworks — XLA, TVM,
PyTorch has Glow, a lot of innovation is waiting to happen,” he
said.
“For the next few years, you’re going to see … how to
quantize smarter, how to fuse better, how to use GPUs more
efficiently, [and] how to automatically compile for new
hardware.”
5
Read the article
Quote from Soumith Chintala:
(co-creator of PyTorch and distinguished engineer at Facebook AI)
Jan 2020
This Talk
6
● What is a ML Compiler?
● How TVM works
● TVM use cases
● OctoML Product Demo
7
Source code
Classical Compiler
Frontend Optimizer Backend Machine code
8
Classical Compiler
Source languages: C, Fortran, Ada
Frontends: C Frontend, Fortran Frontend, Ada Frontend
Common Optimizer
Backends: x86 Backend, PowerPC Backend, Arm Backend
Targets: x86, PowerPC, Arm
Source: The Architecture of Open Source Applications
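The frontend / optimizer / backend split can be sketched in a few lines of Python. This is a toy illustration only, not how a production compiler is built: the tuple-based IR and the names `frontend_infix`, `frontend_prefix`, `optimize`, `backend_stack`, and `backend_python` are all hypothetical. It shows two frontends and two backends sharing one optimizer.

```python
# Toy three-phase compiler (hypothetical names throughout).
# The IR is a nested tuple: ("num", 3), ("var", "x"), or ("add" | "mul", lhs, rhs).

def frontend_infix(src):
    """Parse a tiny infix language such as 'x * 2 + 3' (only + and *, no parentheses)."""
    def atom(tok):
        return ("num", int(tok)) if tok.isdigit() else ("var", tok)
    terms = []
    for term in src.split("+"):
        factors = [atom(t.strip()) for t in term.split("*")]
        node = factors[0]
        for f in factors[1:]:
            node = ("mul", node, f)
        terms.append(node)
    node = terms[0]
    for t in terms[1:]:
        node = ("add", node, t)
    return node

def frontend_prefix(src):
    """Parse a tiny Lisp-like language such as '(+ (* x 2) 3)'."""
    tokens = src.replace("(", " ( ").replace(")", " ) ").split()
    def parse(pos):
        tok = tokens[pos]
        if tok == "(":
            op = {"+": "add", "*": "mul"}[tokens[pos + 1]]
            lhs, pos = parse(pos + 2)
            rhs, pos = parse(pos)
            return (op, lhs, rhs), pos + 1  # skip the closing ")"
        return (("num", int(tok)) if tok.isdigit() else ("var", tok)), pos + 1
    return parse(0)[0]

def optimize(node):
    """Shared optimizer: fold operations whose operands are both constants."""
    if node[0] in ("num", "var"):
        return node
    op, lhs, rhs = node[0], optimize(node[1]), optimize(node[2])
    if lhs[0] == "num" and rhs[0] == "num":
        return ("num", lhs[1] + rhs[1] if op == "add" else lhs[1] * rhs[1])
    return (op, lhs, rhs)

def backend_stack(node):
    """Emit instructions for a hypothetical stack machine."""
    if node[0] == "num":
        return [f"PUSH {node[1]}"]
    if node[0] == "var":
        return [f"LOAD {node[1]}"]
    return backend_stack(node[1]) + backend_stack(node[2]) + [node[0].upper()]

def backend_python(node):
    """Emit a Python expression string."""
    if node[0] in ("num", "var"):
        return str(node[1])
    sym = "+" if node[0] == "add" else "*"
    return f"({backend_python(node[1])} {sym} {backend_python(node[2])})"

# Two source languages reach the same optimized IR, which feeds either backend.
ir_a = optimize(frontend_infix("x * 2 + 3 * 4"))
ir_b = optimize(frontend_prefix("(+ (* x 2) (* 3 4))"))
```

Adding a new source language means writing one frontend, and adding a new target means writing one backend; the optimizer is reused in both cases.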
9
Deep Learning Compiler
Neural network inputs: PyTorch, TensorFlow, ONNX
Optimizing Compiler
Targets and outputs:
CPUs: CPU optimized runtime
GPUs: GPU optimized runtime
Accelerators: Accelerator optimized runtime
TVM:
11
An Automated End-to-End Optimizing
Compiler for Deep Learning
● “There is an increasing need to bring machine
learning to a wide diversity of hardware
devices”
● TVM is “a compiler that exposes graph-level
and operator-level optimizations to provide
performance portability to deep learning
workloads across diverse hardware back-
ends”
● “Experimental results show that TVM delivers
performance across hardware back-ends that
are competitive with state-of-the-art, hand-
tuned libraries for low-power CPU, mobile
GPU, and server-class GPUs”
Read the paper
Feb 2018
Relay:
12
A High-level Compiler for
Deep Learning
● Relay is “a high-level IR that enables end-to-
end optimization of deep learning models for a
variety of devices”
● “Relay's functional, statically typed
intermediate representation (IR) unifies and
generalizes existing DL IRs to express state-of-
the-art models”
● “With its extensible design and expressive
language, Relay serves as a foundation for
future work in applying compiler techniques to
the domain of deep learning systems”
Read the paper
April 2019
Ansor:
13
Generating High-Performance
Tensor Programs for Deep Learning
● “...obtaining performant tensor programs for
different operators on various hardware
platforms is notoriously challenging”
● Ansor is “a tensor program generation
framework for deep learning applications”
● “Ansor can find high-performance programs
that are outside the search space of existing
state-of-the-art approaches”
● “We show that Ansor improves the execution
performance of deep neural networks relative
to the state-of-the-art on the Intel CPU, ARM
CPU, and NVIDIA GPU by up to 3.8×, 2.6×, and
1.7×, respectively”
Read the paper
Nov 2020
14
Thank you Apache TVM contributors! 500+!
Who is using TVM?
15
Every Alexa wake-up today across all
devices uses a TVM-optimized model
“At Facebook, we've been contributing to
TVM for the past year and a half or so, and
it's been a really awesome experience”
“We're really excited about the performance
of TVM.” - Andrew Tulloch, AI Researcher
Bing query understanding: 3x faster on CPU
QnA bot: 2.6x faster on CPU, 1.8x faster on GPU
Who attended TVM Conf 2020?
16
950+ attendees
17
Deep Learning Systems Landscape (open source)
Orchestrators
Frameworks
Accelerators
Vendor
Libraries
Hardware
NVIDIA cuDNN Intel oneDNN Arm Compute Library
CPUs GPUs Accelerators
18
Graph Level Optimizations
Rewrites dataflow graphs (nodes and edges) to simplify
the graph and reduce device peak memory usage
Operator Level Optimizations
Hardware target-specific low-level optimizations for
individual operators/nodes in the graph.
Efficient Runtime
TVM optimized models run in the lightweight TVM Runtime
System, providing a minimal API for loading and executing
the model in Python, C++, Rust, Go, Java or Javascript
How does TVM work?
Deep Learning Operators
19
● Deep Neural Networks look like Directed Acyclic Graphs (DAGs)
● Operators are the building blocks (nodes) of neural network models
● Network edges represent data flowing between operators
Convolution
Broadcast Add
Matrix
Multiplication
Pooling
Batch
Normalization
ArgMin/ArgMax
Dropout
DynamicQuantizeLinear
Gemm
LSTM
LeakyRelu
Softmax
OneHotEncoder
RNN
Sigmoid
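As a minimal sketch of the "operators as nodes, data as edges" view, the following plain-Python example evaluates a three-operator graph. The graph encoding, the `OPS` table, and the `run` helper are hypothetical and chosen for illustration; they are not TVM data structures.

```python
import math

# Hypothetical operator table: each operator works on plain Python lists.
OPS = {
    "add": lambda a, b: [x + y for x, y in zip(a, b)],
    "relu": lambda a: [max(0.0, x) for x in a],
    "softmax": lambda a: [math.exp(x) / sum(math.exp(y) for y in a) for x in a],
}

# Each node maps to (operator name, names of the nodes feeding it).
# The input lists are the edges: they say which data flows into the operator.
graph = {
    "sum": ("add", ["input", "bias"]),
    "act": ("relu", ["sum"]),
    "prob": ("softmax", ["act"]),
}

def run(graph, feeds):
    """Evaluate every node; depth-first evaluation respects the DAG's dependencies."""
    values = dict(feeds)
    def evaluate(name):
        if name not in values:
            op, inputs = graph[name]
            values[name] = OPS[op](*(evaluate(i) for i in inputs))
        return values[name]
    for name in graph:
        evaluate(name)
    return values

out = run(graph, {"input": [1.0, -2.0, 0.5], "bias": [0.5, 0.5, 0.5]})
```

Because the graph is acyclic, each operator runs exactly once, after all of its inputs are available.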
20
1. PyTorch / TensorFlow / ONNX
2. Relay
3. TE + Computation
4. AutoTVM / Auto-scheduler
5. TE + Schedule
6. TIR
7. Hardware Specific Compiler
TVM Internals
21
Relay
● Relay has a functional, statically typed intermediate representation (IR)
22
Auto-scheduler (a.k.a. Ansor)
● Auto-scheduler (2nd gen) replaces AutoTVM
● Auto-scheduler/Ansor aims to be a fully automated scheduler that generates
high-performance code for tensor computations, without manual templates
● Auto-scheduler can achieve better performance with faster search time in
a more automated way because of innovations in search space construction and
the search algorithm
● Goal: Automatically turn tensor operations (like matmul or conv2d) into an efficient code
implementation
● AutoTVM (1st gen): template-based search algorithm to find efficient implementations for
tensor operations.
○ required domain experts to write a manual template for every operator on every
platform, > 15k loc in TVM
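The core idea of schedule search, picking the best of many candidate implementations of one tensor operation, can be illustrated with a toy example. This sketch is not Ansor: the real auto-scheduler uses a learned cost model and measurements on the target device, while the analytic `estimated_cost` function, the `cache_elems` parameter, and the exhaustive `search` below are hypothetical stand-ins.

```python
import itertools

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def estimated_cost(m, n, k, tile_m, tile_n, cache_elems=4096):
    """Hypothetical cost: elements read from memory for a tiled (m x k) @ (k x n) matmul.

    Each (tile_m x tile_n) output tile reads a tile_m x k strip of A and a
    k x tile_n strip of B. Tiles whose working set exceeds the cache are penalized.
    """
    tiles = (m // tile_m) * (n // tile_n)
    traffic = tiles * (tile_m * k + k * tile_n)
    working_set = tile_m * tile_n + tile_m + tile_n
    penalty = 10 if working_set > cache_elems else 1
    return traffic * penalty

def search(m, n, k):
    """Score every tiling exhaustively; a real tuner samples this space instead."""
    candidates = itertools.product(divisors(m), divisors(n))
    return min(candidates, key=lambda t: estimated_cost(m, n, k, *t))

best = search(256, 256, 256)
naive = estimated_cost(256, 256, 256, 1, 1)   # untiled loop nest
tuned = estimated_cost(256, 256, 256, *best)  # best tiling found
```

Under this made-up cost model the search prefers the largest tiles that still fit in the assumed cache. A template-based tuner would need the tiling structure written by hand per operator, whereas a template-free one derives the candidate space itself.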
Collaborators:
23
AutoTVM vs Auto-scheduler
Source: Apache TVM Blog: Introducing Auto-scheduler
24
Auto-scheduler’s Search Process
Source: Apache TVM Blog: Introducing Auto-scheduler
25
Benchmarks: AutoTVM vs Auto-scheduler
Source: Apache TVM Blog: Introducing Auto-scheduler
Code Performance
Comparison
(higher is better)
Search Time
Comparison
(lower is better)
26
Auto-scheduling on Apple M1
Source: OctoML Blog: Beating Apple's CoreML4
(lower is better)
● 22% faster on CPU
● 49% faster on GPU
How?
- Effective Auto-scheduler
searching
- Fuse qualified subgraphs
Relay
27
Conv2d
bias
+ relu ...
Conv2d
bias
+ relu
Conv2d
bias
+ relu ...
Conv2d
bias
+ relu
Relay: Fusion
28
Combine into a single fused
operation which can then be
optimized specifically for your
target.
Conv2d
bias
+ relu ...
Conv2d
bias
+ relu
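A toy version of fusion, assuming a model that is a simple linear chain of operator names, might look like the sketch below. The helper names are hypothetical, and Relay's actual fusion pass works on a full dataflow graph rather than a list. The second half shows why fusion helps: the fused bias + relu makes one pass over the data and needs no intermediate buffer.

```python
def fuse_chain(ops, pattern=("conv2d", "bias_add", "relu")):
    """Replace each consecutive occurrence of `pattern` with a single fused op name."""
    fused, i = [], 0
    while i < len(ops):
        if tuple(ops[i:i + len(pattern)]) == pattern:
            fused.append("fused_" + "_".join(pattern))
            i += len(pattern)
        else:
            fused.append(ops[i])
            i += 1
    return fused

def bias_relu_unfused(x, b):
    tmp = [v + b for v in x]             # intermediate buffer written out
    return [max(0.0, v) for v in tmp]    # second pass reads it back

def bias_relu_fused(x, b):
    return [max(0.0, v + b) for v in x]  # one pass, no intermediate buffer

model = ["conv2d", "bias_add", "relu", "conv2d", "bias_add", "relu", "softmax"]
fused_model = fuse_chain(model)
data = [-1.5, 0.25, 3.0]
```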
Relay: Device Placement
30
Partition your network to run
on multiple devices.
CPU
GPU
Conv2d
bias
+ relu ...
Conv2d
bias
+ relu
Relay: Layout Transformation
31
Generate efficient code for
different data layouts. NCHW
NCHW
Conv2d
bias
+ relu ...
Conv2d
bias
+ relu
Relay: Layout Transformation
32
Generate efficient code for
different data layouts. NHWC
NHWC
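To make the layout change concrete, here is a small plain-Python sketch that reorders a flat buffer, assuming the two layouts on these slides are the common NCHW (channels before height and width) and NHWC (channels last) conventions. The function name `nchw_to_nhwc` is hypothetical.

```python
def nchw_to_nhwc(data, n, c, h, w):
    """Reorder a flat row-major NCHW buffer into a flat row-major NHWC buffer."""
    out = [None] * len(data)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * c + ci) * h + hi) * w + wi  # NCHW offset
                    dst = ((ni * h + hi) * w + wi) * c + ci  # NHWC offset
                    out[dst] = data[src]
    return out

# One 2x2 image with 3 channels: channel 0 holds 0..3, channel 1 holds 10..13,
# and channel 2 holds 20..23.
nchw = [0, 1, 2, 3, 10, 11, 12, 13, 20, 21, 22, 23]
nhwc = nchw_to_nhwc(nchw, n=1, c=3, h=2, w=2)
```

In NHWC all channel values of a pixel sit next to each other in memory, which suits some targets better than NCHW. A compiler can generate operator code specialized for whichever layout it selects.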
TIR Script
● TIR provides more flexibility than high
level tensor expressions.
● Not everything is expressible in TE and
auto-scheduling is not always perfect.
○ AutoScheduling 3.0 (code-named AutoTIR) is
coming later this year
○ We can also write TIR
directly using TIRScript.
33
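# Imports assumed for the early-2021 TVMScript dialect used on this slide
import tvm
from tvm import tir
from tvm.script import ty

@tvm.script.tir
def fuse_add_exp(a: ty.handle, c: ty.handle) -> None:
    # Bind the handle arguments to 64-element buffers
    A = tir.match_buffer(a, (64,))
    C = tir.match_buffer(c, (64,))
    # Intermediate buffer holding A + 1
    B = tir.alloc_buffer((64,))
    with tir.block([64], "B") as [vi]:
        B[vi] = A[vi] + 1
    with tir.block([64], "C") as [vi]:
        C[vi] = exp(B[vi])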
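For readers without TVM installed, the computation described by the two blocks in the TIR script above can be written as a plain-Python reference. This is a sketch of the semantics only; the function name `fuse_add_exp_reference` is hypothetical and nothing here uses TVM.

```python
import math

def fuse_add_exp_reference(a):
    """Plain-Python reference for the two blocks of the TIR script."""
    assert len(a) == 64
    b = [0.0] * 64          # plays the role of the intermediate buffer B
    c = [0.0] * 64          # plays the role of the output buffer C
    for vi in range(64):    # block "B"
        b[vi] = a[vi] + 1
    for vi in range(64):    # block "C"
        c[vi] = math.exp(b[vi])
    return c

result = fuse_add_exp_reference([0.0] * 64)
```

With an all-zero input, every output element is exp(1).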
Select Performance
Results
34
Faster Kernels for Dense-
Sparse Multiplication
● Performance comparison on
PruneBERT
● 3-10x faster than cuBLAS and
cuSPARSE.
● 1 engineer writing TensorIR kernels
35
Model x hardware comparison points
Performance at OctoML in 2020
Over 60 model x hardware benchmarking
studies
Each study compared TVM against best*
baseline on the target
Sorted by ascending log2 gain over baseline
36
TVM log2 fold improvement
over baseline
Model x hardware comparison points
37
TVM log2 fold improvement
over baseline
Across a broad variety of models and platforms
2.5x average performance improvement on non-public models
(2.1x across all)
Model x hardware comparison points
38
TVM log2 fold improvement
over baseline
Across a broad variety of models and platforms
34x for Yolo-V3 on a MIPS based camera platform
5.3x: video analysis model on Nvidia T4 against TensorRT
4x: random forest on Nvidia 1070 against XGBoost
2.5x: MobilenetV3 on ARM A72 CPU
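The chart plots the log2 of each speedup, so a value of 1 means twice as fast as the baseline and 0 means parity. A short sketch of that conversion for the four speedups quoted on this slide (the dictionary labels are paraphrased from the slide):

```python
import math

# Speedups quoted on the slide (TVM versus the baseline on each target).
speedups = {
    "Yolo-V3 on MIPS based camera platform": 34.0,
    "Video analysis model on Nvidia T4 vs TensorRT": 5.3,
    "Random forest on Nvidia 1070 vs XGBoost": 4.0,
    "MobilenetV3 on ARM A72 CPU": 2.5,
}

# log2 fold improvement: log2(speedup)
log2_gain = {name: math.log2(s) for name, s in speedups.items()}

# Sort ascending, as on the chart.
for name, gain in sorted(log2_gain.items(), key=lambda kv: kv[1]):
    print(f"{gain:5.2f}  {name}")
```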
Case Study: 90% cloud
inference cost reduction
Background
● Top 10 Tech Company running multiple
variations of customized CV models
● Model in batch processing /offline mode
using standard HW targets of a major
public cloud.
● Billions of inferences per month
● Benchmarking on CPU and GPU
Results
● 3.8x - TensorRT 8bit to TVM 8bit
● 10x - TensorRT 8bit to TVM 4bit
● Potential to reduce hourly costs by 90%
41
*V100 at an hourly price of $3.00, T4 at $0.53
Up to 10X inferences/dollar increase
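The link between a 10x speedup and a 90% cost reduction is simple arithmetic, sketched below. The hourly prices come from the slide footnote; the throughput figures are hypothetical and used only to show the calculation.

```python
def cost_reduction(speedup):
    """Fraction of cost saved when the same hardware does `speedup` times more work per hour."""
    return 1.0 - 1.0 / speedup

def inferences_per_dollar(inferences_per_second, hourly_price):
    return inferences_per_second * 3600 / hourly_price

V100_PRICE, T4_PRICE = 3.00, 0.53  # hourly prices from the slide footnote

# Hypothetical throughputs (inferences per second), for illustration only.
baseline = inferences_per_dollar(1000, V100_PRICE)
optimized = inferences_per_dollar(10_000, V100_PRICE)
same_speed_on_t4 = inferences_per_dollar(1000, T4_PRICE)
```

Here `cost_reduction(10)` is 0.9, matching the 90% figure, and `cost_reduction(3.8)` is roughly 0.74. A cheaper instance also raises inferences per dollar if the model runs at the same speed on it.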
See https://github.com/tlc-pack/tlcbench for benchmark scripts
42
Results: TVM on CPU and GPU
Intel X86 - 2-5X Performance
20-core Intel Platinum 8269CY, fp32 performance data
(chart: normalized performance)
NVIDIA GPU - 20-50% versus TensorRT
V100, fp32 performance data
(chart: normalized performance)
Why use the Octomizer vs “just” TVM OSS?
43
Octomizer
Compile
Optimize
Benchmark
Model x HW
analytics data
ML Performance
Model
● Access to OctoML’s “cost models”
○ We aggregate Models x HW data
○ Continuous improvement
● No need to install any SW, latest TVM
● No need to set up benchmarking HW
● “Outer loop” automation
○ optimize/package multiple models against
many HW targets in one go
● Access to comprehensive benchmarking data
○ E.g., for procurement, for HW vendor
competitive analysis
● Access to OctoML support
44
Octomizer Live Demo
API access
Waitlist! octoml.ai
45
The Octonauts!
You?
View career opportunities at
octoml.ai/careers
Thank you!

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 

Recently uploaded (20)

Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 

How to use Apache TVM to optimize your ML models

  • 1. How to use Apache TVM to optimize your ML models Sameer Farooqui Product Marketing Manager, OctoML Faster inference in the cloud and at the edge
  • 4. siliconANGLE, April 2018. Quotes from the article: ● “...cross-platform model compilers [...] are harbingers of the new age in which it won’t matter what front-end tool you used to build your AI algorithms and what back-end clouds, platforms or chipsets are used to execute them.” ● “Cross-platform AI compilers will become standard components of every AI development environment, enabling developers to access every deep learning framework and target platform without having to know the technical particular of each environment.” ● “...within the next two to three years, the AI industry will converge around one open-source cross-compilation supported by all front-end and back-end environments”
  • 5. Venture Beat, Jan 2020. Quote from Soumith Chintala (co-creator of PyTorch and distinguished engineer at Facebook AI): “With PyTorch and TensorFlow, you’ve seen the frameworks sort of converge. The reason quantization comes up, and a bunch of other lower-level efficiencies come up, is because the next war is compilers for the frameworks — XLA, TVM, PyTorch has Glow, a lot of innovation is waiting to happen,” he said. “For the next few years, you’re going to see … how to quantize smarter, how to fuse better, how to use GPUs more efficiently, [and] how to automatically compile for new hardware.”
  • 6. This Talk ● What is an ML Compiler? ● How TVM works ● TVM use cases ● OctoML Product Demo
  • 7. Classical Compiler: Source code → Frontend → Optimizer → Backend → Machine code
  • 8. Classical Compiler: C, Fortran, and Ada frontends → Common Optimizer → PowerPC, x86, and Arm backends. Source: The Architecture of Open Source Applications
  • 9. Deep Learning Compiler: neural networks from PyTorch, TensorFlow, and ONNX → Optimizing Compiler → CPU optimized runtime (CPUs), GPU optimized runtime (GPUs), Accelerator optimized runtime (Accelerators)
  • 10. Deep Learning Compiler: neural networks from PyTorch, TensorFlow, and ONNX → CPU optimized runtime (CPUs), GPU optimized runtime (GPUs), Accelerator optimized runtime (Accelerators)
  • 11. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning (paper, Feb 2018) ● “There is an increasing need to bring machine learning to a wide diversity of hardware devices” ● TVM is “a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends” ● “Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-of-the-art, hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPUs”
  • 12. Relay: A High-level Compiler for Deep Learning (paper, April 2019) ● Relay is “a high-level IR that enables end-to-end optimization of deep learning models for a variety of devices” ● “Relay's functional, statically typed intermediate representation (IR) unifies and generalizes existing DL IRs to express state-of-the-art models” ● “With its extensible design and expressive language, Relay serves as a foundation for future work in applying compiler techniques to the domain of deep learning systems”
  • 13. Ansor: Generating High-Performance Tensor Programs for Deep Learning (paper, Nov 2020) ● “...obtaining performant tensor programs for different operators on various hardware platforms is notoriously challenging” ● Ansor is “a tensor program generation framework for deep learning applications” ● “Ansor can find high-performance programs that are outside the search space of existing state-of-the-art approaches” ● “We show that Ansor improves the execution performance of deep neural networks relative to the state-of-the-art on the Intel CPU, ARM CPU, and NVIDIA GPU by up to 3.8×, 2.6×, and 1.7×, respectively”
  • 14. Thank you, Apache TVM contributors! 500+
  • 15. Who is using TVM? ● Every Alexa wake-up today across all devices uses a TVM-optimized model ● “At Facebook, we've been contributing to TVM for the past year and a half or so, and it's been a really awesome experience” “We're really excited about the performance of TVM.” - Andrew Tulloch, AI Researcher ● Bing query understanding: 3x faster on CPU ● QnA bot: 2.6x faster on CPU, 1.8x faster on GPU
  • 16. Who attended TVM Conf 2020? 950+ attendees
  • 17. Deep Learning Systems Landscape (open source). Layers: Orchestrators; Frameworks; Accelerators; Vendor Libraries (NVIDIA cuDNN, Intel oneDNN, Arm Compute Library); Hardware (CPUs, GPUs, Accelerators)
  • 18. How does TVM work? Graph Level Optimizations: Rewrites dataflow graphs (nodes and edges) to simplify the graph and reduce device peak memory usage. Operator Level Optimizations: Hardware target-specific low-level optimizations for individual operators/nodes in the graph. Efficient Runtime: TVM-optimized models run in the lightweight TVM Runtime System, providing a minimal API for loading and executing the model in Python, C++, Rust, Go, Java or JavaScript.
  • 19. Deep Learning Operators ● Deep Neural Networks look like Directed Acyclic Graphs (DAGs) ● Operators are the building blocks (nodes) of neural network models ● Network edges represent data flowing between operators. Example operators: Convolution, Broadcast Add, Matrix Multiplication, Pooling, Batch Normalization, ArgMin/ArgMax, Dropout, DynamicQuantizeLinear, Gemm, LSTM, LeakyRelu, Softmax, OneHotEncoder, RNN, Sigmoid
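The DAG view of a network on slide 19 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how TVM represents models: the node names, the `nodes` table, and the `run` helper are hypothetical, and the three operators are toy stand-ins acting on lists of floats. Nodes are operators, edges are the data dependencies, and execution follows a topological order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy model: node name -> (operator, names of input nodes).
nodes = {
    "x": (None, []),  # graph input, fed by the caller
    "scaled": (lambda a: [2.0 * v for v in a], ["x"]),
    "summed": (lambda a, b: [p + q for p, q in zip(a, b)], ["x", "scaled"]),
    "out": (lambda a: [max(v, 0.0) for v in a], ["summed"]),  # ReLU
}

def run(nodes, feeds):
    """Evaluate every operator once, in dependency (topological) order."""
    deps = {name: set(inputs) for name, (_, inputs) in nodes.items()}
    values = dict(feeds)
    for name in TopologicalSorter(deps).static_order():
        op, inputs = nodes[name]
        if op is not None:
            values[name] = op(*(values[i] for i in inputs))
    return values

result = run(nodes, {"x": [-1.0, 2.0]})["out"]
print(result)  # [0.0, 6.0]
```

Graph-level optimizations rewrite a structure like `nodes`; operator-level optimizations change how each individual operator is implemented.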
  • 20. TVM Internals (pipeline diagram with seven numbered steps): PyTorch / TensorFlow / ONNX → Relay → TE + Computation → AutoTVM / Auto-scheduler → TE + Schedule → TIR → Hardware Specific Compiler
  • 21. Relay ● Relay has a functional, statically typed intermediate representation (IR)
  • 22. Auto-scheduler (a.k.a. Ansor) ● Auto-scheduler (2nd gen) replaces AutoTVM ● Auto-scheduler/Ansor aims to be a fully automated scheduler for generating high-performance code for tensor computations, without manual templates ● Auto-scheduler can achieve better performance with faster search time in a more automated way because of innovations in search space construction and search algorithm ● Goal: Automatically turn tensor operations (like matmul or conv2d) into efficient code implementations ● AutoTVM (1st gen): template-based search algorithm to find efficient implementations for tensor operations ○ required domain experts to write a manual template for every operator on every platform, more than 15k lines of code in TVM
  • 23. AutoTVM vs Auto-scheduler. Source: Apache TVM Blog: Introducing Auto-scheduler
  • 24. Auto-scheduler’s Search Process. Source: Apache TVM Blog: Introducing Auto-scheduler
  • 25. Benchmarks: AutoTVM vs Auto-scheduler. Charts: Code Performance Comparison (higher is better); Search Time Comparison (lower is better). Source: Apache TVM Blog: Introducing Auto-scheduler
  • 26. Auto-scheduling on Apple M1 (chart: lower is better) ● 22% faster on CPU ● 49% faster on GPU. How? ● Effective Auto-scheduler searching ● Fuse qualified subgraphs. Source: OctoML Blog: Beating Apple's CoreML4
  • 28. Relay: Fusion. Combine into a single fused operation which can then be optimized specifically for your target. Diagram: Conv2d bias + relu ... Conv2d bias + relu
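The fusion idea on this slide can be illustrated without TVM. The following is a minimal pure-Python sketch, not Relay's actual fusion pass: all function names are hypothetical, and a plain elementwise `scale` stands in for Conv2d to keep the example short. The unfused version materializes an intermediate list after each operator; the fused version does the same arithmetic in a single loop.

```python
def scale(xs, w):
    # Stand-in for Conv2d: an elementwise multiply (an assumption for brevity).
    return [x * w for x in xs]

def bias_add(xs, b):
    return [x + b for x in xs]

def relu(xs):
    return [max(x, 0.0) for x in xs]

def unfused(xs, w, b):
    # Three separate operators: two intermediate lists are allocated.
    return relu(bias_add(scale(xs, w), b))

def fused_scale_bias_relu(xs, w, b):
    # One fused operator: a single pass over the data, no intermediates.
    return [max(x * w + b, 0.0) for x in xs]

data = [-2.0, -0.5, 0.0, 1.5]
assert unfused(data, 2.0, 0.5) == fused_scale_bias_relu(data, 2.0, 0.5)
print(fused_scale_bias_relu(data, 2.0, 0.5))  # [0.0, 0.0, 0.5, 3.5]
```

Once the three steps are a single operation, a compiler is free to tune that one loop for the target, which is the point the slide makes.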
  • 30. Relay: Device Placement. Partition your network to run on multiple devices. Diagram: Conv2d bias + relu ... Conv2d bias + relu, split across CPU and GPU
  • 31. Relay: Layout Transformation. Generate efficient code for different data layouts. Diagram: Conv2d bias + relu ... Conv2d bias + relu, labeled NCHW
  • 32. Relay: Layout Transformation. Generate efficient code for different data layouts. Diagram: Conv2d bias + relu ... Conv2d bias + relu, labeled NHWC
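To make the layout slides concrete, here is a small pure-Python sketch of what a layout change does to the data, assuming the two layouts in question are NCHW (batch, channel, height, width) and NHWC. The function name `nchw_to_nhwc` is hypothetical and this is not TVM's implementation; it only shows how the same tensor elements are reordered in a flat buffer.

```python
def nchw_to_nhwc(flat, n, c, h, w):
    """Reorder a flat NCHW buffer into a flat NHWC buffer."""
    out = [0] * (n * c * h * w)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * c + ci) * h + hi) * w + wi  # offset in NCHW
                    dst = ((ni * h + hi) * w + wi) * c + ci  # offset in NHWC
                    out[dst] = flat[src]
    return out

# One image, 2 channels, 1x3 spatial size.
# Channel 0 holds [0, 1, 2]; channel 1 holds [10, 11, 12].
nchw = [0, 1, 2, 10, 11, 12]
nhwc = nchw_to_nhwc(nchw, n=1, c=2, h=1, w=3)
print(nhwc)  # [0, 10, 1, 11, 2, 12]
```

In NHWC the channel values of each pixel sit next to each other in memory, which some hardware targets prefer; that preference is why a compiler may generate different code per layout.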
  • 33. TIR Script ● TIR provides more flexibility than high-level tensor expressions. ● Not everything is expressible in TE, and auto-scheduling is not always perfect. ○ AutoScheduling 3.0 (code-named AutoTIR) is coming later this year. ○ We can also write TIR directly using TIRScript. Example from the slide:
        @tvm.script.tir
        def fuse_add_exp(a: ty.handle, c: ty.handle) -> None:
            A = tir.match_buffer(a, (64,))
            C = tir.match_buffer(c, (64,))
            B = tir.alloc_buffer((64,))
            with tir.block([64], "B") as [vi]:
                B[vi] = A[vi] + 1
            with tir.block([64], "C") as [vi]:
                C[vi] = exp(B[vi])
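For readers without a TVM installation, the computation that the `fuse_add_exp` example on slide 33 describes can be written as a plain Python reference. This is a sketch of the semantics only, under the assumption that the buffers hold 64 floats; `fuse_add_exp_reference` is a hypothetical name and nothing here uses the TVM API.

```python
import math

def fuse_add_exp_reference(a):
    """Reference semantics of the slide's kernel: B = A + 1, then C = exp(B)."""
    assert len(a) == 64, "the slide's buffers have shape (64,)"
    b = [x + 1 for x in a]        # corresponds to block "B"
    c = [math.exp(x) for x in b]  # corresponds to block "C"
    return c

out = fuse_add_exp_reference([0.0] * 64)
# Every input is 0, so every output is exp(1).
assert all(math.isclose(v, math.e) for v in out)
```

A reference like this is handy for checking a hand-written kernel's output against known-good values.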
  • 35. Faster Kernels for Dense-Sparse Multiplication ● Performance comparison on PruneBERT ● 3-10x faster than cuBLAS and cuSPARSE ● 1 engineer writing TensorIR kernels
  • 36. Performance at OctoML in 2020 ● Over 60 model x hardware benchmarking studies ● Each study compared TVM against best* baseline on the target ● Sorted by ascending log2 gain over baseline. Chart axes: model x hardware comparison points; TVM log2 fold improvement over baseline
  • 37. Across a broad variety of models and platforms ● 2.5x average performance improvement on non-public models (2.1x across all). Chart axes: model x hardware comparison points; TVM log2 fold improvement over baseline
  • 38. Across a broad variety of models and platforms ● 34x for Yolo-V3 on a MIPS based camera platform ● 5.3x: video analysis model on Nvidia T4 against TensorRT ● 4x: random forest on Nvidia 1070 against XGBoost ● 2.5x: MobilenetV3 on ARM A72 CPU. Chart axes: model x hardware comparison points; TVM log2 fold improvement over baseline
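The charts on these slides plot speedup as a log2 fold improvement. As a small worked example, the snippet below converts the speedups quoted on the slide into that scale. The dictionary keys are shortened labels and `log2_fold` is a hypothetical helper; only the four speedup figures come from the slide.

```python
import math

# Speedups quoted on the slide (TVM relative to the baseline).
speedups = {
    "Yolo-V3 on MIPS camera platform": 34.0,
    "video analysis model on Nvidia T4": 5.3,
    "random forest on Nvidia 1070": 4.0,
    "MobilenetV3 on ARM A72 CPU": 2.5,
}

def log2_fold(speedup):
    # 0 means parity with the baseline; each +1 is another doubling.
    return math.log2(speedup)

# Sort ascending, matching the ordering described for the chart.
for name, s in sorted(speedups.items(), key=lambda kv: kv[1]):
    print(f"{name}: {s}x -> log2 fold {log2_fold(s):.2f}")
```

On this scale a 4x speedup is exactly 2, and a slowdown would show up as a negative value, which makes wins and losses symmetric around zero.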
  • 41. Case Study: 90% cloud inference cost reduction. Background ● Top 10 Tech Company running multiple variations of customized CV models ● Model in batch processing/offline mode using standard HW targets of a major public cloud ● Billions of inferences per month ● Benchmarking on CPU and GPU. Results ● 3.8x: TensorRT 8-bit to TVM 8-bit ● 10x: TensorRT 8-bit to TVM 4-bit ● Potential to reduce hourly costs by 90% ● Up to 10X inferences/dollar increase. *V100 at hourly price of $3.00 per hour, T4 at $0.53
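The slide does not show how the 90% figure was derived. One plausible reading, sketched below as an assumption, is that a 10x gain in inferences per dollar means each inference costs one tenth as much, which is a 90% reduction. The helper names and the throughput number are hypothetical; only the $0.53 hourly T4 price and the 10x and 3.8x gains come from the slide.

```python
def inferences_per_dollar(inferences_per_second, price_per_hour):
    """Throughput per dollar for an instance billed by the hour."""
    return inferences_per_second * 3600 / price_per_hour

def cost_reduction(gain):
    """A k-times gain in inferences per dollar cuts cost per inference to 1/k."""
    return 1.0 - 1.0 / gain

# Hypothetical throughput of 1000 inferences/s on a T4 at the slide's $0.53/hour.
baseline = inferences_per_dollar(1000.0, 0.53)
print(f"baseline: {baseline:,.0f} inferences per dollar")

print(f"10x gain  -> {cost_reduction(10.0):.0%} lower cost per inference")   # 90%
print(f"3.8x gain -> {cost_reduction(3.8):.1%} lower cost per inference")    # 73.7%
```

Under this reading, the 3.8x result alone would already correspond to roughly a 74% reduction in cost per inference.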
  • 42. Results: TVM on CPU and GPU ● Intel X86: 2-5X performance (20-core Intel-Platinum-8269CY, fp32 performance data) ● NVIDIA GPU: 20-50% versus TensorRT (V100, fp32 performance data). Chart axes: normalized performance. See https://github.com/tlc-pack/tlcbench for benchmark scripts
  • 43. Why use the Octomizer vs “just” TVM OSS? ● Access to OctoML’s “cost models” ○ We aggregate Models x HW data ○ Continuous improvement ● No need to install any SW, latest TVM ● No need to set up benchmarking HW ● “Outer loop” automation ○ optimize/package multiple models against many HW targets in one go ● Access to comprehensive benchmarking data ○ E.g., for procurement, for HW vendor competitive analysis ● Access to OctoML support. Diagram labels: Octomizer (Compile, Optimize, Benchmark); Model x HW analytics data; ML Performance Model
  • 44. Octomizer Live Demo. API access waitlist: octoml.ai
  • 45. The Octonauts! You? View career opportunities at octoml.ai/careers
  • 46. Thank you! How to use Apache TVM to optimize your ML models By Sameer Farooqui