Scott Gray of Nervana Systems presented at the 2016 ICML conference, covering various ways of computing convolution in the "On-device Intelligence" workshop.
1. An Analysis of Convolution for Inference
24 June 2016
Scott Gray
Nervana Systems
MAKING MACHINES SMARTER.™
2. Direct Convolution
Proprietary and confidential. Do not distribute. Nervana
• Compute with in-place slicing + gemm
• Data layout considerations: C, H, W, N
• Minimize slicing logic
• Maximize contiguous access
• Leverage filter overlap
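The bullets above can be sketched in NumPy as slicing plus GEMM, assuming C,H,W,N layout, stride 1, and no padding (the function and variable names here are illustrative, not Neon's):

```python
import numpy as np

# Direct convolution as in-place slicing + GEMM, with C,H,W,N layout
# (channels outermost, batch N innermost for contiguous access).
def direct_conv(x, f):
    C, H, W, N = x.shape         # input:  channels, height, width, batch
    K, _, R, S = f.shape         # filter: K outputs, C inputs, R x S taps
    P, Q = H - R + 1, W - S + 1  # output spatial size (no pad, stride 1)
    y = np.zeros((K, P, Q, N), dtype=x.dtype)
    # One GEMM per filter tap: (K, C) @ (C, P*Q*N). Overlapping taps
    # reuse the same input data, which is the filter-overlap leverage.
    for r in range(R):
        for s in range(S):
            patch = x[:, r:r + P, s:s + Q, :].reshape(C, -1)
            y += (f[:, :, r, s] @ patch).reshape(K, P, Q, N)
    return y
```

The slices are views, not copies, so no im2col buffer is materialized; each tap's contribution is a single contiguous GEMM.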
3. Small N direct convolution: Without Superblocking
fprop
Q = (W-S+1 + 2 * pad) / stride
wi = sk + qj * stride - pad
Fig from V. Dumoulin,
https://github.com/vdumoulin/conv_arithmetic
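The fprop formulas above can be checked with a small worked example, using hypothetical values W=5, S=3, pad=1, stride=1:

```python
# Output size and input-index mapping from the slide's fprop formulas.
W, S, pad, stride = 5, 3, 1, 1
Q = (W - S + 1 + 2 * pad) // stride  # number of output positions

def wi(sk, qj):
    # Input index for filter tap sk at output position qj;
    # values < 0 or >= W fall in the zero padding.
    return sk + qj * stride - pad
```

With these values Q comes out to 5, and wi(0, 0) = -1 shows the first tap of the first output reading from the left padding.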
4. Small N direct convolution: With Superblocking
fprop
Q = (W-S+1 + 2 * pad) / stride
wi = sk + qj * stride - pad
5. Small N direct convolution: Bprop for deconv
bprop
pad’ = S - pad - 1
wi = (qj - pad’ + sk) / stride
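The bprop index mapping can likewise be illustrated with hypothetical values S=3, pad=1, stride=1:

```python
# Bprop (deconv) index mapping from the slide's formulas.
S, pad, stride = 3, 1, 1
pad_b = S - pad - 1  # pad' on the slide: the mirrored padding for bprop

def wi(sk, qj):
    # When stride > 1 only integer results are valid taps;
    # fractional values are skipped for that output position.
    return (qj - pad_b + sk) / stride
```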
6. Small N direct convolution: Dilated Filters
Dilated
S’ = (S-1) * rate + 1
Q = (W-S’+1 + 2*pad) / stride
wi = sk * rate + qj * stride - pad
Fig from F. Yu, V. Koltun
http://arxiv.org/abs/1511.07122v3
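The dilated-filter formulas reduce to the ordinary fprop formulas with an effective filter size S'. A worked example, assuming hypothetical values S=3, rate=2, W=9, pad=0, stride=1:

```python
# Dilated convolution: effective filter size, output size, index mapping.
S, rate = 3, 2
W, pad, stride = 9, 0, 1
S_eff = (S - 1) * rate + 1               # S' on the slide
Q = (W - S_eff + 1 + 2 * pad) // stride  # output positions

def wi(sk, qj):
    # Filter taps are spaced `rate` apart in the input.
    return sk * rate + qj * stride - pad
```

Here S_eff is 5 and Q is 5; the last tap of the last output, wi(2, 4), lands on input index 8, the final pixel.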
7. Convolution with Algorithmic Speedups
• FFT and Winograd share the same basic computational flow
• FFT tiles typically need to be much larger
• Winograd history: Toom and Cook, then Lavin
8. Winograd: input transform
Input Feature Map
4x4 stride 2
• Input transform
• 2D Winograd is a nested
product of 1D transforms
• Transforms can be
simplified to remove zeros
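A minimal NumPy sketch of the input transform, using the standard F(2x2,3x3) matrix B^T from Lavin and Gray; the 2D transform is the 1D transform nested over rows and then columns:

```python
import numpy as np

# F(2x2, 3x3) input transform: V = B^T d B. B^T contains only 0 and +/-1,
# so in practice the matmuls simplify to pure additions and subtractions.
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)

def input_transform(d):
    """Transform one 4x4 input tile (tiles are taken with stride 2)."""
    return Bt @ d @ Bt.T
```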
9. Winograd: filter transform
• Filter transform
• Same as input but with
different coefficients
• Transform each feature map
independently
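The filter transform has the same nested structure with different coefficients, using the standard F(2x2,3x3) matrix G from Lavin and Gray:

```python
import numpy as np

# F(2x2, 3x3) filter transform: U = G g G^T, applied independently
# to each 3x3 filter (one per input/output channel pair).
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)

def filter_transform(g):
    return G @ g @ G.T  # 3x3 filter -> 4x4 transformed filter
```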
10. Winograd: batched GEMM
• Point-wise Multiplication
• Posed as batched GEMM
operation
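The point-wise multiply, summed over input channels, is one independent GEMM per transform point (16 of them for F(2x2,3x3)). A sketch with hypothetical shapes:

```python
import numpy as np

# Hypothetical sizes: C input channels, K output channels, T tiles.
C, K, T = 8, 16, 32
U = np.random.randn(16, K, C).astype(np.float32)  # transformed filters
V = np.random.randn(16, C, T).astype(np.float32)  # transformed input tiles

# Reduction over C at each of the 16 transform points:
# 16 independent (K x C) @ (C x T) multiplies, i.e. a batched GEMM.
M = np.einsum('xkc,xct->xkt', U, V)
```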
11. Winograd: output transform
Output Feature Map
• Output transform
• Same as input and filter
• Transform back to pixel
space to obtain 2x2 output
tile
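The output transform completes the pipeline, again with the standard F(2x2,3x3) matrix A^T from Lavin and Gray:

```python
import numpy as np

# F(2x2, 3x3) output transform: Y = A^T M A maps the 4x4 element-wise
# product back to pixel space as a 2x2 output tile.
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

def output_transform(m):
    return At @ m @ At.T  # 4x4 -> 2x2 output tile
```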
14. Multiplier Transistor Efficiency
Algo     bits   speedup   transistors   perf / transistor
Direct   8      1.0       3000          1.0
2x2      9      2.25      3750          1.8
4x4      12     4.0       6000          2.0
Transistor counts from Wikipedia.
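The last column is the algorithmic speedup divided by the relative transistor cost, normalized to the direct 8-bit multiplier; a quick check:

```python
# Performance per transistor = speedup / (transistors / baseline transistors).
rows = {            # algo: (speedup, transistors)
    'Direct': (1.0, 3000),
    '2x2':    (2.25, 3750),
    '4x4':    (4.0, 6000),
}
base = rows['Direct'][1]
perf_per_transistor = {a: s / (t / base) for a, (s, t) in rows.items()}
```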
15. Logarithmic quantization
D. Miyashita, EH. Lee, B. Murmann
Convolutional Neural Networks using Logarithmic Data Representation
http://arxiv.org/abs/1603.01025v2
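A rough sketch in the spirit of Miyashita et al.: values are rounded to the nearest power of two, so multiplies become bit shifts. The `fsr` and `bits` parameters here are my simplification of the paper's full-scale-range clipping, not its exact definition:

```python
import numpy as np

def log_quantize(x, fsr=0, bits=4):
    """Quantize |x| to 2^(rounded log2 exponent), keeping sign.
    Exponents are clipped to [fsr - 2**bits, fsr] (assumed scheme)."""
    sign = np.sign(x)
    mag = np.clip(np.round(np.log2(np.abs(x) + 1e-32)),
                  fsr - 2 ** bits, fsr)
    return sign * np.exp2(mag)
```

For example, 0.26 quantizes to 0.25 (exponent -2) and 0.1 to 0.125 (exponent -3).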
16. Performance: VGG fp32 on GTX1080
[Chart: effective TFLOPS vs. batch size (64, 32, 16, 8, 4, 2, 1) for VGG totals; series: Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), cuDNN FFT]
17. Peak Performance: VGG fp32 on GTX1080
[Chart: effective TFLOPS vs. batch size (64, 32, 16, 8, 4, 2, 1) for VGG layer 4.2; series: Neon Direct, Neon F(2x2,3x3), Neon F(4x4,3x3), cuDNN FFT]