Copyright © 2016 LUXOFT 1
Alexey Rybakov, LUXOFT
May 3, 2016
Making Computer Vision Software Run
Fast on Your Embedded Platform
Art and Science of Optimization
Global Software Engineering:
• Low-Power GPU Software
• Custom Vision Software
Why LUXOFT is Giving This Talk
10,000+ Luxoft software engineers
• Obstruction Removal for Drones
• CAFFE on ARM Mali
• OpenCV on ImgTec PowerVR
• HDR Encoding on GPU-based
• Low-power Motion Stabilization
• GPU-optimized 4K VP9 video codec
• See demos at our booth
Our Optimization Projects Covered in This Talk
[Figure panels: Drone Vision, Fast OpenCV, HDR on GPU, Caffe on GPU, Stabilization, Fast 4K Codecs]
• Qualifying question: Who Develops Computer Vision Software?
• Typical situations in embedded SW development:
• Great new algorithm → Implement
• Implementation platform: Desktop-class → Embedded*
• Decision making: Delayed → Real-time*
• Performance: Low FPS → High FPS*
Poll
* Context of this presentation
• Need: reliable, real-time, on-device, decision-making from visual
data...implemented on a constrained HW platform (with exotic architecture)
• What to do
1. Map CV pipeline onto HW platform
2. Rethink system requirements
3. Rework algorithm logic
4. Use GPU, DSP and other aid (properly!)
5. Code optimization
6. Know your platform inside out
Embedded Vision: Challenges and Opportunities
1. Map CV Pipeline onto HW Platform
Embedded Vision: Pipeline and Hardware
Evaluate your platform:
• Hardware features and accelerators, slow/fast memory, power management?
• Support from run-time: OS, drivers, OpenCL, CUDA, other frameworks?
• Toolchain: Compiler, debugger, profiler, [access to] documentation, optimization guides?
• Available CV frameworks: OpenCV, IPP, fastCV, other?
Benchmark your embedded platform vs. reference:
• Run simple tests: data copy, access, vectorization, memory use, energy management
• Test if CV-framework functions are optimized (coverage is often low)
…This will give you a measured optimization goal
Study and Test HW Platform
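One of the "simple tests" above, a data-copy benchmark, could look roughly like the sketch below. It is a minimal host-side illustration in Python; the function name `copy_bandwidth_gbps` and the buffer size are assumptions made here, not something from the talk, and on a real target the same idea would be repeated in the platform's native toolchain for each memory type.

```python
import time

def copy_bandwidth_gbps(size_bytes=32 * 1024 * 1024, repeats=5):
    """Return the best observed copy throughput in GB/s (decimal GB).

    Hypothetical helper: copies a zero-filled buffer several times and
    keeps the fastest run, which filters out scheduler noise.
    """
    src = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)  # one full read of src plus one full write of dst
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    assert len(dst) == size_bytes
    best = max(best, 1e-9)  # guard against a zero reading on a coarse timer
    return size_bytes / best / 1e9

print(f"copy throughput: {copy_bandwidth_gbps():.2f} GB/s")
```

Running the same measurement on the reference machine and on the embedded target gives a first ratio to plan against.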
Mapping to Platform: Histogram Example
SoC: Intel Merrifield platform; device: Dell Venue 3840
Transfer bandwidth: HOST → GPU = 1.33 GB/s, GPU → HOST = 0.11 GB/s
[Figure: Option A vs. Option B. Both pipelines run Camera → Histogram → Histogram equalization → Apply LUT; stages are color-coded as GPU processing, CPU processing, or memory transfers.]
Figure labels: histogram* 2 ms; histogram 4.2 ms; 16.2 ms; 2 MB data transfer (HD frame); 1 KB data transfer (three times)
* Histogram collection on CPU is more than 2 times faster than on GPU.
** Histogram equalization is a single-threaded, iterative pass over the histogram, so a GPU implementation is not reasonable.
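The bandwidth figures on this slide make it possible to estimate what each transfer costs before choosing where a stage runs. The sketch below is a back-of-the-envelope model only: it assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes), ignores per-transfer setup latency, and the helper name `transfer_ms` is made up for illustration.

```python
def transfer_ms(num_bytes, bandwidth_gb_per_s):
    """Estimated transfer time in milliseconds, bandwidth-limited only."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3

HOST_TO_GPU_GBPS = 1.33   # from the slide
GPU_TO_HOST_GBPS = 0.11   # from the slide
FRAME_BYTES = 2_000_000   # "2 MB data transfer (HD frame)", decimal MB assumed
HIST_BYTES = 1_000        # "1 KB data transfer", decimal KB assumed

print(f"frame, host -> GPU:     {transfer_ms(FRAME_BYTES, HOST_TO_GPU_GBPS):.2f} ms")
print(f"frame, GPU -> host:     {transfer_ms(FRAME_BYTES, GPU_TO_HOST_GBPS):.2f} ms")
print(f"histogram, GPU -> host: {transfer_ms(HIST_BYTES, GPU_TO_HOST_GBPS):.4f} ms")
```

Under these assumptions a full frame read back from the GPU costs on the order of 18 ms, while a 1 KB histogram costs almost nothing, which is why moving only the small intermediate result between devices is attractive.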
2. Rethink System Requirements!
• Important concept: “Good enough”
• How does your use case differ from classic/desktop requirements?
Art of “controlled worse”
• What decision latency do you need?
• What resolution/precision?
• Do you need the whole frame or only a region?
Optimize System Requirements
• Universal implementation* → Our Drone implementation
• Any motion → Linear motion
• Any obstacles → Opaque obstacles
• Have only image data → Use sensor fusion (gyro)
• More than 100X faster!
Rethink Requirements:
Obstruction Removal, Drone Edition
Camera Output
*MIT CSAIL and Google Research, SIGGRAPH 2015
3. Rework Algorithm Logic
• Desktop → Embedded
• High-Res → Downsampling / pyramid
• Color → Monochrome or luminance
• Entire frame → Regions of Interest only
• ROI cascading example: HOG to DNN
• Every frame → 1/N + approximation
• Inter-frame cascading: Detection to Tracking
• Image only → Sensor fusion
• Example: gyro + vision for motion est.
• CPU → Parallelize for GPU
Algorithm Optimization Opportunities
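Three of the reductions listed above (color to luminance, high-res to downsampled, entire frame to region of interest) can be sketched in a few lines. This is a plain-Python illustration on nested lists, not the code used in the talk; the function names are hypothetical, and the luminance weights are the standard Rec. 601 coefficients, chosen here as an assumption.

```python
def to_luma(rgb):
    """rgb: rows of (r, g, b) tuples -> rows of luminance values (Rec. 601 weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb]

def downsample2x(img):
    """One pyramid level: average each 2x2 block; an odd trailing row/column is dropped."""
    h = len(img) // 2 * 2
    w = len(img[0]) // 2 * 2
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

def crop(img, top, left, height, width):
    """Keep only a region of interest."""
    return [row[left:left + width] for row in img[top:top + height]]

# Usage: a tiny 4x4 white frame shrinks to one luminance sample in the ROI.
frame = [[(255, 255, 255)] * 4 for _ in range(4)]
small = downsample2x(to_luma(frame))   # 2x2 luminance image
roi = crop(small, 0, 1, 1, 1)          # 1x1 region
print(len(small), len(small[0]), roi)
```

Each step cuts the data the later stages must touch: luminance by 3x, one pyramid level by 4x, and the ROI by whatever fraction of the frame it covers.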
Optimized Video Stabilization Algorithm
• Motion Vector Field only for 3x3 grid (pyramid downsampling)
• Only shift and rotation
• Inter-frame border reconstruction
• → 1000x+ performance
• → Real-time 4K UHD on mobile
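Restricting the motion model to shift and rotation means the nine vectors of a 3x3 grid are enough to solve for three unknowns. The slide does not show how the fit is done; the sketch below uses a standard 2D least-squares rigid fit as a stand-in, and the function name and grid coordinates are assumptions for illustration.

```python
import math

def fit_shift_rotation(points, vectors):
    """Least-squares rotation angle (radians) and translation (tx, ty)
    that map each grid point p to p + v, where v is its motion vector."""
    n = len(points)
    moved = [(px + vx, py + vy) for (px, py), (vx, vy) in zip(points, vectors)]
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    mx = sum(q[0] for q in moved) / n
    my = sum(q[1] for q in moved) / n
    s_cos = 0.0
    s_sin = 0.0
    for (px, py), (qx, qy) in zip(points, moved):
        ax, ay = px - cx, py - cy      # centered source point
        bx, by = qx - mx, qy - my      # centered moved point
        s_cos += ax * bx + ay * by     # proportional to cos(angle)
        s_sin += ax * by - ay * bx     # proportional to sin(angle)
    angle = math.atan2(s_sin, s_cos)
    c, s = math.cos(angle), math.sin(angle)
    tx = mx - (c * cx - s * cy)
    ty = my - (s * cx + c * cy)
    return angle, (tx, ty)

# Usage: synthesize vectors for a known motion on a 3x3 grid and recover it.
grid = [(x, y) for y in (100.0, 500.0, 900.0) for x in (200.0, 900.0, 1600.0)]
true_angle, true_shift = 0.02, (3.0, -2.0)
ca, sa = math.cos(true_angle), math.sin(true_angle)
mvf = [(ca * x - sa * y + true_shift[0] - x, sa * x + ca * y + true_shift[1] - y)
       for x, y in grid]
print(fit_shift_rotation(grid, mvf))
```

The fit costs a few dozen multiply-adds per frame regardless of resolution, which is consistent with the idea that shrinking the model, not tuning the code, is where the large factor comes from.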
4. Use GPU and Other Aid (Properly)
• Good news: computer vision is very parallelizable
• Bad news: coordination between CPU and GPU (and other compute devices) is the tricky part
• GPU: What to do (beyond algorithm-to-platform mapping and reworked logic)
• A few simple rules: memory types, datatypes, workgroup size, memory alignment
• Master the art of kernel synchronization: load your cores
• Use GPU pre-optimized libraries, like OpenCV on some platforms
• Master OpenCL
• Also explore available ISP or DSP benefits.
Use GPU. Properly
1. Memory Hierarchy
2. Task Synchronization
• Example of both: Large Matrix Transpose
GPU, Two Key Concepts
Original. All FPS measured on Galaxy S7:
• Run existing DNN framework: CAFFE
• = 0.7 FPS (EIGEN OpenCL library)
CPU Optimization (not a through road):
• Optimized version for Android: DNN optimized OpenBLAS: OpenMP and NEON → +2 FPS
GPU Optimizations:
• Better OpenCL implementation on ViennaCL library → +0.5 FPS
• Found bottleneck: SGEMM functions
• → Rewrite SGEMM (workgroup size, vectorization, etc.) → +4.5 FPS
Final optimized performance: 5-6 FPS
ARM Mali Accelerated CAFFE
Configuration                            FPS
Open Source CPU, 1 thread                0.7
Open Source GPU, OpenCL (ViennaCL)       1.2
Open Source CPU, multithreaded, NEON     2.5
LUXOFT                                   5.4
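The rewritten SGEMM kernel itself is not shown in the slides. As a rough illustration of the kind of restructuring involved, the sketch below computes a matrix product block by block so that each small block of the operands is reused while it is still in fast memory; on a GPU the block size would correspond loosely to the workgroup tile. The function name and block size are assumptions, and this is plain Python, not an OpenCL kernel.

```python
def matmul_blocked(a, b, block=2):
    """C = A x B computed tile by tile. a is n x k, b is k x m (lists of rows)."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for p0 in range(0, k, block):
            for j0 in range(0, m, block):
                # Work on one (block x block) tile of A, B and C at a time.
                for i in range(i0, min(i0 + block, n)):
                    row_c, row_a = c[i], a[i]
                    for p in range(p0, min(p0 + block, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ip * row_b[j]
    return c

print(matmul_blocked([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The result is independent of the block size, so the block size can be tuned by benchmark alone, which matches the talk's advice to measure rather than assume.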
ARM Mali Accelerated CAFFE: Benchmarks
[Chart legend: colors distinguish FPS, CPU load, and battery charge; lines distinguish CPU and optimized GPU]
VP9 Video Decoder Optimization for GPU
Original CPU Algorithm
• CPU: Superblock-level parallelism
Reworked and Optimized GPU Algorithm
• GPU: Frame-level parallelism
• Uses more memory
[Figure: both diagrams run from input frame to output frame through the stages Parsing & Entropy Decode, Motion Compensation, Intra Prediction, Inverse Quant, Inverse Transform, Reconstruction, Loop filtering; stages are color-coded as GPU processing or CPU processing]
Optimization result: 2x-5x FPS depending on bitrate.
Platforms: AMD, Intel, NVidia SoCs
5. Code Optimization
Two Enemies: Code and Data
• Two enemies
1. Computation
2. Data transfers
• Waste of time = waste of energy
Don't calculate:
- Use table/lookup functions
- Use polynomial approximations
Use classic techniques:
- Like loop unrolling
- Converting to native data types
Don't move data:
- Use local and cache memory
- Partition/group DRAM access
Benchmark everything:
- Compiler computation options
- Memory transfers
Controversial example: the ARM compiler does it automatically; some others don't
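"Don't calculate, use a table" can be illustrated with gamma correction on 8-bit pixels: the power function is evaluated 256 times up front, and every pixel afterwards costs one indexed read. Gamma correction is an example picked here, not one named in the talk, and the helper names are hypothetical.

```python
def build_gamma_lut(gamma):
    """Precompute the output value for every possible 8-bit input."""
    return [round(255.0 * (i / 255.0) ** gamma) for i in range(256)]

def apply_lut(pixels, lut):
    """Per-pixel work is a single table lookup, with no pow() call."""
    return [lut[p] for p in pixels]

# Usage: build once, reuse for every frame.
lut = build_gamma_lut(0.5)
print(apply_lut([0, 64, 255], lut))
```

A 256-entry table of bytes is small enough to sit in local or cache memory, so the technique also helps with the second enemy, data transfers.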
Copyright © 2016 LUXOFT 24
OpenCV on ImgTec PowerVR GPU: Histogram Example
OpenCV local contrast for HD camera adjustment in real time
• Existing OpenCV histogram implementations don't fit into the 1080p frame processing budget (need 16 ms/frame for the entire algorithm chain to obtain 60 FPS)
Things to do
• Experiment
• Benchmark
• Choose the best method
Optimization Results
Histogram Gathering Method                       Time, ms
OpenCV histogram (CPU)                           7.5
OpenCV histogram (GPU)                           4.4
Luxoft-PowerVR (atomic_add to global memory)     0.69
Luxoft-PowerVR (atomic_add to local memory)      7.51
Luxoft-PowerVR (increment at local memory)       3.28
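The Luxoft-PowerVR variants in the timings above differ in where the counters live: one shared histogram updated by everyone, or private per-group histograms merged at the end. The sketch below shows only that structural difference, sequentially and in plain Python, so it says nothing about speed; which variant wins is exactly what the slide says must be benchmarked per platform. Function names and the tile size are assumptions.

```python
def histogram(pixels, bins=256):
    """Single shared histogram: every pixel updates the same counters."""
    counts = [0] * bins
    for p in pixels:
        counts[p] += 1
    return counts

def histogram_tiled(pixels, tile_size, bins=256):
    """Per-tile private histograms (the 'local memory' analog), merged at the end."""
    total = [0] * bins
    for start in range(0, len(pixels), tile_size):
        local = histogram(pixels[start:start + tile_size], bins)  # private accumulator
        for i, count in enumerate(local):
            total[i] += count  # merge step: one pass over the bins per tile
    return total

# Usage: both structures must produce identical counts.
data = [(i * 37) % 256 for i in range(10_000)]
print(histogram(data) == histogram_tiled(data, tile_size=1024))
```

The merge adds work proportional to bins times tiles, which is one reason a "local" variant is not automatically faster than a "global" one.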
Copyright © 2016 LUXOFT 25
Example: Fighting Data Transfers
• Example: “memory tiling”
A tiled memory layout may give a 2x-3x performance gain for vision algorithms: 1 DRAM read vs. 4 DRAM reads in matrix transpose
• Reference you need to obtain or produce (will vary by CPU/GPU of your choice)
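The matrix-transpose case mentioned above can be sketched as a tile-by-tile traversal: each small square is read and written while it is resident in fast memory, instead of striding across whole rows and columns. The code below is a plain-Python illustration of the access pattern only; the function name and tile size are assumptions, and the actual gain depends on the memory system of the target.

```python
def transpose_tiled(matrix, tile=4):
    """Transpose a rows x cols matrix by visiting it in tile x tile squares."""
    rows, cols = len(matrix), len(matrix[0])
    out = [[None] * rows for _ in range(cols)]
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            # All reads and writes for this square stay within one tile.
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    out[c][r] = matrix[r][c]
    return out

# Usage: a 3x5 matrix with a tile size that does not divide it evenly.
m = [[r * 5 + c for c in range(5)] for r in range(3)]
print(transpose_tiled(m, tile=2))
```

The tile size is the knob to benchmark: it should be chosen so that one input tile and one output tile fit in the fastest memory level available.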
Copyright © 2016 LUXOFT 26
Know Your Platform Inside Out
6.
Copyright © 2016 LUXOFT 27
• Things to do
• Study documentation and optimization guides for your exact HW
• Again, test/benchmark a feature before you critically rely on it
• What works for you
• Modern GPUs and DSPs may implement the entire algorithm in 1 instruction
• What works against you
• Don’t assume everything will work as documented
• “Fast” memory …may be slow (like early versions of Snapdragon)
• Great technology …but no documentation and no code examples (like iOS
Metal for compute)
Platform Specifics
Copyright © 2016 LUXOFT 28
Platform Example: AMD GPU for Frame Interpolation
• Motion vector field upsampling, common task for CV
• OpenCL supports bilinear interpolation of everything
• How to, AMD OpenCL implementation
• AMD has QSAD function: the fastest way to SAD for blocks
• Keep MVF in Image2D
• Use sampler with CLK_FILTER_LINEAR (bilinear filtering for 2D images)
[Figure: Basic vs. Optimized]
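What the hardware sampler does for free can be written out by hand to see the amount of arithmetic it replaces. The sketch below upsamples a small motion vector field with bilinear interpolation in plain Python. It is a reference illustration, not the OpenCL code from the talk: the function names are hypothetical, and the corner-aligned coordinate mapping is an assumption that differs from the pixel-center convention a real image sampler uses.

```python
def bilinear_sample(field, x, y):
    """field: rows of (vx, vy) vectors; (x, y) in grid coordinates, clamped to the edges."""
    h, w = len(field), len(field[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    result = []
    for k in range(2):  # interpolate vx and vy independently
        top = field[y0][x0][k] + (field[y0][x1][k] - field[y0][x0][k]) * fx
        bottom = field[y1][x0][k] + (field[y1][x1][k] - field[y1][x0][k]) * fx
        result.append(top + (bottom - top) * fy)
    return tuple(result)

def upsample_field(field, out_h, out_w):
    """Resize the field to out_h x out_w, mapping corners to corners."""
    h, w = len(field), len(field[0])
    sy = (h - 1) / (out_h - 1) if out_h > 1 else 0.0
    sx = (w - 1) / (out_w - 1) if out_w > 1 else 0.0
    return [[bilinear_sample(field, c * sx, r * sy) for c in range(out_w)]
            for r in range(out_h)]

# Usage: a 2x2 field becomes 3x3; the new center is the average of the corners.
coarse = [[(0.0, 0.0), (2.0, 0.0)],
          [(0.0, 2.0), (2.0, 2.0)]]
print(upsample_field(coarse, 3, 3)[1][1])
```

Every output vector here costs four reads and six multiply-adds in software; storing the field in an image object moves that work into the texture unit.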
Copyright © 2016 LUXOFT 29
Platform Example: Apple iOS Metal for GPU Compute
Encountered while working on GPU-optimized JPEG-HDR encoding on iPhone
iOS Metal Compute Findings:
• No code examples for compute, weak documentation = black box
• Only 64 GitHub repos, no serious projects
• Xcode profiler does not work with Metal Compute → use workaround: manual timer-based profiling
• Vector types are not fully supported by the compiler → test everything, then use workaround: combine scalars and vectors
We still achieved about 3x-4x faster JPEG Encode on iPhone … just took a lot of extra work
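Manual timer-based profiling, the workaround named above, amounts to wrapping each stage in a wall-clock timer and collecting the samples. The sketch below shows the pattern in Python with hypothetical names (`timed`, `timings`); it is not the team's tooling. When the timed work is submitted to a GPU, the timer should only be stopped after the host has waited for the work to complete, otherwise only the submission cost is measured.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> list of elapsed seconds, one entry per run

@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.setdefault(label, []).append(time.perf_counter() - start)

# Usage: time a stand-in workload several times and report the best run.
for _ in range(5):
    with timed("stage_a"):
        sum(i * i for i in range(100_000))
print(f"stage_a best: {min(timings['stage_a']) * 1e3:.3f} ms")
```

Reporting the minimum over several runs, rather than a single sample, reduces the influence of scheduling noise and frequency scaling.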
Copyright © 2016 LUXOFT 30
Lessons Learned and Resources
!
Copyright © 2016 LUXOFT 31
1. Learn, test, profile, and benchmark every component of your system. Including
compiler. Don’t assume.
2. Don’t port 1:1. Rework requirements and algorithm logic too.
3. GPU and other non-CPU compute architectures may give fantastic results.
4. Use parallelization and computer vision frameworks like OpenCL or OpenCV.
Rewrite critical parts there as needed.
5. Modern HW platforms implement popular algorithms in one function call. Study
platform-specific optimization guides.
6. Sometimes things won’t work as documented. This is normal.
7. Optimization is a mix of art and science. Think outside the box.
Lessons Learned
Copyright © 2016 LUXOFT 32
• Embedded Vision Alliance: http://www.embedded-vision.com/
• Platform optimization guides and blog posts from:
• Altera (now Intel), AMD, ARM, Imagination Technologies, NVidia,
Qualcomm, TI
• Luxoft Computer Vision team: vision@luxoft.com
Resources
Copyright © 2016 LUXOFT 33
Thank you!
LUXOFT Presentation R&D Team:
Aleksandr Bobrovnik
Aleksandr Volkov
Alexey Rybakov
Anton Veselov
Artem Galin
Dmitriy Marenkov
Dmitry Ivanov
Ekaterina Popova
Ihor Starepravo
Ildar Valiev
Marat Gilmutdinov
Nikolay Nemcev
Oleksandr Murovanyi
Sergey Fedorov
Valery Bobrov
Viktor Pasoshnikov
Copyright © 2016 LUXOFT 34
See demos at our booth. And email me too
Alexey Rybakov
Senior Director, Embedded
LUXOFT, Menlo Park, CA
ARybakov@luxoft.com

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

"Making Computer Vision Software Run Fast on Your Embedded Platform," a Presentation from LUXOFT

  • 1. Making Computer Vision Software Run Fast on Your Embedded Platform: Art and Science of Optimization. Alexey Rybakov, LUXOFT. May 3, 2016. Copyright © 2016 LUXOFT.
  • 2. Why LUXOFT is Giving This Talk. Global Software Engineering: • Low-Power GPU Software • Custom Vision Software. 10,000+ Luxoft software engineers.
  • 3. Our Optimization Projects Covered in This Talk. • Obstruction Removal for Drones • CAFFE on ARM Mali • OpenCV on ImgTec PowerVR • HDR Encoding on GPU-based • Low-power Motion Stabilization • GPU-optimized 4K VP9 video codec • See demos at our booth. Figure labels: Drone Vision, Fast OpenCV, HDR on GPU, Caffe on GPU, Stabilization, Fast 4K Codecs.
  • 4. Poll. • Qualifying question: Who develops computer vision software? • Typical situations in embedded SW development: • Great new algorithm → Implement • Implementation platform: Desktop-class → Embedded* • Decision making: Delayed → Real-time* • Performance: Low FPS → High FPS*. (* Context of this presentation.)
  • 5. Embedded Vision: Challenges and Opportunities. • Need: reliable, real-time, on-device decision-making from visual data, implemented on a constrained HW platform (with exotic architecture) • What to do: 1. Map CV pipeline onto HW platform 2. Rethink system requirements 3. Rework algorithm logic 4. Use GPU, DSP and other aid (properly!) 5. Code optimization 6. Know your platform inside out
  • 6. 1. Map CV Pipeline onto HW Platform
  • 7. Embedded Vision: Pipeline and Hardware
  • 8. Study and Test HW Platform. Evaluate your platform: • Hardware features and accelerators, slow/fast memory, power management? • Support from run-time: OS, drivers, OpenCL, CUDA, other frameworks? • Toolchain: compiler, debugger, profiler, [access to] documentation, optimization guides? • Available CV frameworks: OpenCV, IPP, fastCV, other? Benchmark your embedded platform vs. reference: • Run simple tests: data copy, access, vectorization, memory use, energy management • Test if CV-framework functions are optimized (coverage is often low). This will give you a measured optimization goal.
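A minimal sketch of the "run simple tests: data copy" step, assuming Python is available where you run it. It times a plain memory copy and reports the best observed bandwidth. The function name `copy_bandwidth_gbps` and the buffer size are illustrative choices, not from the talk. On a real target the same test would more likely be written in C or OpenCL so that interpreter overhead does not distort the numbers.

```python
import time


def copy_bandwidth_gbps(n_bytes=8 * 1024 * 1024, repeats=5):
    """Return the best observed copy bandwidth in GB/s (1 GB = 1e9 bytes)."""
    src = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst = bytes(src)  # one full copy of the buffer
        elapsed = time.perf_counter() - t0
        best = min(best, elapsed)
        del dst
    # Guard against a zero reading from a coarse timer.
    return n_bytes / max(best, 1e-12) / 1e9


if __name__ == "__main__":
    print(f"copy bandwidth: {copy_bandwidth_gbps():.2f} GB/s")
```

Taking the best of several runs reduces noise from other activity on the device. The number is only meaningful when compared against the same test on a reference platform.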
  • 9. Mapping to Platform: Histogram Example (Option A vs. Option B). Pipeline in both options: Camera → Histogram → Histogram equalization** → Apply LUT. Diagram labels: Histo* 2 ms in one option, Histo 4.2 ms in the other; 16.2 ms; 2 MB data transfer (HD frame); three 1 KB data transfers. Legend: GPU processing, CPU processing. Memory transfers: HOST → GPU = 1.33 GB/s, GPU → HOST = 0.11 GB/s. SoC: Intel Merrifield platform. Device: Dell Venue 3840. * Histogram collection on CPU is more than 2 times faster than on GPU. ** Histogram equalization is single-threaded, iterative histogram processing, so a GPU implementation is not reasonable.
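To make the transfer costs on this slide concrete, here is a small worked estimate using the two bandwidth figures quoted above. The frame size (1920x1080, one byte per pixel, about 2 MB) and the 1 KB histogram size are assumptions chosen to match the labels on the slide, and `transfer_ms` is a hypothetical helper. The model ignores latency and driver overhead, so it will not reproduce the measured timings exactly.

```python
# Bandwidths quoted on the slide (1 GB taken as 1e9 bytes).
HOST_TO_GPU_GBPS = 1.33
GPU_TO_HOST_GBPS = 0.11

# Assumed payload sizes: an 8-bit HD frame and a 256-bin, 4-byte histogram.
FRAME_BYTES = 1920 * 1080
HIST_BYTES = 256 * 4


def transfer_ms(n_bytes, gb_per_s):
    """Idealized transfer time in milliseconds: size divided by bandwidth."""
    return n_bytes / (gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    print(f"frame host->GPU: {transfer_ms(FRAME_BYTES, HOST_TO_GPU_GBPS):.2f} ms")
    print(f"frame GPU->host: {transfer_ms(FRAME_BYTES, GPU_TO_HOST_GBPS):.2f} ms")
    print(f"histogram host->GPU: {transfer_ms(HIST_BYTES, HOST_TO_GPU_GBPS):.5f} ms")
```

Under these assumptions a full-frame transfer costs milliseconds to tens of milliseconds, while a 1 KB histogram transfer is negligible. That asymmetry is why the choice of where each pipeline stage runs matters more than the raw speed of any single stage.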
  • 10. 2. Rethink System Requirements!
  • 11. Optimize System Requirements. • Important concept: “Good enough” • How does your use case differ from classic/desktop requirements? The art of “controlled worse”: • What decision latency do you need? • What resolution/precision? • Do you need the whole frame or only a region?
  • 12. Rethink Requirements: Obstruction Removal, Drone Edition. Universal implementation* → Our drone implementation: • Any motion → Linear motion • Any obstacles → Opaque obstacles • Image data only → Sensor fusion (gyro) • More than 100x faster! Figure: Camera vs. Output. (* MIT CSAIL and Google Research, SIGGRAPH 2015.)
  • 13. 3. Rework Algorithm Logic
  • 14. Algorithm Optimization Opportunities. Desktop → Embedded: • High-res → Downsampling / pyramid • Color → Monochrome or luminance • Entire frame → Regions of interest only • ROI cascading example: HOG to DNN • Every frame → 1/N + approximation • Inter-frame cascading: detection to tracking • Image only → Sensor fusion • Example: gyro + vision for motion estimation • CPU → Parallelize for GPU
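The "every frame → 1/N + approximation" and "detection to tracking" items can be sketched as a simple scheduling loop: run the expensive detector on every N-th frame and a cheaper tracker in between. The `detect` and `track` callables and the `every_n` parameter are hypothetical placeholders, not an API from the talk.

```python
def process_stream(frames, detect, track, every_n=5):
    """Run `detect` on every N-th frame and `track` on the frames in between.

    detect(frame) -> state, or None if nothing was found
    track(frame, previous_state) -> updated state
    """
    results = []
    state = None
    for index, frame in enumerate(frames):
        if state is None or index % every_n == 0:
            # Expensive path: full detection (also used to recover a lost target).
            state = detect(frame)
        else:
            # Cheap path: update the previous result instead of re-detecting.
            state = track(frame, state)
        results.append(state)
    return results
```

The right value of `every_n` depends on how quickly the scene changes and how much drift the tracker accumulates, so it is worth measuring rather than guessing.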
  • 15. Optimized Video Stabilization Algorithm. • Motion Vector Field only for a 3x3 grid (pyramid downsampling) • Only shift and rotation • Inter-frame border reconstruction → 1000x+ performance → Real-time 4K UHD on mobile
  • 16. 4. Use GPU and Other Aid (Properly)
  • 17. Use GPU. Properly. • Good news: computer vision is very parallelizable • Bad news: coordination between CPU and GPU (and other compute devices) is the tricky part • GPU: what to do (beyond algorithm-to-platform mapping and reworked logic): • A few simple rules: memory types, data types, workgroup size, memory alignment • Master the art of kernel synchronization: load your cores • Use GPU pre-optimized libraries, like OpenCV on some platforms • Master OpenCL • Also explore available ISP or DSP benefits
  • 18. GPU, Two Key Concepts. 1. Memory Hierarchy 2. Task Synchronization • Example of both: Large Matrix Transpose
  • 19. ARM Mali Accelerated CAFFE. All FPS measured on Galaxy S7. Original: • Run existing DNN framework: CAFFE • = 0.7 FPS (EIGEN OpenCL library). CPU optimization (not a through road): • Optimized version for Android: DNN-optimized OpenBLAS with OpenMP and NEON → +2 FPS. GPU optimizations: • Better OpenCL implementation on the ViennaCL library → +0.5 FPS • Found bottleneck: SGEMM functions → Rewrite SGEMM (workgroup size, vectorization, etc.) → +4.5 FPS. Final optimized performance: 5-6 FPS. Chart: Open Source CPU, 1 thread: 0.7 FPS; Open Source GPU OpenCL (ViennaCL): 1.2 FPS; Open Source CPU multithreaded, NEON: 2.5 FPS; LUXOFT: 5.4 FPS.
  • 20. ARM Mali Accelerated CAFFE: Benchmarks. Chart legend, colors: FPS, CPU Load, Battery Charge. Lines: CPU, Optimized GPU.
  • 21. VP9 Video Decoder Optimization for GPU. Pipeline stages (input frame → output frame): Parsing & Entropy Decode, Motion Compensation, Intra Prediction, Inverse Quant, Inverse Transform, Reconstruction, Loop Filtering. • Original CPU algorithm: superblock-level parallelism • Reworked and optimized GPU algorithm: frame-level parallelism; uses more memory. Legend: GPU processing, CPU processing. Optimization result: 2x-5x FPS depending on bitrate. Platforms: AMD, Intel, NVidia SoCs.
  • 22. 5. Code Optimization
  • 23. Two Enemies: Code and Data. • Two enemies: 1. Computation 2. Data transfers • Waste of time = waste of energy. Don’t calculate: use table/lookup functions; use polynomial approximations. Use classic techniques: loop unrolling; converting to native data types. Don’t move data: use local and cache memory; partition/group DRAM access. Benchmark everything: compiler computation options; memory transfers. Controversial example → the ARM compiler does it automatically; some others don’t.
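As an illustration of "don’t calculate, use table/lookup functions", the sketch below precomputes a 256-entry gamma table once and then applies it per pixel with a single table lookup. Gamma correction is an example chosen here, not one named in the talk, and `build_gamma_lut` and `apply_lut` are hypothetical helper names.

```python
def build_gamma_lut(gamma):
    """Precompute gamma correction for every possible 8-bit input value."""
    table = []
    for value in range(256):
        corrected = round(255 * (value / 255) ** gamma)
        table.append(min(255, max(0, corrected)))
    return bytes(table)


def apply_lut(pixels, lut):
    """Map each 8-bit pixel through the 256-entry table.

    bytes.translate performs the per-byte lookup without any
    floating-point work in the inner loop.
    """
    return bytes(pixels).translate(lut)


if __name__ == "__main__":
    lut = build_gamma_lut(0.5)
    frame = bytes([0, 64, 128, 255])
    print(list(apply_lut(frame, lut)))
```

The power function is evaluated 256 times in total instead of once per pixel, which is the whole point of the technique. The same table can be reused for every frame as long as the gamma value does not change.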
  • 24. OpenCV on ImgTec PowerVR GPU: Histogram Example. OpenCV local contrast for HD camera adjustment in real time. • Existing OpenCV histogram implementations don't fit into the 1080p frame processing budget (need 16 ms/frame for the entire algorithm chain to obtain 60 FPS). Optimization results (histogram gathering method: time): OpenCV histogram (CPU): 7.5 ms; OpenCV histogram (GPU): 4.4 ms; Luxoft-PowerVR (atomic_add to global memory): 0.69 ms; Luxoft-PowerVR (atomic_add to local memory): 7.51 ms; Luxoft-PowerVR (increment at local memory): 3.28 ms. Things to do: • Experiment • Benchmark • Choose the best method
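The "experiment, benchmark, choose the best method" advice can be sketched as a small harness that times several interchangeable implementations on the same data and reports the fastest. The two histogram variants below are plain CPU stand-ins written for this example. They are not the OpenCV or PowerVR implementations from the table, and the helper names are hypothetical.

```python
import time
from collections import Counter

FRAME_BUDGET_MS = 1000.0 / 60.0  # per-frame budget for 60 FPS


def hist_loop(pixels):
    """Histogram via an explicit per-pixel increment."""
    bins = [0] * 256
    for p in pixels:
        bins[p] += 1
    return bins


def hist_counter(pixels):
    """Histogram via collections.Counter."""
    counts = Counter(pixels)
    return [counts.get(v, 0) for v in range(256)]


def best_time_ms(fn, data, repeats=3):
    """Best wall-clock time of fn(data) over several runs, in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - t0)
    return best * 1e3


def pick_fastest(candidates, data):
    """Time every candidate on the same data; return (winner_name, timings)."""
    timings = {name: best_time_ms(fn, data) for name, fn in candidates.items()}
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    frame = bytes(range(256)) * 1024
    winner, timings = pick_fastest({"loop": hist_loop, "counter": hist_counter}, frame)
    for name, ms in timings.items():
        print(f"{name}: {ms:.2f} ms (frame budget {FRAME_BUDGET_MS:.1f} ms)")
    print("fastest:", winner)
```

As the table on the slide shows, the winner is not always the variant you would predict, so the harness should be rerun on each target platform.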
  • 25. Example: Fighting Data Transfers. • Example: “memory tiling”. A tiled memory layout may give a 2x-3x performance gain for vision algorithms: 1 DRAM read vs. 4 DRAM reads in matrix transpose. • Reference you need to obtain or produce (will vary by the CPU/GPU of your choice)
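A related and simpler idea is blocked (tiled) traversal: instead of changing how the matrix is stored, the transpose visits it one small tile at a time so that reads and writes both stay within a compact region. The sketch below shows that access pattern on a row-major list. It is not the tiled memory layout from the slide, and `transpose_tiled` and the tile size are illustrative assumptions. In Python it only demonstrates the loop structure; any speedup appears when the same pattern is used in C or an OpenCL kernel.

```python
def transpose_tiled(matrix, rows, cols, tile=4):
    """Transpose a row-major rows x cols matrix, visiting it tile by tile.

    Returns a new row-major list holding the cols x rows result.
    """
    out = [0] * (rows * cols)
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            # Work on one tile; edge tiles may be smaller than `tile`.
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    out[c * rows + r] = matrix[r * cols + c]
    return out


if __name__ == "__main__":
    m = [1, 2, 3,
         4, 5, 6]
    print(transpose_tiled(m, rows=2, cols=3, tile=2))
```

The tile size is normally chosen so that one source tile and one destination tile fit together in the fastest available memory, which is platform-specific and worth benchmarking.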
  • 26. 6. Know Your Platform Inside Out
  • 27. Platform Specifics. • Things to do: • Study documentation and optimization guides for your exact HW • Again, test/benchmark a feature before you critically rely on it • What works for you: • Modern GPUs and DSPs may implement the entire algorithm in 1 instruction • What works against you: • Don’t assume everything will work as documented • “Fast” memory may be slow (like early versions of Snapdragon) • Great technology, but no documentation and no code examples (like iOS Metal for compute)
  • 28. Platform Example: AMD GPU for Frame Interpolation. • Motion vector field (MVF) upsampling is a common task for CV • OpenCL supports bilinear interpolation of everything • How to, AMD OpenCL implementation: • AMD has a QSAD function, the fastest way to compute SAD for blocks • Keep the MVF in an Image2D • Use a sampler with CLK_FILTER_BILINEAR. Figure: Basic vs. Optimized.
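For reference, this is roughly the arithmetic a bilinear sampler performs when a coarse motion vector field is upsampled. The sketch is a plain CPU version with clamp-to-edge addressing and integer-aligned texel coordinates. Real OpenCL samplers use a texel-center convention and fixed-point weights, so results will differ slightly. The function names are hypothetical, and whether the vectors themselves should also be scaled by the upsampling factor depends on the units of your field.

```python
def bilinear_sample(field, width, height, x, y):
    """Bilinearly interpolate a row-major field of (dx, dy) vectors at (x, y)."""
    # Clamp the sample position to the valid range (clamp-to-edge).
    x = min(max(x, 0.0), width - 1)
    y = min(max(y, 0.0), height - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0

    def lerp(a, b, t):
        return a + (b - a) * t

    result = []
    for k in range(2):  # interpolate dx and dy independently
        top = lerp(field[y0 * width + x0][k], field[y0 * width + x1][k], fx)
        bottom = lerp(field[y1 * width + x0][k], field[y1 * width + x1][k], fx)
        result.append(lerp(top, bottom, fy))
    return tuple(result)


def upsample_mvf(field, width, height, factor):
    """Upsample the field by an integer factor using bilinear sampling."""
    return [
        bilinear_sample(field, width, height, X / factor, Y / factor)
        for Y in range(height * factor)
        for X in range(width * factor)
    ]


if __name__ == "__main__":
    coarse = [(0.0, 0.0), (2.0, 4.0)]  # a 2x1 field
    print(upsample_mvf(coarse, 2, 1, 2))
```

On a GPU the interpolation comes from the texture hardware when the field is read through an image sampler, which is the reason the slide recommends keeping the MVF in an Image2D.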
  • 29. Platform Example: Apple iOS Metal for GPU Compute. Encountered while working on GPU-optimized JPEG-HDR encoding on iPhone. Findings: • No code examples for compute, weak documentation = black box • Only 64 GitHub repos, no serious projects • The Xcode profiler does not work with Metal Compute → workaround: manual timer-based profiling • Vector types are not fully supported by the compiler → test everything, then work around it with a combined approach using scalars and vectors. We still achieved about 3x-4x faster JPEG encoding on iPhone; it just took a lot of extra work.
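Manual timer-based profiling, the workaround mentioned above, can be as simple as wrapping each pipeline stage in a timer and collecting the results per label. The sketch below shows the idea in Python with a hypothetical `timed` helper. It is not Metal code. When timing GPU work this way, the timer must stop only after the device has finished the submitted commands, otherwise only the submission cost is measured.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Collected timings in milliseconds, keyed by stage label.
timings_ms = defaultdict(list)


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[label].append((time.perf_counter() - start) * 1e3)


if __name__ == "__main__":
    for _ in range(3):
        with timed("stage_a"):
            sum(i * i for i in range(100_000))
    for label, samples in timings_ms.items():
        print(f"{label}: min {min(samples):.3f} ms over {len(samples)} runs")
```

Reporting the minimum over several runs gives a steadier figure than a single measurement, which helps when comparing two variants of the same kernel.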
  • 30. Lessons Learned and Resources
  • 31. Lessons Learned. 1. Learn, test, profile, and benchmark every component of your system, including the compiler. Don’t assume. 2. Don’t port 1:1. Rework requirements and algorithm logic too. 3. GPU and other non-CPU compute architectures may give fantastic results. 4. Use parallelization and computer vision frameworks like OpenCL or OpenCV. Rewrite critical parts there as needed. 5. Modern HW platforms implement popular algorithms in one function call. Study platform-specific optimization guides. 6. Sometimes things won’t work as documented. This is normal. 7. Optimization is a mix of art and science. Think outside the box.
  • 32. Resources. • Embedded Vision Alliance: http://www.embedded-vision.com/ • Platform optimization guides and blog posts from: Altera (now Intel), AMD, ARM, Imagination Technologies, NVidia, Qualcomm, TI • Luxoft Computer Vision team: vision@luxoft.com
  • 33. Thank you! LUXOFT Presentation R&D Team: Aleksandr Bobrovnik, Aleksandr Volkov, Alexey Rybakov, Anton Veselov, Artem Galin, Dmitriy Marenkov, Dmitry Ivanov, Ekaterina Popova, Ihor Starepravo, Ildar Valiev, Marat Gilmutdinov, Nikolay Nemcev, Oleksandr Murovanyi, Sergey Fedorov, Valery Bobrov, Viktor Pasoshnikov
  • 34. See demos at our booth. And email me too. Alexey Rybakov, Senior Director, Embedded, LUXOFT, Menlo Park, CA. ARybakov@luxoft.com