Optimizing computer vision problems on mobile platforms
Looksery.com
Fedor Polyakov
Software Engineer, CIO
Looksery, INC
fedor@looksery.com
+380 97 5900009 (mobile)
www.looksery.com
Optimize algorithm first
• If your algorithm is suboptimal, “technical” optimizations won’t be as effective as fixing the algorithm itself
• When you change the algorithm, you will probably have to redo your technical optimizations too
SIMD operations
• Single instruction, multiple data
• On NEON (ARMv7), 16 128-bit wide registers, each holding up to 4 int32_t’s/floats (ARM64 has 32 such registers and also handles 2 doubles per register)
• Each instruction takes a few more cycles, but operates on much more data
• Ideally gives a performance boost of up to 4x (in my practice, typically ~2-3x)
• Can be used for many image processing algorithms
• Especially useful for various linear algebra problems
Using computer vision/algebra/DSP libraries
• The easiest way: you just use the library and it does everything for you
• Eigen - great header-only library for linear algebra
• Ne10 - NEON-optimized library for some image processing/DSP on Android
• Accelerate.framework - lots of image processing/DSP on iOS
• OpenCV, unfortunately, is quite weakly optimized for ARM SIMD (though ~40 low-level functions were optimized in OpenCV 3.0)
• There are also some commercial libraries
• + Everything is done without any effort on your part
• - You should still profile and inspect the generated assembly to verify that everything is vectorized as you expect
GCC/clang vector extensions
using v4si = int __attribute__ ((vector_size (VECTOR_SIZE_IN_BYTES)));
v4si x, y;
• All common operations on x are now vectorized
• Written once for all architectures
• Supported operations: +, -, *, /, unary minus, ^, |, &, ~, %, <<, >>, comparisons
• Loading from memory: x = *((v4si*)ptr);
• Storing back to memory: *((v4si*)ptr) = x;
• The subscript operator accesses individual elements
• Not all SIMD operations are supported
• May produce suboptimal code
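A minimal sketch of the vector-extension approach, assuming GCC or clang and a 16-byte vector of four 32-bit ints. The function names addArrays and horizontalSum are hypothetical, and memcpy is used here instead of the pointer cast shown on the slide so that the sketch does not depend on the alignment of the input pointers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Four 32-bit ints packed into one 16-byte vector (GCC/clang extension).
using v4si = std::int32_t __attribute__((vector_size(16)));

// Adds two int arrays element-wise, four lanes at a time.
// n must be a multiple of 4 in this sketch (there is no scalar tail loop).
void addArrays(const std::int32_t* a, const std::int32_t* b,
               std::int32_t* out, int n) {
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        v4si x, y;
        // memcpy sidesteps the alignment requirement of a v4si* cast;
        // optimizing compilers typically lower it to a plain vector load.
        std::memcpy(&x, a + i, sizeof x);
        std::memcpy(&y, b + i, sizeof y);
        v4si sum = x + y;  // one vector add covers four elements
        std::memcpy(out + i, &sum, sizeof sum);
    }
}

// The subscript operator reads individual lanes.
std::int32_t horizontalSum(v4si v) {
    return v[0] + v[1] + v[2] + v[3];
}
```

The same source compiles for ARM and x86; only the generated instructions differ, which is why the generated assembly is still worth checking.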
SIMD intrinsics
• Provide custom data types and a set of C functions for vectorizing code
• Example: float32x4_t vrsqrtsq_f32(float32x4_t a, float32x4_t b);
• Generally similar to the previous approach, but give you better control and the full instruction set
• Cons:
• You have to write separate code for each platform
• In all of the above approaches, the compiler may insert instructions that hand-crafted code could avoid
• The compiler might generate code that doesn’t use the pipeline efficiently
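A hedged sketch of how the vrsqrtsq_f32 intrinsic mentioned above is commonly used: it performs a Newton-Raphson refinement step on top of the coarse estimate from vrsqrteq_f32. The function name reciprocalSqrt and the choice of two refinement steps are assumptions for illustration; the scalar branch exists only so the sketch also builds on non-ARM hosts.

```cpp
#include <cassert>
#include <cmath>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Computes out[i] = 1 / sqrt(in[i]) for n floats.
// n must be a multiple of 4 in this sketch.
void reciprocalSqrt(const float* in, float* out, int n) {
    assert(n % 4 == 0);
#if defined(__ARM_NEON)
    for (int i = 0; i < n; i += 4) {
        float32x4_t x = vld1q_f32(in + i);
        float32x4_t est = vrsqrteq_f32(x);  // coarse hardware estimate
        // Two Newton-Raphson steps; vrsqrtsq_f32(a, b) returns (3 - a*b) / 2.
        est = vmulq_f32(est, vrsqrtsq_f32(vmulq_f32(x, est), est));
        est = vmulq_f32(est, vrsqrtsq_f32(vmulq_f32(x, est), est));
        vst1q_f32(out + i, est);
    }
#else
    // Scalar fallback for hosts without NEON.
    for (int i = 0; i < n; ++i) out[i] = 1.0f / std::sqrt(in[i]);
#endif
}
```

This illustrates the "separate code for each platform" cost: the NEON branch would need an SSE or other counterpart to be vectorized elsewhere.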
Handcrafted ASM code
• Gives you the most control: you know exactly what code will be generated
• If written carefully, can sometimes be up to 2 times faster than the compiler-generated code from the previous approaches (usually 10-15%, though)
• You need to write separate code for each architecture :(
• You need to learn the assembly language
• Harder to write
• Getting the maximum possible performance may require some additional steps
Some other tricks
• Reduce data types to the smallest size that works
• If you can change double to int16_t, you’ll get more than a 4x performance boost
• Try the pld instruction: it “hints” the CPU to load data that will be used in the near future into the cache (available as __builtin_prefetch)
• If you use intrinsics, watch out for extra loads/stores that you may be able to get rid of
• Use loop unrolling
• Interleave load/store instructions with arithmetic operations
• Use proper memory alignment: misaligned access can cause crashes or slow things down
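A small sketch combining three of these tricks, assuming GCC or clang: narrow int16_t data, a __builtin_prefetch hint, and an alignment check. The names sumWithPrefetch and isAligned16 and the prefetch distance kPrefetchAhead are hypothetical; a useful distance has to be measured on the target device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if p is 16-byte aligned, which full-width vector loads prefer.
bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Sums 16-bit samples into a 32-bit accumulator, prefetching ahead.
std::int32_t sumWithPrefetch(const std::int16_t* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    constexpr std::size_t kPrefetchAhead = 64;  // elements, i.e. 128 bytes
    std::int32_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i % 32 == 0 && i + kPrefetchAhead < n) {
            // Hint only: read access (0), low temporal locality (1).
            __builtin_prefetch(data + i + kPrefetchAhead, 0, 1);
        }
        total += data[i];
    }
    return total;
}
```

The prefetch is purely a hint, so the result is the same with or without it; only the timing may change.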
Some benchmarks
• Sum of matrix rows (all rows added together element-wise)
• Matrices are 128x128; the test is repeated 10^5 times
// Non-vectorized code
for (int i = 0; i < matSize; i++) {
    for (int j = 0; j < matSize; j++) {
        rowSum[j] += testMat[i][j];
    }
}
// Vectorized code (vectorSize = number of elements in one VectorType)
for (int i = 0; i < matSize; i++) {
    for (int j = 0; j < matSize; j += vectorSize) {
        VectorType x = *(VectorType*)(testMat[i] + j);
        VectorType y = *(VectorType*)(rowSum + j);
        y += x;
        *(VectorType*)(rowSum + j) = y;
    }
}
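For readers who want to compile the two loops above, here is a self-contained sketch under some assumptions: float elements, a 16-byte GCC/clang vector type, and a flat row-major matrix instead of testMat[i][j]. The names sumRowsScalar, sumRowsVector, VecF and kLanes are hypothetical and not taken from the benchmark source.

```cpp
#include <cassert>
#include <cstring>

using VecF = float __attribute__((vector_size(16)));
constexpr int kLanes = sizeof(VecF) / sizeof(float);

// Scalar reference: adds every row of the n x n matrix into rowSum.
void sumRowsScalar(const float* mat, float* rowSum, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            rowSum[j] += mat[i * n + j];
}

// Vectorized: same traversal order, kLanes columns per iteration.
void sumRowsVector(const float* mat, float* rowSum, int n) {
    assert(n % kLanes == 0);  // no scalar tail loop in this sketch
    for (int i = 0; i < n; ++i) {
        const float* row = mat + i * n;
        for (int j = 0; j < n; j += kLanes) {
            VecF x, y;
            std::memcpy(&x, row + j, sizeof x);
            std::memcpy(&y, rowSum + j, sizeof y);
            y += x;
            std::memcpy(rowSum + j, &y, sizeof y);
        }
    }
}
```

Both functions add the elements of each column in the same order, so their results match exactly and the scalar one can serve as a correctness check.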
Tested on an iPhone 5; results on other phones are much the same
[Chart: time in seconds (axis 0 to 10) for the Simple and Vectorized versions; series: int, float, short]
Got a more than 2x performance boost. Mission completed?
[Chart: time in seconds (axis 0 to 10) for the Simple, Vectorized and Loop unroll versions; series: int, float, short]
Got another ~15%
// Vectorized code, inner loop unrolled 4 times (xSize = elements per VT)
for (int i = 0; i < matSize; i++) {
    auto ptr = testMat[i];
    for (int j = 0; j < matSize; j += 4 * xSize) {
        auto ptrStart = ptr + j;
        VT x1 = *(VT*)(ptrStart + 0 * xSize);
        VT y1 = *(VT*)(rowSum + j + 0 * xSize);
        y1 += x1;
        VT x2 = *(VT*)(ptrStart + 1 * xSize);
        VT y2 = *(VT*)(rowSum + j + 1 * xSize);
        y2 += x2;
        VT x3 = *(VT*)(ptrStart + 2 * xSize);
        VT y3 = *(VT*)(rowSum + j + 2 * xSize);
        y3 += x3;
        VT x4 = *(VT*)(ptrStart + 3 * xSize);
        VT y4 = *(VT*)(rowSum + j + 3 * xSize);
        y4 += x4;
        *(VT*)(rowSum + j + 0 * xSize) = y1;
        *(VT*)(rowSum + j + 1 * xSize) = y2;
        *(VT*)(rowSum + j + 2 * xSize) = y3;
        *(VT*)(rowSum + j + 3 * xSize) = y4;
    }
}
Let’s take a look at the profiler
// Non-vectorized code
for (int i = 0; i < matSize; i++) {
    for (int j = 0; j < matSize; j++) {
        rowSum[i] += testMat[j][i];
    }
}
// Vectorized, loop-unrolled code
for (int i = 0; i < matSize; i += 4 * xSize) {
    // The four accumulators stay in registers for the whole inner loop
    VT y1 = *(VT*)(rowSum + i);
    VT y2 = *(VT*)(rowSum + i + xSize);
    VT y3 = *(VT*)(rowSum + i + 2 * xSize);
    VT y4 = *(VT*)(rowSum + i + 3 * xSize);
    for (int j = 0; j < matSize; j++) {
        VT x1 = *(VT*)(testMat[j] + i);
        VT x2 = *(VT*)(testMat[j] + i + xSize);
        VT x3 = *(VT*)(testMat[j] + i + 2 * xSize);
        VT x4 = *(VT*)(testMat[j] + i + 3 * xSize);
        y1 += x1;
        y2 += x2;
        y3 += x3;
        y4 += x4;
    }
    // rowSum is written once per column block, not once per row
    *(VT*)(rowSum + i) = y1;
    *(VT*)(rowSum + i + xSize) = y2;
    *(VT*)(rowSum + i + 2 * xSize) = y3;
    *(VT*)(rowSum + i + 3 * xSize) = y4;
}
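The reordered loop can be sketched in compilable form as follows, assuming float data, a 16-byte GCC/clang vector type, a flat row-major matrix, and an unroll factor of 2 rather than 4 to keep it short. The names sumRowsReordered, loadVec, VecF and kLanes are hypothetical.

```cpp
#include <cassert>
#include <cstring>

using VecF = float __attribute__((vector_size(16)));
constexpr int kLanes = sizeof(VecF) / sizeof(float);

// Loads kLanes floats from a possibly unaligned address.
inline VecF loadVec(const float* p) {
    VecF v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Each block of 2 * kLanes columns keeps its running sums in two vector
// accumulators while walking down all rows, so rowSum is loaded and
// stored once per block instead of once per row.
void sumRowsReordered(const float* mat, float* rowSum, int n) {
    assert(n % (2 * kLanes) == 0);  // no tail handling in this sketch
    for (int j = 0; j < n; j += 2 * kLanes) {
        VecF acc0 = loadVec(rowSum + j);
        VecF acc1 = loadVec(rowSum + j + kLanes);
        for (int i = 0; i < n; ++i) {
            const float* row = mat + i * n + j;
            acc0 += loadVec(row);
            acc1 += loadVec(row + kLanes);
        }
        std::memcpy(rowSum + j, &acc0, sizeof acc0);
        std::memcpy(rowSum + j + kLanes, &acc1, sizeof acc1);
    }
}
```

The trade-off is that the matrix is now read with a stride of one full row between consecutive loads, so whether this wins depends on the cache behaviour of the device.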
[Chart: time in seconds (axis 0 to 10) for the Simple and Vect + Loop versions; series: int, float, short]
[Chart: time in seconds (axis 0 to 10) for the Simple, Vectorized, Vect + Loop, Eigen, SumOrder and Asm versions; series: float]
Using GPGPU
• Around 1.5 orders of magnitude higher theoretical performance
• On the iPhone 5, the CPU delivers ~800 MFlops, the GPU 28.8 GFlops
• On the iPhone 5S, the CPU delivers ~1.5 GFlops, the GPU 76.4 GFlops!
• Can be very hard to utilize efficiently
• CUDA, obviously, isn’t available on mobile devices
• OpenCL isn’t available on iOS and is barely available on Android
• On iOS, Metal is available for GPGPU, but only starting with the iPhone 5S
• On Android, Google promotes RenderScript for GPGPU
• So, the only cross-platform way is to use OpenGL ES (2.0)
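As a quick check of the "1.5 orders of magnitude" figure, using only the CPU and GPU numbers quoted above (taking ~800 MFlops as 0.8 GFlops):

```latex
\[
\log_{10}\frac{28.8\ \text{GFlops}}{0.8\ \text{GFlops}} = \log_{10} 36 \approx 1.56,
\qquad
\log_{10}\frac{76.4\ \text{GFlops}}{1.5\ \text{GFlops}} \approx \log_{10} 50.9 \approx 1.71
\]
```

So the theoretical GPU advantage is roughly 36x on the iPhone 5 and roughly 51x on the iPhone 5S.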
Common usage of shaders for GPGPU
Image data → Shader 1 → texture containing processed data → Shader 2 → … → results
The results are then either displayed on screen or read back to the CPU
Common problems
• Textures were designed to hold RGBA8 data
• On almost all phones released since 2012, half-float and float textures are supported as input
• Bilinear filtering of float textures may be unsupported or inefficient
• On many devices, writing from a fragment shader to half-float (16-bit) textures is supported
• Emulating fixed-point arithmetic is pretty straightforward
• Emulating floating-point is possible, but a bit tricky and requires more operations
• Changing OpenGL state may be expensive
• For-loops with a non-constant number of iterations are not supported on older devices
• Reading back from the GPU to the CPU is very expensive
• There are some platform-dependent ways to make it faster
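One common way to emulate fixed-point values in RGBA8 textures is to split a 16-bit fixed-point number across two 8-bit channels. The sketch below shows the arithmetic on the CPU in C++ rather than in a shader; the Packed16 struct and the pack/unpack functions are hypothetical names, and the [0, 1) input range is an assumption of this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Two 8-bit channels (e.g. R and G of an RGBA8 texel) holding one value.
struct Packed16 {
    std::uint8_t hi;
    std::uint8_t lo;
};

// Encodes v in [0, 1) as 16-bit fixed point split across two bytes.
Packed16 pack(float v) {
    assert(v >= 0.0f && v < 1.0f);
    const std::uint32_t fixed = static_cast<std::uint32_t>(v * 65536.0f);
    return {static_cast<std::uint8_t>(fixed >> 8),
            static_cast<std::uint8_t>(fixed & 0xFFu)};
}

// Decodes back to float; the round-trip error is below 1/65536.
float unpack(Packed16 p) {
    return static_cast<float>((p.hi << 8) | p.lo) / 65536.0f;
}
```

A shader would do the same split with floor and fract on normalized channel values; the remaining two channels of the texel can carry a second value.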
Tasks that can be solved on OpenGL ES
• Image processing
• Image binarization
• Edge detection (Sobel, Canny)
• Hough transform (though, some parts can’t be implemented on GPU)
• Histogram equalization
• Gaussian blur/other convolutions
• Colorspace conversions
• Many more examples in the GPUImage library for iOS
• For other tasks, it depends on many factors
• We tried to implement our tracking on the GPU, but didn’t get the expected performance boost
Questions?
Thanks for your attention!

 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Optimizing computer vision problems on mobile platforms

  • 1. Optimizing computer vision problems on mobile platforms Looksery.com
  • 2. Fedor Polyakov Software Engineer, CIO Looksery, INC fedor@looksery.com +380 97 5900009 (mobile) www.looksery.com
  • 3. Optimize the algorithm first
      • If your algorithm is suboptimal, “technical” optimizations won’t be as effective as fixing the algorithm itself
      • When you optimize the algorithm, you’ll probably have to change your technical optimizations too
  • 4. SIMD operations
      • Single instruction, multiple data
      • On NEON, 16 128-bit wide registers (each holds up to 4 int32_t’s/floats or 2 doubles)
      • Uses a few more cycles per instruction, but operates on much more data
      • Can ideally give a performance boost of up to 4x (typically ~2-3x in my practice)
      • Can be used for many image processing algorithms
      • Especially useful for various linear algebra problems
  • 5. Using computer vision/algebra/DSP libraries
      • The easiest way - you just use the library and it does everything for you
      • Eigen - great header-only library for linear algebra
      • Ne10 - NEON-optimized library for some image processing/DSP on Android
      • Accelerate.framework - lots of image processing/DSP on iOS
      • OpenCV, unfortunately, is quite weakly optimized for ARM SIMD (though ~40 low-level functions were optimized in OpenCV 3.0)
      • There are also some commercial libraries
      • + Everything is done without any effort on your part
      • - You should still profile and analyze the ASM code to verify that everything is vectorized as you expect
  • 6. GCC/clang vector extensions
        using v4si = int __attribute__ ((vector_size (VECTOR_SIZE_IN_BYTES)));
        v4si x, y;
      • All common operations with x are now vectorized
      • Written once for all architectures
      • Supported operations: +, -, *, /, unary minus, ^, |, &, ~, %, <<, >>, comparisons
      • Load from memory like this: x = *((v4si*)ptr);
      • Store back to memory like this: *((v4si*)ptr) = x;
      • Supports the subscript operator for accessing individual elements
      • Not all SIMD operations are supported
      • May produce suboptimal code
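A minimal sketch of the vector-extension approach described on this slide, assuming 16-byte vectors of four 32-bit ints. The helper name `add_arrays` is hypothetical and not from the slides. The loads and stores go through `std::memcpy` rather than the pointer casts shown on the slide, so the sketch makes no alignment assumptions; compilers typically turn these copies into plain vector loads and stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Four 32-bit ints in one 16-byte vector (GCC/Clang extension).
using v4si = int32_t __attribute__((vector_size(16)));

// Hypothetical helper: a[i] += b[i], four lanes at a time.
// n must be a multiple of 4 (no scalar tail loop in this sketch).
void add_arrays(int32_t* a, const int32_t* b, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        v4si x, y;
        std::memcpy(&x, a + i, sizeof x);  // load 4 ints, no alignment assumed
        std::memcpy(&y, b + i, sizeof y);
        x += y;                            // one element-wise vector add
        std::memcpy(a + i, &x, sizeof x);  // store 4 ints back
    }
}
```

The same source should compile for ARM and x86 targets; which SIMD instructions are actually emitted is worth checking in the generated assembly, as the previous slide suggests.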
  • 7. SIMD intrinsics
      • Provide custom data types and a set of C functions to vectorize code
      • Example: float32x4_t vrsqrtsq_f32(float32x4_t a, float32x4_t b);
      • Generally similar to the previous approach, but give you better control and the full instruction set
      • Cons:
      • Have to write separate code for each platform
      • In all the above approaches, the compiler may inject some instructions that can be avoided in hand-crafted code
      • The compiler might generate code that doesn’t use the pipeline efficiently
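As an illustration of the intrinsics style, here is a small sketch of a multiply-accumulate over float arrays using NEON intrinsics from `arm_neon.h` (`vld1q_f32`, `vmlaq_f32`, `vst1q_f32`). The helper name `multiply_accumulate` is hypothetical. Because intrinsics are platform-specific, the sketch falls back to a scalar loop when NEON is not available, which also shows the "separate code for each platform" cost listed above.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: dst[i] += a[i] * b[i].
// n must be a multiple of 4 (no scalar tail loop in this sketch).
void multiply_accumulate(float* dst, const float* a, const float* b,
                         std::size_t n) {
    assert(n % 4 == 0);
#if defined(__ARM_NEON)
    for (std::size_t i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);    // load 4 floats
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vd = vld1q_f32(dst + i);
        vd = vmlaq_f32(vd, va, vb);           // vd + va * vb, per lane
        vst1q_f32(dst + i, vd);               // store 4 floats
    }
#else
    // Scalar fallback for targets without NEON.
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] += a[i] * b[i];
    }
#endif
}
```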
  • 8. Handcrafted ASM code
      • Gives you the most control - you know what code will be generated
      • So, if written carefully, it can sometimes be up to 2 times faster than the code generated by the compiler using the previous approaches (usually 10-15%, though)
      • You need to write separate code for each architecture :(
      • Need to learn it
      • Harder to create
      • In order to get the maximum performance possible, some additional steps may be required
  • 9. Some other tricks
      • Reduce data types to the smallest that will work
      • If you can change double to int16_t, you’ll get a more than 4x performance boost
      • Try using pld (can be used as __builtin_prefetch) - it “hints” the CPU to load into the caches some data that will be used in the near future
      • If you use intrinsics, watch out for extra loads/stores that you may be able to get rid of
      • Use loop unrolling
      • Interleave load/store instructions and arithmetic operations
      • Use proper memory alignment - misalignment can cause crashes or slow down performance
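A small sketch of the prefetch hint combined with a narrow data type, using the GCC/Clang builtin `__builtin_prefetch`. The helper name `sum_with_prefetch` and the prefetch distance are assumptions for illustration; a useful distance depends on the device and should be found by profiling.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: sums int16_t samples while hinting the cache
// about data that will be needed a little later.
int64_t sum_with_prefetch(const int16_t* data, std::size_t n) {
    // Prefetch distance in elements; an untuned guess, not a measured value.
    constexpr std::size_t kAhead = 64;
    int64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Issue one hint per 32 elements (64 bytes) instead of per element.
        if (i % 32 == 0 && i + kAhead < n) {
            __builtin_prefetch(data + i + kAhead, 0 /* read */, 1 /* low reuse */);
        }
        total += data[i];
    }
    return total;
}
```

For a simple linear scan like this one, hardware prefetchers often do the job already, so the hint is more likely to pay off with irregular or strided access patterns.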
  • 10. Some benchmarks
      • Sum of matrix rows
      • Matrices are 128x128, test is repeated 10^5 times
        // Non-vectorized code
        for (int i = 0; i < matSize; i++) {
            for (int j = 0; j < matSize; j++) {
                rowSum[j] += testMat[i][j];
            }
        }
        // Vectorized code
        for (int i = 0; i < matSize; i++) {
            for (int j = 0; j < matSize; j += vectorSize) {
                VectorType x = *(VectorType*)(testMat[i] + j);
                VectorType y = *(VectorType*)(rowSum + j);
                y += x;
                *(VectorType*)(rowSum + j) = y;
            }
        }
  • 11. Some benchmarks
      • Tested on iPhone 5; results on other phones are pretty much the same
      • [Bar chart: time in seconds (axis 0-10) for the Simple and Vectorized versions, with int, float and short series]
      • Got more than 2x performance boost, mission completed?
  • 12. Some benchmarks
      • [Bar chart: time in seconds (axis 0-10) for the Simple, Vectorized and Loop unroll versions, with int, float and short series]
      • Got another ~15%
        for (int i = 0; i < matSize; i++) {
            auto ptr = testMat[i];
            for (int j = 0; j < matSize; j += 4 * xSize) {
                auto ptrStart = ptr + j;
                VT x1 = *(VT*)(ptrStart + 0 * xSize);
                VT y1 = *(VT*)(rowSum + j + 0 * xSize);
                y1 += x1;
                VT x2 = *(VT*)(ptrStart + 1 * xSize);
                VT y2 = *(VT*)(rowSum + j + 1 * xSize);
                y2 += x2;
                VT x3 = *(VT*)(ptrStart + 2 * xSize);
                VT y3 = *(VT*)(rowSum + j + 2 * xSize);
                y3 += x3;
                VT x4 = *(VT*)(ptrStart + 3 * xSize);
                VT y4 = *(VT*)(rowSum + j + 3 * xSize);
                y4 += x4;
                *(VT*)(rowSum + j + 0 * xSize) = y1;
                *(VT*)(rowSum + j + 1 * xSize) = y2;
                *(VT*)(rowSum + j + 2 * xSize) = y3;
                *(VT*)(rowSum + j + 3 * xSize) = y4;
            }
        }
  • 13. Some benchmarks
      • Let’s take a look at the profiler
  • 14. Some benchmarks
        // Non-vectorized code
        for (int i = 0; i < matSize; i++) {
            for (int j = 0; j < matSize; j++) {
                rowSum[i] += testMat[j][i];
            }
        }
        // Vectorized, loop-unrolled code
        for (int i = 0; i < matSize; i += 4 * xSize) {
            // Keep the four accumulators in registers for the whole inner loop
            VT y1 = *(VT*)(rowSum + i);
            VT y2 = *(VT*)(rowSum + i + xSize);
            VT y3 = *(VT*)(rowSum + i + 2*xSize);
            VT y4 = *(VT*)(rowSum + i + 3*xSize);
            for (int j = 0; j < matSize; j++) {
                VT x1 = *(VT*)(testMat[j] + i);
                VT x2 = *(VT*)(testMat[j] + i + xSize);
                VT x3 = *(VT*)(testMat[j] + i + 2*xSize);
                VT x4 = *(VT*)(testMat[j] + i + 3*xSize);
                y1 += x1;
                y2 += x2;
                y3 += x3;
                y4 += x4;
            }
            // Write the accumulators back once per column block
            *(VT*)(rowSum + i) = y1;
            *(VT*)(rowSum + i + xSize) = y2;
            *(VT*)(rowSum + i + 2*xSize) = y3;
            *(VT*)(rowSum + i + 3*xSize) = y4;
        }
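The reordered loop above keeps its accumulators in registers and touches the output only once per column block. A self-contained sketch of that idea, assuming a row-major float matrix in one contiguous buffer and a single 4-lane accumulator instead of the four unrolled ones, could look like this. The helper name `column_sums` is hypothetical, and unlike the slide code it overwrites the output rather than adding to existing values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Four floats in one 16-byte vector (GCC/Clang extension).
using v4sf = float __attribute__((vector_size(16)));

// Hypothetical helper: out[c] = sum over all rows r of mat[r * cols + c].
// cols must be a multiple of 4 (no scalar tail loop in this sketch).
void column_sums(const float* mat, std::size_t rows, std::size_t cols,
                 float* out) {
    assert(cols % 4 == 0);
    for (std::size_t c = 0; c < cols; c += 4) {
        v4sf acc = {0.0f, 0.0f, 0.0f, 0.0f};  // stays in a register
        for (std::size_t r = 0; r < rows; ++r) {
            v4sf x;
            std::memcpy(&x, mat + r * cols + c, sizeof x);  // load 4 floats
            acc += x;
        }
        std::memcpy(out + c, &acc, sizeof acc);  // one store per column block
    }
}
```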
  • 15. Some benchmarks
      • [Bar chart: time in seconds (axis 0-10) for the Simple and Vect + Loop versions, with int, float and short series]
  • 16. Some benchmarks
      • [Bar chart: time in seconds (axis 0-10) for the Simple, Vectorized, Vect + Loop, Eigen, SumOrder and Asm versions, float only]
  • 17. Using GPGPU
      • Around 1.5 orders of magnitude higher theoretical performance
      • On iPhone 5, the CPU delivers ~800 MFlops, the GPU 28.8 GFlops
      • On iPhone 5S, the CPU delivers ~1.5 GFlops, the GPU 76.4 GFlops!
      • Can be very hard to utilize efficiently
      • CUDA, obviously, isn’t available on mobile devices
      • OpenCL isn’t available on iOS and is hardly available on Android
      • On iOS, Metal is available for GPGPU, but only starting with iPhone 5S
      • On Android, Google promotes Renderscript for GPGPU
      • So, the only cross-platform way is to use OpenGL ES (2.0)
  • 18. Common usage of shaders for GPGPU
      • [Diagram: Image / Data -> Shader 1 -> Texture containing processed data -> Shader 2 -> … -> Results -> Display on screen / Read back to CPU]
  • 19. Common problems
      • Textures were designed to hold RGBA8 data
      • On almost all phones since 2012, half-float and float textures are supported as input
      • Bilinear filtering for float textures may be unsupported or inefficient
      • On many devices, writing from the fragment shader to half-float (16 bit) textures is supported
      • Emulating fixed-point arithmetic is pretty straightforward
      • Emulating floating-point is possible, but a bit tricky and requires more operations
      • Changing OpenGL state may be expensive
      • For-loops with a non-const number of iterations are not supported on older devices
      • Reading from GPU to CPU is very expensive
      • There are some platform-dependent ways to make it faster
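To make the fixed-point emulation point concrete, here is a CPU-side sketch of one common scheme: a value in [0, 1] is quantized to 16 bits and split across two 8-bit channels of an RGBA8 texel. The struct `Rg8` and the functions `pack_unorm16` and `unpack_unorm16` are hypothetical names for illustration; this is not the author's shader code, and a fragment shader would express the same arithmetic in its own language.

```cpp
#include <cstdint>

// Two 8-bit channels of an RGBA8 texel (hypothetical layout).
struct Rg8 {
    uint8_t r;  // high byte
    uint8_t g;  // low byte
};

// Quantize a value in [0, 1] to 16-bit fixed point and split it
// across two 8-bit channels. Out-of-range inputs are clamped.
Rg8 pack_unorm16(float v) {
    if (v < 0.0f) v = 0.0f;
    if (v > 1.0f) v = 1.0f;
    uint32_t q = static_cast<uint32_t>(v * 65535.0f + 0.5f);  // round to nearest
    return Rg8{static_cast<uint8_t>(q >> 8), static_cast<uint8_t>(q & 0xFF)};
}

// Recombine the two channels into a value in [0, 1].
float unpack_unorm16(Rg8 p) {
    uint32_t q = (static_cast<uint32_t>(p.r) << 8) | p.g;
    return static_cast<float>(q) / 65535.0f;
}
```

The round trip loses at most about half a quantization step (roughly 1/131070), which is why this kind of packing is usually enough for intermediate image data but not a general replacement for float textures.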
  • 20. Tasks that can be solved on OpenGL ES
      • Image processing
          • Image binarization
          • Edge detection (Sobel, Canny)
          • Hough transform (though some parts can’t be implemented on GPU)
          • Histogram equalization
          • Gaussian blur/other convolutions
          • Colorspace conversions
          • Many more examples in the GPUImage library for iOS
      • For other tasks, it depends on many factors
      • We tried to implement our tracking on GPU, but didn’t get the expected performance boost