4. Tracking algorithm
- The algorithm is based on the Active Appearance Model.
- Algorithm complexity is independent of image size.
- You can control the balance between tracking quality and tracking speed
using only two constants.
- The algorithm is iterative and solves a least-squares problem at each iteration.
- On average 5 iterations per frame: maximum 10, minimum 1.
- To run at 30 fps you have to perform about 150 iterations per second.
5. Optimisation flow
—— : Algorithm asymptotic optimisation
3 FPS: First implementation
8 FPS: Memory preallocation
10 FPS: Algorithm parameters optimisation
13 FPS: Matrix storage optimisation and removing OOP code
18 FPS: Rewriting bottleneck code in assembler
24 FPS: Asymptotic optimisation of matrix multiplication
27 FPS: Replacing float operations with int operations
30 FPS: Multithreading
6. From float to int
G[i][j] = (X[i][j] - Y[i][j]) / d[j];
We had to build the so-called pseudo-inverse of G, that is (G^T * G)^-1 * G^T.
So we have to perform many multiplication operations. Multiplication of two ints
is much faster than multiplication of two floats. Let's create an int matrix V:
V[i][j] = X[i][j] - Y[i][j];
And a float matrix D:
D[i][j] = (i == j ? 1.0f / d[i] : 0); // diagonal matrix of reciprocals
Then G = V * D. From linear algebra, G^T * G = D * (V^T * V) * D, so the large
product V^T * V needs only int multiplications.
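A minimal sketch of the G = V * D factorisation, assuming D holds the reciprocals 1/d[j] on its diagonal. Since D is diagonal, G^T * G = D * (V^T * V) * D, so the inner products can be accumulated in integers and scaled once per output element. The helper name `gramViaInt` and its parameters are hypothetical and not taken from the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: computes G^T * G for G = V * D, where V is an int
// matrix (rows x cols, row-major) and D = diag(scale) is a float diagonal.
// Uses G^T * G = D * (V^T * V) * D, so the O(rows * cols^2) part is int-only.
std::vector<float> gramViaInt(const std::vector<int32_t>& V,
                              const std::vector<float>& scale,
                              std::size_t rows, std::size_t cols) {
    assert(V.size() == rows * cols && scale.size() == cols);
    std::vector<float> gram(cols * cols);
    for (std::size_t i = 0; i < cols; ++i) {
        for (std::size_t j = i; j < cols; ++j) {
            int64_t acc = 0;  // integer accumulation of (V^T * V)[i][j]
            for (std::size_t r = 0; r < rows; ++r)
                acc += static_cast<int64_t>(V[r * cols + i]) * V[r * cols + j];
            // Apply the diagonal scaling once per element; the result is symmetric.
            float value = scale[i] * static_cast<float>(acc) * scale[j];
            gram[i * cols + j] = gram[j * cols + i] = value;
        }
    }
    return gram;
}
```

Only the upper triangle is computed and then mirrored, which is the same symmetry trick as the `GTG[i][j] = GTG[j][i]` assignment in the later slides.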
7. Demo benchmarks
Code:
const int ITERATIONS = 2000000000;
long long sum = 0;
for (int i = 0; i < ITERATIONS; i++)
    sum += i * (long long)i;
cout << sum << endl;
Time: 0.00 sec
Code:
const int ITERATIONS = 2000000000;
long long sum = 0;
for (int i = 0; i < ITERATIONS; i++)
    sum += i * (long long)i / 3;
cout << sum << endl;
Time: 2.10 sec
Code:
const int ITERATIONS = 2000000000;
float sum = 0;
for (int i = 0; i < ITERATIONS; i++)
    sum += i * (float)i / 3;
cout << sum << endl;
Time: 4.29 sec
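Measurements like these could be reproduced with a small harness such as the sketch below. It is an assumption about how one might time the kernels, not the code used for the slides: `secondsTaken`, `intKernel` and `floatKernel` are hypothetical names, and the iteration count is a parameter so that the kernels can also be checked on small inputs. Returning the sum from each kernel keeps the result observable, which matters because a compiler is free to simplify or drop a loop whose result is unused. That may be one reason for the 0.00 sec figure, although the slides do not say so.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: runs `body` once and returns the elapsed wall time in seconds.
template <typename F>
double secondsTaken(F&& body) {
    auto start = std::chrono::steady_clock::now();
    body();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Integer multiply-and-divide kernel, as in the second benchmark,
// but with a caller-chosen iteration count.
int64_t intKernel(int iterations) {
    assert(iterations >= 0);
    int64_t sum = 0;
    for (int i = 0; i < iterations; i++)
        sum += i * static_cast<int64_t>(i) / 3;
    return sum;
}

// Float kernel, as in the third benchmark.
float floatKernel(int iterations) {
    assert(iterations >= 0);
    float sum = 0;
    for (int i = 0; i < iterations; i++)
        sum += i * static_cast<float>(i) / 3;
    return sum;
}
```

A call such as `secondsTaken([&] { result = intKernel(n); })` times one kernel; absolute numbers depend on the compiler, optimisation flags and CPU.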
8. Matrix multiplication optimisations
1) Don't create a matrix whose size is a power of two. The cache uses a simple hash function to
select the cache line in which the memory will be cached. This hash is just
some low bits (e.g. 16) of the memory address.
When you use a matrix whose size is a power of two, the start of every row has the same
lowest address bits, so the cache holds only one row instead of nearly the whole
matrix.
2) Change the order of matrix multiplication: to multiply two matrices n x m and m x s
you have to perform n * m * s operations. If you want to multiply the matrices
A(n x m) * B(m x s) * C(s x k), you can do it in two ways with the same result:
(A * B) * C with n*m*s + n*s*k operations,
or
A * (B * C) with m*s*k + n*m*k operations.
In general n*m*s + n*s*k != m*s*k + n*m*k, so choose the cheaper order.
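The choice in point 2 can be made mechanically from the four dimensions. The sketch below only evaluates the two operation counts given above; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Scalar multiplications needed for (A * B) * C with A: n x m, B: m x s, C: s x k.
int64_t costLeftFirst(int64_t n, int64_t m, int64_t s, int64_t k) {
    return n * m * s + n * s * k;
}

// Scalar multiplications needed for A * (B * C).
int64_t costRightFirst(int64_t n, int64_t m, int64_t s, int64_t k) {
    return m * s * k + n * m * k;
}

// Returns true when (A * B) * C is at least as cheap as A * (B * C).
bool preferLeftFirst(int64_t n, int64_t m, int64_t s, int64_t k) {
    assert(n > 0 && m > 0 && s > 0 && k > 0);
    return costLeftFirst(n, m, s, k) <= costRightFirst(n, m, s, k);
}
```

For example, with n = 10, m = 1000, s = 10, k = 1000 the left-first order needs 200000 multiplications and the right-first order needs 20000000, a factor of 100.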
9. Hello assembler
int *row = GT[i];
for (int j = i, pos = (int)(i * GT.columnCount()); j < GT.rowCount(); j++)
{
    int curr = 0;
    for (int k = 0; k < GT.columnCount(); k++, pos++)
        curr += row[k] * GT.val[pos];
    GTG[i][j] = GTG[j][i] = curr;
}
It looks optimised enough. Is there anything we can improve?
Well, let's have a look at the ASM code.
0x149ac2: ldr.w lr, [r5, r9, lsl #2]
0x149ac6: add.w r9, r9, #0x1
0x149aca: cmp r9, r2
0x149acc: ldr r8, [r12], #4
0x149ad0: mla r11, lr, r8, r11
0x149ad4: blo 0x149ac2 ;at AppearanceTracker.cpp:555
No SIMD instructions there :(
10. Let’s add some SIMD
int *row = GT[i];
int *rowInit = row;
int *rowPos = GT.val + i * GT.columnCount();
int *rowEnd = row + processedCnt;
for (int j = i; j < GT.rowCount(); j++)
{
    row = rowInit;
    int accum[8] = {0};
    __asm__ volatile
    (
        "vld1.32 {d8-d11}, [%[accum]] \n\t"
        "L_mulStart%=: \n\t"
        "vld1.32 {d0-d3}, [%[row]]! \n\t"
        "vld1.32 {d4-d7}, [%[val]]! \n\t"
        "vmla.i32 q4, q2, q0 \n\t"
        "vmla.i32 q5, q3, q1 \n\t"
        "cmp %[row], %[rowEnd] \n\t"
        "blo L_mulStart%= \n\t"
        "vst1.32 {d8-d11}, [%[accum]] \n\t"
        : [row] "+r" (row), [val] "+r" (rowPos)
        : [rowEnd] "r" (rowEnd), [accum] "r" (accum)
        : "q0", "q1", "q2", "q3", "q4", "q5", "cc", "memory" // registers and state the asm modifies
    );
    // sum up the 8 values from accum
    // process the remainder (mod 8)
}
For comparison, the scalar version:
int *row = GT[i];
for (int j = i, pos = (int)(i * GT.columnCount()); j < GT.rowCount(); j++)
{
    int curr = 0;
    for (int k = 0; k < GT.columnCount(); k++, pos++)
        curr += row[k] * GT.val[pos];
    GTG[i][j] = GTG[j][i] = curr;
}
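The SIMD loop leaves two steps as comments: combining the 8 accumulators and handling the elements left over when the row length is not a multiple of 8. The sketch below shows that structure in portable C++, without NEON instructions, so it illustrates the bookkeeping only. `dotProduct8` is a hypothetical name, and whether a compiler vectorises it depends on the target and flags.

```cpp
#include <cassert>
#include <cstddef>

// Portable sketch of the structure used by the NEON loop: 8 independent
// partial sums over blocks of 8 elements, a final reduction, and a scalar tail.
int dotProduct8(const int* a, const int* b, std::size_t count) {
    assert(a != nullptr && b != nullptr);
    int accum[8] = {0};
    std::size_t processed = count - count % 8;  // largest multiple of 8 <= count
    for (std::size_t k = 0; k < processed; k += 8)
        for (std::size_t lane = 0; lane < 8; ++lane)
            accum[lane] += a[k + lane] * b[k + lane];
    int result = 0;
    for (int lane = 0; lane < 8; ++lane)  // combine the 8 partial sums
        result += accum[lane];
    for (std::size_t k = processed; k < count; ++k)  // remainder (count mod 8)
        result += a[k] * b[k];
    return result;
}
```

Here `processed` plays the role that `processedCnt` appears to play in the slide code, namely the part of the row handled in blocks of 8.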
12. Some hardware issues
Task: crop a square from a CMSampleBuffer (that contains a CVImageBufferRef)
and write it using AVAssetWriterInputPixelBufferAdaptor.
[Diagram: input buffer address and target image address]
BAD: create the CMSampleBuffer by just moving the base address and setting
a new height. O(1) operation.
GOOD: create the CMSampleBuffer by creating a new CVPixelBufferRef
from CVTextureCache and copying the image. O(Height*Width) operation.
13. iOS 8 strikes back
iPhone 5S iOS 7.1 - 30 FPS
iPhone 5S iOS 8.0 - 15 FPS O_o
Possible reasons:
1) Memory corruption in the C++ core code
2) iOS 8 QOS:
Wrong queue priority: QOS_CLASS_BACKGROUND instead of QOS_CLASS_USER_INITIATED
3) Blinking of this guy