2. OUTLINE
• Current Work
• Compute Integral Image – parallel version
• Why is the difference so small?
• An Accidental Error
• In Progress
• Compute 11 types of Features
3. COMPUTE INTEGRAL IMAGE – PARALLEL VERSION
• Computation and communication time
input 16x16:
serial version: 0.006336 ms
parallel version, for loop outside the kernel function: 6.80778 ms
parallel version, for loop inside the kernel function: 5.88559e-39 ms
4. COMPUTE INTEGRAL IMAGE (CONT.)
input 640x480:
serial version: 5.1607 ms
parallel version: 4.94058 ms
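The slides do not list the serial version timed above. A minimal sketch of a two-pass serial integral image follows, assuming a row-major float buffer. The function names (computeByRowSerial, computeByColumnSerial, integralImage) and their signatures are hypothetical, chosen only to echo the kernel names in the profile; the actual implementation may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row pass: replace each pixel with the running sum of its row, left to right.
void computeByRowSerial(std::vector<float>& img, std::size_t width, std::size_t height) {
    assert(img.size() == width * height);
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 1; x < width; ++x)
            img[y * width + x] += img[y * width + x - 1];
}

// Column pass: replace each pixel with the running sum of its column, top to bottom.
void computeByColumnSerial(std::vector<float>& img, std::size_t width, std::size_t height) {
    assert(img.size() == width * height);
    for (std::size_t y = 1; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            img[y * width + x] += img[(y - 1) * width + x];
}

// Integral image: entry (x, y) holds the sum of all input pixels (i, j)
// with i <= x and j <= y. Takes the input by value so the caller's copy is kept.
std::vector<float> integralImage(std::vector<float> img, std::size_t width, std::size_t height) {
    computeByRowSerial(img, width, height);
    computeByColumnSerial(img, width, height);
    return img;
}
```

For a 3x2 input {1, 2, 3, 4, 5, 6}, the row pass gives {1, 3, 6, 4, 9, 15} and the column pass then gives {1, 3, 6, 5, 12, 21}.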
5. WHY IS THE DIFFERENCE SO SMALL?
• Profile:
Time : 4.91024 ms
======== Profiling result:
Time(%)  Time      Calls  Avg       Min       Max       Name
71.71    2.75ms    1      2.75ms    2.75ms    2.75ms    computeByColumn(float*, int)
10.91    418.56us  2      209.28us  209.06us  209.50us  [CUDA memcpy HtoD]
10.08    386.46us  2      193.23us  191.10us  195.36us  [CUDA memcpy DtoH]
7.31     280.22us  1      280.22us  280.22us  280.22us  computeByRow(float*, int, int)
Memory is accessed non-contiguously.
Memory access is too time-consuming.
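The memory-access remark above can be made concrete by looking at which addresses neighbouring threads touch in the same scan step. On CUDA hardware, global-memory accesses from the threads of a warp are combined into fewer transactions when they fall on adjacent addresses. The sketch below is host-side C++, not kernel code. It assumes a row-major image and two hypothetical thread mappings: one thread per row and one thread per column. Which kernel in the profile corresponds to which case depends on how it maps threads to pixels, which the slides do not show.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed mapping A: one thread per row. At scan step `step`,
// thread t touches pixel (x = step, y = t) of a row-major image.
std::vector<std::size_t> addressesThreadPerRow(std::size_t width, std::size_t threads,
                                               std::size_t step) {
    assert(step < width);
    std::vector<std::size_t> addr(threads);
    for (std::size_t t = 0; t < threads; ++t)
        addr[t] = t * width + step;
    return addr;
}

// Assumed mapping B: one thread per column. At scan step `step`,
// thread t touches pixel (x = t, y = step) of a row-major image.
std::vector<std::size_t> addressesThreadPerColumn(std::size_t width, std::size_t threads,
                                                  std::size_t step) {
    assert(threads <= width);
    std::vector<std::size_t> addr(threads);
    for (std::size_t t = 0; t < threads; ++t)
        addr[t] = step * width + t;
    return addr;
}

// Distance, in elements, between the addresses touched by threads 0 and 1.
std::size_t neighbourStride(const std::vector<std::size_t>& addr) {
    assert(addr.size() >= 2);
    return addr[1] - addr[0];
}
```

For a 640-pixel-wide image, mapping A puts neighbouring threads 640 elements apart (non-contiguous), while mapping B puts them 1 element apart (contiguous).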