2. OUTLINE
• Current Work
• Compute Integral Image – computeByRow
Using shared memory
Measure time in CUDA
Result
Conclusion
3. USING SHARED MEMORY
• Scope: block
• Each thread deals with one row; in every iteration:
Write to shared memory first
Read the previous result from shared memory
4. USING SHARED MEMORY (CONT.)
• Scope: block
• Each thread deals with one row
Store the result in shared memory
Write back to global memory at the end
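The per-row scheme above can be sketched as a kernel. The name computeByRow matches the profiler output later in these slides, but the exact launch configuration and indexing here are assumptions, not the original code:

```cuda
// Sketch: one thread per row, blockDim.y rows per block.
// Shared memory holds blockDim.y * width floats (passed at launch).
__global__ void computeByRow(float *image, int width, int height)
{
    extern __shared__ float rows[];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= height) return;

    float *myRow = rows + threadIdx.y * width;

    // Running prefix sum: write each partial result to shared memory
    // first, then read the previous result back from shared memory.
    myRow[0] = image[row * width];
    for (int col = 1; col < width; ++col)
        myRow[col] = myRow[col - 1] + image[row * width + col];

    // Write the finished row back to global memory at the end.
    for (int col = 0; col < width; ++col)
        image[row * width + col] = myRow[col];
}
```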
5. USING SHARED MEMORY (CONT.)
• Limitation: 49152 bytes (48 KB) of shared memory per block
Float: 4 bytes => 12288 floats per block
12288 floats / width => X rows per block
• Segment a large image into several parts
Avoid exceeding the shared-memory limit
6. USING SHARED MEMORY (CONT.)
• 49152 bytes of shared memory per block
Float: 4 bytes => 12288 floats
12288 floats / 641 => 19 rows per block
19 rows per block, 26 segments (height: 481)
8. MEASURE TIME IN CUDA
• cudaThreadSynchronize()
deprecated; behaves the same as cudaDeviceSynchronize()
returns an error if one of the preceding tasks has failed
• cudaDeviceSynchronize()
blocks until the device has completed all previously requested tasks
• The former was deprecated because its name does not reflect its behavior
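Kernel times like those on the following slides can be measured with CUDA events; this is a sketch, and the kernel name and launch configuration are placeholders, not the original code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void computeByRow(float *image, int width, int height) { /* ... */ }

int main()
{
    float *d_image = nullptr;
    int width = 641, height = 481;
    cudaMalloc(&d_image, width * height * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    computeByRow<<<1, height>>>(d_image, width, height);
    cudaEventRecord(stop);

    // Blocks until the stop event has been recorded, i.e. until the
    // kernel has finished (cudaDeviceSynchronize() would also work).
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_image);
    return 0;
}
```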
9. RESULT
• 16x16 (using shared memory sized to the full image)
• Serial version: 0.00656 ms
• Parallel version: 0.197344 ms
======== Profiling result:
Time(%)  Time     Calls  Avg      Min      Max      Name
 56.85   19.73us  1      19.73us  19.73us  19.73us  computeByRow(float*, int, int)
 25.17    8.73us  1       8.73us   8.73us   8.73us  computeByColumn(float*, int, int)
 12.54    4.35us  2       2.18us   2.18us   2.18us  [CUDA memcpy DtoH]
  5.44    1.89us  2      944ns    928ns    960ns    [CUDA memcpy HtoD]
10. RESULT (CONT.)
• 640*480
• Using shared memory per line
• Serial version: 5.11238 ms
• Parallel version: 4.361386 ms
======== Profiling result:
Time(%)  Time      Calls  Avg       Min       Max       Name
 66.36   2.18ms    1      2.18ms    2.18ms    2.18ms    computeByRow(float*, int, int)
 12.72   418.14us  2      209.07us  208.45us  209.70us  [CUDA memcpy HtoD]
 11.75   386.21us  2      193.10us  191.04us  195.17us  [CUDA memcpy DtoH]
  9.16   301.24us  1      301.24us  301.24us  301.24us  computeByColumn(float*, int, int)
11. RESULT (CONT.)
• 640*480
• Using segment image and shared memory
• Serial version: 5.11238 ms
• Parallel version: 70.0833 ms
======== Profiling result:
Time(%)  Time      Calls  Avg       Min       Max       Name
 98.22   66.23ms   26     2.55ms    2.55ms    2.55ms    computeByRow(float*, int, int)
  0.69   467.46us  27     17.31us   9.79us    209.76us  [CUDA memcpy HtoD]
  0.64   429.60us  27     15.91us   8.93us    195.58us  [CUDA memcpy DtoH]
  0.45   301.18us  1      301.18us  301.18us  301.18us  computeByColumn(float*, int, int)
12. CONCLUSION
• The segmentation method does not improve performance
• A new method is needed for writing the large amount of data from shared memory back to global memory