2. OUTLINE
• Current Work
• Compute Integral Image – computeByRow
Using shared memory
Measure time in CUDA
Result
Conclusion
3. USING SHARED MEMORY
• Scope: block
• Each thread deals with one row; in every iteration:
Write to shared memory first
Read the previous result from shared memory
4. USING SHARED MEMORY (CONT.)
• Scope: block
• Each thread deals with one row
Store the result in shared memory
Write back to global memory at the end
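The per-row scheme above can be sketched as a kernel. The name computeByRow matches the profiler output later in these slides, but the exact launch configuration and indexing here are assumptions, not the original code:

```cuda
// Sketch: one thread per row, blockDim.y rows per block.
// Shared memory holds blockDim.y * width floats (passed at launch).
__global__ void computeByRow(float *image, int width, int height)
{
    extern __shared__ float rows[];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= height) return;

    float *myRow = rows + threadIdx.y * width;

    // Running prefix sum: write each partial result to shared memory
    // first, then read the previous result back from shared memory.
    myRow[0] = image[row * width];
    for (int col = 1; col < width; ++col)
        myRow[col] = myRow[col - 1] + image[row * width + col];

    // Write the finished row back to global memory at the end.
    for (int col = 0; col < width; ++col)
        image[row * width + col] = myRow[col];
}
```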
5. USING SHARED MEMORY (CONT.)
• Limitation: 49152 bytes (48 KB) of shared memory per block
Float: 4 bytes => 12288 floats per block
12288 floats / width => X rows per block
• Segment a large image into several parts
Avoid exceeding the shared-memory limit
6. USING SHARED MEMORY (CONT.)
• 49152 bytes of shared memory per block
Float: 4 bytes => 12288 floats
12288 floats / 641 => 19 rows per block
19 rows per block, 26 segments (height: 481)
8. MEASURE TIME IN CUDA
• cudaThreadSynchronize()
deprecated; behaves the same as cudaDeviceSynchronize()
returns an error if one of the preceding tasks has failed
• cudaDeviceSynchronize()
blocks until the device has completed all previously requested tasks
• The former was deprecated because its name does not reflect its behavior
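Kernel times like those on the following slides can be measured with CUDA events; this is a sketch, and the kernel name and launch configuration are placeholders, not the original code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void computeByRow(float *image, int width, int height) { /* ... */ }

int main()
{
    float *d_image = nullptr;
    int width = 641, height = 481;
    cudaMalloc(&d_image, width * height * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    computeByRow<<<1, height>>>(d_image, width, height);
    cudaEventRecord(stop);

    // Blocks until the stop event has been recorded, i.e. until the
    // kernel has finished (cudaDeviceSynchronize() would also work).
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_image);
    return 0;
}
```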
9. RESULT
• 16x16 (using shared memory sized to the full image)
• Serial version: 0.00656 ms
• Parallel version: 0.197344 ms
======== Profiling result:
Time(%)  Time     Calls  Avg      Min      Max      Name
 56.85   19.73us  1      19.73us  19.73us  19.73us  computeByRow(float*, int, int)
 25.17    8.73us  1       8.73us   8.73us   8.73us  computeByColumn(float*, int, int)
 12.54    4.35us  2       2.18us   2.18us   2.18us  [CUDA memcpy DtoH]
  5.44    1.89us  2      944ns    928ns    960ns    [CUDA memcpy HtoD]
10. RESULT (CONT.)
• 640*480
• Using shared memory per line
• Serial version: 5.11238 ms
• Parallel version: 4.361386 ms
======== Profiling result:
Time(%)  Time      Calls  Avg       Min       Max       Name
 66.36   2.18ms    1      2.18ms    2.18ms    2.18ms    computeByRow(float*, int, int)
 12.72   418.14us  2      209.07us  208.45us  209.70us  [CUDA memcpy HtoD]
 11.75   386.21us  2      193.10us  191.04us  195.17us  [CUDA memcpy DtoH]
  9.16   301.24us  1      301.24us  301.24us  301.24us  computeByColumn(float*, int, int)
11. RESULT (CONT.)
• 640*480
• Using segment image and shared memory
• Serial version: 5.11238 ms
• Parallel version: 70.0833 ms
======== Profiling result:
Time(%)  Time      Calls  Avg       Min       Max       Name
 98.22   66.23ms   26     2.55ms    2.55ms    2.55ms    computeByRow(float*, int, int)
  0.69   467.46us  27     17.31us   9.79us    209.76us  [CUDA memcpy HtoD]
  0.64   429.60us  27     15.91us   8.93us    195.58us  [CUDA memcpy DtoH]
  0.45   301.18us  1      301.18us  301.18us  301.18us  computeByColumn(float*, int, int)
12. CONCLUSION
• The segmentation method does not improve performance
• A new method is needed for writing the large amount of data from shared memory back to global memory