
Balancing Power & Performance Webinar


One of the biggest issues for a developer, whether an engineer at an OEM or working for a mobile AI application startup, is that their apps are at the mercy of power and performance settings preset by OEMs or silicon vendors. So how can a developer break through that barrier when their hands seem to be tied? The Snapdragon Power Optimization SDK lets developers control CPU and GPU frequency much more finely from their own application logic, giving them more control within the bounds of the power/thermal framework.

Published in: Technology

Balancing Power & Performance Webinar

  1. Balancing Power & Performance for Mobile Applications. Aravind Raghavan, Staff Engineer, Qualcomm Technologies, Inc.
  2. Qualcomm® Snapdragon™ Heterogeneous Compute SDK: What is it?
     • APIs to manage task execution across CPU, GPU and DSP
     • Efficient data management between compute cores
     • Provides abstraction from low-level system calls and data management
     • Integrates with existing development environment
     • C++11, OpenCL, OpenGL, Qualcomm® Hexagon™ SDK (DSP)
     [Diagram: Userspace Application, Heterogeneous Compute SDK (Patterns, Affinity, Tasks, Buffers), Snapdragon (CPU, GPU, DSP)]
     Qualcomm Snapdragon and Qualcomm Hexagon are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
  3. Kernel: Computation to be executed on CPU/GPU/DSP
     • Your existing algorithms
     • Actual unit of work
     • CPU Kernel: C++ functors, lambda, or function pointers
     • GPU Kernel: OpenCL or OpenGL kernel
     • DSP Kernel: DSP kernel written using Hexagon SDK
     • Attributes: Affinity, Blocking
     • In Beta
       ◦ Poly Kernel: Write all, Run Somewhere
       ◦ Point Kernel: Write all, Run Everywhere
  4. Kernel code sample: Function doubles values of an input vector
  5. Kernel code sample: Create a CPU kernel for vector_double
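The code from slides 4 and 5 is not part of this transcript. A minimal sketch of the idea, using only standard C++11, might look like the following. The name `vector_double` comes from the slide, but its signature is an assumption, and `cpu_kernel`, `make_cpu_kernel` and `run_kernel` are hypothetical stand-ins built on `std::function`, not the Heterogeneous Compute SDK API.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Doubles every element of `values` in place.
// The name comes from the slide; the signature is an assumption.
void vector_double(std::vector<int>& values) {
    for (std::size_t i = 0; i < values.size(); ++i) {
        values[i] *= 2;
    }
}

// Hypothetical stand-in for a CPU kernel: a callable wrapper around existing
// code that a runtime could schedule later. NOT the SDK's kernel type.
using cpu_kernel = std::function<void(std::vector<int>&)>;

// Wraps vector_double in the stand-in kernel type.
cpu_kernel make_cpu_kernel() {
    return cpu_kernel(vector_double);
}

// Runs a kernel synchronously on the calling thread.
void run_kernel(const cpu_kernel& kernel, std::vector<int>& data) {
    assert(static_cast<bool>(kernel));  // the wrapper must hold a callable
    kernel(data);
}
```

Calling `run_kernel(make_cpu_kernel(), v)` on a vector holding 1, 2, 3 leaves it holding 2, 4, 6. The point of the wrapper is that the algorithm itself stays unchanged; only the way it is handed to a runtime differs.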
  6. Affinity: Control placement of algorithm execution
     • CPU core selection APIs
     • Use APIs with
       ◦ Standalone functions
       ◦ Tasks, CPU Kernel abstractions
     • Benefit: improve performance and save power
  7. Affinity: Control placement of algorithm execution
     • Standalone functions: encapsulate your existing code with execute(settings, fn, fn_args)
     • Task, Kernel: use set_big() and set_little() with Tasks/CPU Kernel
     • Location: chooses the CPU where the program construct should run
     • Pinning: determines whether the thread can migrate freely among cores
     • Mode: override or adhere to local affinity settings
  8. Affinity code sample: Tell the CPU Kernel to run in the big cluster
  9. Patterns: Simplify parallel programming
     • Commonly used parallel CPU programming constructs
       ◦ Data parallelism
       ◦ Multi-branch recursion (divide & conquer)
       ◦ Pipeline computation
     • Optimize parallel execution further using Pattern Tuners
     • In Beta: some patterns can execute across CPU, GPU, DSP
  10. Patterns: Parallelize commonly occurring algorithmic constructs
      Pattern Name | Description
      hetcompute::pfor_each | Processes the elements of a collection in parallel
      hetcompute::ptransform | Performs a map operation on all elements of a collection, returns a new collection
      hetcompute::pscan | Performs an in-place parallel prefix operation for all elements of a collection
      hetcompute::preduce | Combines all the elements in a collection into one using an associative binary operator
      hetcompute::pdivide_and_conquer | Divides problems into sub-problems, solves them, and merges their solutions in parallel
      hetcompute::pipeline | A sequence of processing stages that can execute concurrently on a data stream
  11. Programmer hints: Pattern Tuner. Customize parallel algorithm execution for finer optimizations
      Pattern Tuner API | Description
      set_chunk_size(size_t sz) | Smallest granularity for load balancing. If the computational kernel is small, set a large chunk size to minimize the synchronization overhead.
      set_max_doc(size_t doc) | Max degree of concurrency; the default is the number of available device threads
      set_static() | Use a static chunking algorithm as the parallelization backend
      set_dynamic() | Use a dynamic workload balancing algorithm as the parallelization backend
      set_shape(pattern::shape) | Set the shape of workload distribution across the range of work-items
      set_cpu_load(), set_gpu_load(), set_dsp_load() | Set the fraction of workload to schedule on CPU, GPU, DSP
  12. Patterns code sample: Parallelize vector_double across all CPUs
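The slide's code is not included in this transcript. As a rough illustration of what a parallel for-each pattern does, the sketch below splits an index range into one contiguous chunk per thread (a static split) using `std::thread`. `parallel_for_each_index` and `vector_double_parallel` are hypothetical names; this is not `hetcompute::pfor_each` and makes no claim about how the SDK schedules work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a parallel for-each pattern: splits [0, n) into
// contiguous chunks and runs body(i) for every index on worker threads.
template <typename Body>
void parallel_for_each_index(std::size_t n, std::size_t num_threads, Body body) {
    assert(num_threads > 0);
    // ceil(n / num_threads); zero only when n is zero, in which case the
    // loop below does not execute at all.
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(n, begin + chunk);
        workers.emplace_back([begin, end, &body] {
            for (std::size_t i = begin; i < end; ++i) body(i);
        });
    }
    for (std::thread& w : workers) w.join();
}

// Doubles every element; each index is written by exactly one thread,
// so no locking is needed.
void vector_double_parallel(std::vector<int>& values, std::size_t num_threads) {
    parallel_for_each_index(values.size(), num_threads,
                            [&values](std::size_t i) { values[i] *= 2; });
}
```

The chunk size here is fixed by the thread count. A tuner such as the `set_chunk_size` entry in the table above exists precisely because that choice trades load balance against synchronization overhead.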
  13. Tasks: Fundamental unit of asynchrony
      • Independent units of work that can be executed asynchronously on CPU, GPU, DSP
      • Computation bound with data
        ◦ Control: C++ lambdas & functions, Kernel, Patterns, …
        ◦ Data: Buffers, function arguments, …
      • Easy task management
      • Groups bundle a set of related tasks
      [Diagram: CPU Task, GPU Task and DSP Task, each combining Control and Data]
  14. Tasks: Fundamental unit of asynchrony
      Task API | Description
      hetcompute::create_task | Creates a Heterogeneous Compute task
      t1->then(t2) | Control dependency from t1 to t2
      t2->bind_all(t1) | Data dependency from t1 to t2
      t->launch() | Launches a task into the Heterogeneous Compute Runtime
      t->wait_for() | Waits for the task to complete (blocking call)
      t->cancel() | Cancels a launched task. Should be used with hetcompute::abort_on_cancel() to cancel a running task.
  15. Tasks code sample: Create a task with a CPU Kernel
  16. Tasks code sample: Launch the task with the HetCompute Runtime
  17. Tasks code sample: Wait for task completion
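The three code samples on slides 15 to 17 are missing from this transcript. The create, launch, wait sequence they describe can be sketched with standard C++11 as below. `simple_task` is a hypothetical class built on `std::packaged_task` and `std::thread`; its `launch()` and `wait_for()` only borrow the method names from the table above and are not the hetcompute task API.

```cpp
#include <cassert>
#include <future>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical task handle with the three steps from the slides:
// create (constructor), launch, wait.
class simple_task {
public:
    explicit simple_task(std::packaged_task<void()> work)
        : work_(std::move(work)), done_(work_.get_future()) {}

    // Starts the work on a new thread and returns immediately.
    void launch() {
        assert(!worker_.joinable());  // a task may be launched only once
        worker_ = std::thread(std::move(work_));
    }

    // Blocks until the launched work has finished.
    // Calling this before launch() would block forever.
    void wait_for() {
        done_.wait();
        if (worker_.joinable()) worker_.join();
    }

    ~simple_task() {
        if (worker_.joinable()) worker_.join();
    }

private:
    std::packaged_task<void()> work_;  // declared before done_: init order matters
    std::future<void> done_;
    std::thread worker_;
};
```

A caller would construct the task from a lambda, for example `simple_task t(std::packaged_task<void()>([&v] { for (int& x : v) x *= 2; }));`, then call `t.launch()` and later `t.wait_for()`. Separating launch from wait is what lets the caller do other work while the task runs.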
  18. Buffers: Heterogeneous memory management
      • Managed array-like data store for user-defined data types
      • Abstracts specialized memory for OpenGL, OpenCL, textures, ION
      • Accessible by CPU, GPU, DSP tasks and the host application
      • APIs move and synchronize data across compute cores efficiently
      [Diagram: CPU Task, GPU Task and DSP Task (Control) sharing Buffers, with host buffer access]
  19. Buffers: Key APIs
      Buffer API | Description
      hetcompute::create_buffer<T> | Creates a Heterogeneous Compute buffer. Supports different variants: preallocated memory, ION/GL/CL memory
      hetcompute::buffer_ptr<T> | Smart pointer to a managed buffer; has a std::array-like interface
      acquire_ro() | Acquires the buffer with read-only access. To be used by application host code
      acquire_wi() | Acquires the buffer with write-invalidate access. If successful, the previous contents of the buffer are lost. To be used by application host code
      acquire_rw() | Acquires the buffer with read-write access. To be used by application host code
      release() | Releases the acquired buffer
  20. Buffers code sample: Create a buffer of int
  21. Buffers code sample: Acquire the buffer in the application and fill it with data
  22. Buffers code sample: Create a CPU kernel task and bind the buffer data
  23. Buffers code sample: Launch the task. A task always has read-write access over the buffer
  24. Buffers: Heterogeneous memory management
      Memory | CPU | GPU | DSP
      Host Memory | Yes | No | No
      OpenCL/GL host-accessible | Yes | Yes | No
      ION | Yes | Yes | Yes
      Using ION memory as the backing store can improve performance (avoids a copy).
      [Diagram: Host Memory, OpenCL/GL host-accessible memory and ION, shared among big CPU, LITTLE CPU, GPU and DSP]
  25. Power Optimization SDK
  26. Approaches to power management: Reactive vs Proactive
      Reactive model
      • Standard system power management
      • Acceptable for many use cases
      • Generic solution, leaving opportunity for power optimization with some algorithms
      Proactive model
      • Developer-driven power management
      • Control power consumption during algorithm execution
      • Developer understanding of algorithm and system can lead to additional power optimization opportunity
      [Diagram: in both models the Application sends its Workload to the System, which makes the Power/Thermal Adjustment; the proactive model adds a Direct Recommendation]
  27. Power Optimization SDK: Run-time power and performance control for CPU and GPU
      • APIs to provide granular control of core frequencies
      • Developers request power control for their algorithm
      • Requests are subject to system constraints
        ◦ Does not override the system
        ◦ Interfaces with Perflock
      • Static and Dynamic power management APIs for CPU and GPU
      [Diagram: Userspace Application, Power Optimization SDK (Static, Dynamic), Perflock, Snapdragon (CPU, GPU, DSP)]
  28. Power Optimization SDK: Static APIs
      • One API call to control CPU and GPU clock frequency
      • Choose one of 5 predefined modes
      • Define the duration the mode should be active
      • Target the device (big CPU, LITTLE CPU, GPU)
  29. Using the Power Optimization SDK: Static APIs
      Power Mode | Description
      Normal | Default system state
      Efficient | Close to best performance with power savings
      Performance Burst | All cores at max frequency for a short duration
      Saver | Half of peak performance
      Window | Set minimum and maximum frequency window
  30. Power Optimization SDK, static API code sample: Set the big cluster to operate between 50% and 60% of the max frequency index
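The slide's code is not in this transcript, and the slides do not say how a percentage of the "max frequency index" maps onto real frequencies. One plausible reading, offered here only as an assumption, is that each cluster exposes a table of frequency levels and the window selects a range of indices into it. `frequency_index_window` below is a hypothetical helper illustrating that arithmetic; it is not an SDK function.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Maps a percentage window onto indices of a frequency table with
// `num_levels` entries (index 0 = lowest level, num_levels - 1 = highest).
// Hypothetical helper: the SDK's actual mapping is not given in the slides.
std::pair<std::size_t, std::size_t> frequency_index_window(std::size_t num_levels,
                                                           unsigned min_percent,
                                                           unsigned max_percent) {
    assert(num_levels > 0);
    assert(min_percent <= max_percent && max_percent <= 100);
    const std::size_t max_index = num_levels - 1;
    // Integer division rounds down to the nearest table entry.
    const std::size_t lo = max_index * min_percent / 100;
    const std::size_t hi = max_index * max_percent / 100;
    return std::make_pair(lo, hi);
}
```

With an imaginary 21-entry table, a 50% to 60% window selects indices 10 through 12. Whatever the real mapping is, such a request remains subject to the system constraints described on slide 27.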
  31. Power Optimization SDK: Dynamic APIs (Experimental)
      • Self-regulates performance while trying to minimize energy consumption
      • Real-time applications: games, streaming, video
      • Currently supported only on the big cluster
  32. Using the Power Optimization SDK: Dynamic APIs (Experimental)
      API | Description
      set_goal() | Starts the automatic performance/power regulation mode
      regulate() | Application feedback to the SDK, which the SDK uses to self-regulate the system so that it meets the performance goal while saving power
      clear_goal() | Terminates the regulation process
  33. Power Optimization SDK, dynamic API code sample (Experimental): Goal is the number of elements the application wants to process in a millisecond
  34. Power Optimization SDK, dynamic API code sample (Experimental): Application processing; track the number of elements processed per millisecond
  35. Power Optimization SDK, dynamic API code sample (Experimental): Allow the API to make adjustments based on the number of elements being processed per millisecond
  36. Power Optimization SDK, dynamic API code sample (Experimental): Put the system back into the Normal state
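The code for slides 33 to 36 is missing from this transcript. The flow they describe (set a goal in elements per millisecond, report the measured rate, let the regulator adjust, then clear the goal) is a feedback loop, and a toy version can be sketched as follows. `goal_regulator` is entirely hypothetical: it only moves an index into an imaginary frequency table, uses an arbitrary 10% tolerance, and says nothing about how the SDK actually regulates.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical regulator mirroring the set_goal / regulate / clear_goal flow.
// It does not touch real hardware.
class goal_regulator {
public:
    explicit goal_regulator(std::size_t num_levels)
        : max_level_(0), level_(0), goal_(0.0), active_(false) {
        assert(num_levels > 0);
        max_level_ = num_levels - 1;
        level_ = max_level_;  // starting level: highest frequency (an assumption)
    }

    // Starts regulation toward `elements_per_ms`.
    void set_goal(double elements_per_ms) {
        assert(elements_per_ms > 0.0);
        goal_ = elements_per_ms;
        active_ = true;
    }

    // Feedback step: raise the level when the measured rate is below the
    // goal; lower it when the rate exceeds the goal by more than `tolerance`.
    void regulate(double measured_per_ms, double tolerance = 0.10) {
        if (!active_) return;
        if (measured_per_ms < goal_ && level_ < max_level_) {
            ++level_;
        } else if (measured_per_ms > goal_ * (1.0 + tolerance) && level_ > 0) {
            --level_;
        }
    }

    // Stops regulation and returns to the starting level.
    void clear_goal() {
        active_ = false;
        level_ = max_level_;
    }

    std::size_t level() const { return level_; }

private:
    std::size_t max_level_;
    std::size_t level_;
    double goal_;
    bool active_;
};
```

The design point is that the application supplies the measurement, since only it knows what "an element" is; the regulator then trades surplus performance for a lower frequency level.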
  37. Power improvement case study: Using the Heterogeneous Compute SDK and Power Optimization SDK to improve power
  38. Lowering power consumption
      • Heterogeneous execution is key for managing power/thermal
      • Using more cores and lowering their frequency can get the same performance and consume less power
  39. Case study: Find all primes under 10 million. Sequential variant
  40. Case study: Find all primes under 10 million. Sequential variant: is_prime already has some optimizations, such as skipping even numbers and running only through sqrt(n)
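The sequential code on these slides is not in the transcript. A minimal sketch with the two optimizations the slide names (skip even numbers, stop at sqrt(n)) could look like this. The name `is_prime` is from the slide; its signature and `count_primes_sequential` are assumptions.

```cpp
#include <cstdint>

// Trial division that skips even numbers and stops once d * d exceeds n,
// which is equivalent to testing divisors only up to sqrt(n).
bool is_prime(std::uint32_t n) {
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    // The 64-bit product avoids overflow for n close to 2^32.
    for (std::uint32_t d = 3; static_cast<std::uint64_t>(d) * d <= n; d += 2) {
        if (n % d == 0) return false;
    }
    return true;
}

// Counts the primes in [0, limit), one number at a time on a single thread.
std::uint32_t count_primes_sequential(std::uint32_t limit) {
    std::uint32_t count = 0;
    for (std::uint32_t n = 0; n < limit; ++n) {
        if (is_prime(n)) ++count;
    }
    return count;
}
```

For reference, there are 664,579 primes below 10 million, so `count_primes_sequential(10000000)` should return that value. Each call to `is_prime` is independent of the others, which is what makes the outer loop a candidate for parallelization.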
  41. Case study: Find all primes under 10 million. Using the profiler. Metrics tracked: # of Cores, CPU Utilization, CPU Frequency, Processing Time, CPU Power
  42. Case study: Find all primes under 10 million. Sequential variant
      # of Cores | 1
      CPU Utilization | 100%
      CPU Frequency | Max (1.9 GHz)
      Processing Time | 34 seconds
      CPU Power | 125 mW
  47. Case study: Find all primes under 10 million. What can we parallelize?
  48. Case study: Find all primes under 10 million. What can we parallelize? The iterative loop: pfor_each can be used to parallelize it
  49. Case study: Find all primes under 10 million. Parallel variant: a simple parallel version; work distribution between big/LITTLE and chunk size could still be improved
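The parallel code from the slide is not in this transcript, and the original uses the SDK's pfor_each. As a stand-in, the sketch below parallelizes the same loop with `std::thread` and a shared atomic counter from which workers claim fixed-size chunks. `count_primes_parallel` and its `chunk_size` parameter are hypothetical; the dynamic chunking is this sketch's choice, not a description of what pfor_each does internally.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Same trial division as the sequential variant: skip evens, stop at sqrt(n).
bool is_prime(std::uint32_t n) {
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    for (std::uint32_t d = 3; static_cast<std::uint64_t>(d) * d <= n; d += 2) {
        if (n % d == 0) return false;
    }
    return true;
}

// Counts primes in [0, limit) using `num_threads` workers. Each worker
// repeatedly claims the next chunk of `chunk_size` numbers, so faster
// workers naturally take on more chunks.
std::uint32_t count_primes_parallel(std::uint32_t limit, unsigned num_threads,
                                    std::uint32_t chunk_size = 4096) {
    assert(num_threads > 0 && chunk_size > 0);
    std::atomic<std::uint64_t> next(0);   // start of the next unclaimed chunk
    std::atomic<std::uint32_t> total(0);  // sum of all per-worker counts
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&next, &total, limit, chunk_size] {
            std::uint32_t local = 0;
            for (;;) {
                const std::uint64_t begin = next.fetch_add(chunk_size);
                if (begin >= limit) break;
                const std::uint64_t end =
                    std::min<std::uint64_t>(limit, begin + chunk_size);
                for (std::uint64_t n = begin; n < end; ++n) {
                    if (is_prime(static_cast<std::uint32_t>(n))) ++local;
                }
            }
            total.fetch_add(local);  // one atomic update per worker
        });
    }
    for (std::thread& w : workers) w.join();
    return total.load();
}
```

Chunk size is the tuning knob the slide alludes to: small chunks balance load better between faster and slower cores, while large chunks reduce contention on the shared counter.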
  50. Case study: Find all primes under 10 million. Parallel variant
      # of Cores | 8
      CPU Utilization | 100%
      CPU Frequency | Max (1.90/2.36 GHz)
      Processing Time | 6.2 seconds
      CPU Power | 281 mW
  55. Case study: Find all primes under 10 million. Compare sequential and parallel variants
      Metric | Sequential | Parallel
      # of Cores | 1 | 8
      CPU Utilization | 100% | 100%
      CPU Frequency | Max (1.90 GHz) | Max (1.90/2.36 GHz)
      Processing Time | 34 sec | 6.2 sec (82%)
      CPU Power | 125 mW | 281 mW (55%)
  56. Case study: Find all primes under 10 million. Parallel variant: can we use the Power SDK to fine-tune power consumption?
  57. Case study: Find all primes under 10 million. Parallel variant with power tuning. Goal: max power savings. Request that the big and LITTLE clusters run at 0-15% of max frequency
  58. Case study: Find all primes under 10 million. Parallel variant with power tuning
      # of Cores | 8
      CPU Utilization | 100%
      CPU Frequency | Min (512/652 MHz)
      Processing Time | 26 seconds
      CPU Power | 82 mW
  63. Case study: Find all primes under 10 million. Comparison chart (recap): choose the optimal power-performance
      Metric | Sequential | Parallel | Parallel with Power SDK (min freq)
      # of Cores | 1 | 8 | 8
      CPU Utilization | 100% | 100% | 100%
      CPU Frequency | Max (1.90 GHz) | Max (1.90/2.36 GHz) | Min (512/652 MHz)
      Processing Time | 34 sec | 6.2 sec (82%) | 26 sec (23%)
      CPU Power | 125 mW | 281 mW (55%) | 82 mW (34%)
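The comparison reports power and time separately. Multiplying them gives the energy of each run, which the slides do not state. The sketch below does that arithmetic under the assumption that the CPU power figures are averages over the whole run. It also shows where the percentages appear to come from: 82%, 23% and 34% match reductions relative to the sequential run, while 55% matches (281 - 125) / 281, that is, the sequential power relative to the parallel power. Both helper names are hypothetical.

```cpp
#include <cassert>

// Energy in joules from average power (milliwatts) and duration (seconds).
double energy_joules(double power_mw, double seconds) {
    assert(power_mw >= 0.0 && seconds >= 0.0);
    return power_mw / 1000.0 * seconds;
}

// Fractional reduction of `value` relative to `baseline`
// (0.82 means `value` is 82% lower than `baseline`).
double reduction(double baseline, double value) {
    assert(baseline > 0.0);
    return (baseline - value) / baseline;
}
```

Applied to the figures in the chart, `energy_joules(125, 34)` is 4.25 J for the sequential run, `energy_joules(281, 6.2)` is about 1.74 J for the parallel run, and `energy_joules(82, 26)` is about 2.13 J for the power-tuned run. Under the averaging assumption, both parallel variants use less CPU energy than the sequential one; the power-tuned variant has the lowest power draw, which is the quantity that matters for thermal limits.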
  66. Lowering power consumption: Strategy for power savings
      • Using more cores and lowering their frequency allows us to get the same performance with lower energy
      • Choosing the right compute device is the key to lowering power
        ◦ big/LITTLE/GPU/DSP
      • Strategy to reduce power while maintaining performance
        ◦ Extract parallelism
        ◦ Control placement of algorithm execution onto the right device
        ◦ Power tuning using the Power SDK
  67. Thank you! For more information, visit us at: www.qualcomm.com & www.qualcomm.com/blog
      Nothing in these materials is an offer to sell any of the components or devices referenced herein. ©2018 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved. Qualcomm, Snapdragon and Hexagon are trademarks of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may be trademarks or registered trademarks of their respective owners. References in this presentation to “Qualcomm” may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes Qualcomm’s licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm’s engineering, research and development functions, and substantially all of its product and services businesses, including its semiconductor business, QCT.
