
Petapath HP Cast 12 - Programming for High Performance Accelerated Systems



  1. Programming for High Performance Accelerated Systems (Dairsie Latimer and Michal Harasimiuk, Petapath)
  2. Petapath
  3. Petapath
  4. Petapath: Joint Petapath/HP PRACE WP8 prototype system at SARA/NCF
  5. Petapath: Joint Petapath/HP PRACE WP8 prototype system at SARA/NCF (6U, 10 TFLOPS, 7 kW)
  6. Petapath
      - 20 racks, 1.125 PFLOPS, end of 2009
      - 500 kW
      - Alternative systems: 15x the size, 8x the power
  7. Programming for High Performance Accelerated Systems
      - Overview of the development environment at SARA/NCF
      - Options for programming heterogeneous systems
      - Moving software development flows from multi-core to heterogeneous systems
      - Developing with OpenCL going forward
  8. Petapath: Petapath/HP PRACE prototype system at SARA/NCF
  9. ClearSpeed software development environment at SARA/NCF
      - ClearSpeed SDK version 3.1
        - Binary compatible across all ClearSpeed-based products
      - Cn optimising compiler
        - C with poly extensions for SIMD data types
      - Debugger: a port of gdb
        - Runs on hardware
      - Profiler: csprof
        - Allows system-wide visualization of an accelerated application's performance while running on both a multi-core host and ClearSpeed accelerators
      - Libraries (BLAS, RNG and FFT) and high-level APIs (CSPX)
  10. ClearSpeed graphical debug interface for heterogeneous systems
      - Standard Eclipse graphical debug interface for CSX processors
      - CSX processors provide full hardware debugging of running application code
      - Provides a seamless view of many processor cores in parallel with their associated state
      - Allows full symbolic debug of the Cn language
      - Enhanced views for CSX-specific information
      (Images used with permission of ClearSpeed Technology Plc)
  11. ClearSpeed profiler for heterogeneous and multi-processor systems
      [Diagram: host CPUs connected over PCIe to Advance™ accelerator boards with CSX 600 pipelines, with profiling views at each level]
      - Host/board interaction: view host/board interactions; performance information for data transfer operations; trace cluster node/board interaction; see overlap of host compute and board compute
      - CSX pipeline: view detailed instruction issue information; visualize overlap of executing instructions; optimize code at the instruction level; view instruction-level performance bottlenecks; get accurate instruction timing
      - CSX system: view system-level trace; visually inspect the overlap of compute and I/O; visualize cache utilization; view branch trace of executing code; find and analyse performance bottlenecks; get accurate event timing
      - Host code profiling: visually inspect host code executing; supports multiple threads and processes; time specific code sections; see overlap of host threads executing; platform- and processor-agnostic trace collection
  12. Petapath: Programming for High Performance Accelerated Systems
  13. Programming for High Performance Accelerated Systems: Introduction
      - Heterogeneous systems are now increasingly common
      - They are being adopted at the top (Top500) and the bottom (technical workstation) of the HPC market
      - Acceleration can deliver significant performance and cost savings over traditional COTS HPC systems
      - However, there are real barriers to adoption:
        - Software support and programming models
        - Host system requirements
  14.
      - In order to take advantage of this new technology trend, what are the realistic options?
      - Some important things to consider:
        - Single- or multi-use system?
        - Where do the majority of the cycles go?
        - ISV codes or open source/custom codes?
          - Sufficient development resources?
  15.
      - Starting with application source, what is the best way to target heterogeneous computing today?
      - Proprietary development environments and hardware:
        - Advance™/Cn (ClearSpeed)
        - Tesla™/CUDA (NVIDIA)
        - Stream™/Brook+ (AMD)
        - FPGA-based solutions
      - Or via third party/middleware:
        - RapidMind™ Platform
        - CAPS HMPP™
        - PGI's x64+GPU Accelerator Model
        - e.g. Mitrion Development Platform for FPGAs
  16.
      - These options can loosely be categorised into:
      - Language
        - Cn, CUDA, Brook+, Mitrion-C, OpenCL
      - Directive-based or hybrid approaches
        - PGI x64+GPU, CAPS HMPP, RapidMind
          - Allow re-targetable support
          - Can potentially support multiple vendor development environments
  17.
      - Library
        - All the languages have a library component
          - Manages hardware resources and runtime interaction
        - Can also provide higher-level abstractions such as standard library support, e.g. BLAS or LAPACK
        - Some third-party libraries are designed to transparently interface ISV applications to accelerator hardware
        - Often the best implementations are available from the vendors themselves
  18. What comes next?
      - Industry will inevitably move towards available open standards
      - We believe that the Khronos Group's OpenCL™ (Open Computing Language) will be a key enabler in the wider adoption of heterogeneous systems
      - Petapath are members of the Khronos Group and participants in the OpenCL working group
  19. OpenCL
      - OpenCL is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems
      - OpenCL provides a uniform programming environment for software developers
        - Can write efficient, portable code for a range of high-performance systems and a diverse mix of multi-core and parallel processors
  20.
      - OpenCL consists of:
        - An API for coordinating parallel computation
        - A programming language for describing those computations
      - Specifically, the OpenCL standard defines:
        - A subset of the C99 language with extensions for parallelism
        - An API for coordinating data- and task-based parallel computations
        - Numerical requirements based on the IEEE 754 standard
        - Interoperability with other Khronos standards such as OpenGL
        - An abstraction layer for a diverse range of computational resources
  21.
      - OpenCL also specifies:
        - A rich set of built-in functions
        - Online or offline compilation and build of compute kernel executables
      - Platform layer API
        - Query, select and initialize compute devices
        - Create compute contexts and work-queues
      - Runtime API
        - Execute compute kernels
        - Manage scheduling, compute and memory resources
  22.
      - Is OpenCL a silver bullet?
        - Possibly not, but it's an excellent place to start
      - It's a well-supported open standard
        - OpenCL has complete cross-vendor support
        - Most vendors are motivated to increase their market share in the HPC and technical computing market
      - Write once, work on many platforms is attractive for ISVs
        - The lack of an open standard has certainly slowed adoption of support for heterogeneous systems outside of the academic community for compute-intensive applications
  23.
      - When will it be available?
        - The Khronos™ Group ratified the OpenCL™ 1.0 specification at SIGGRAPH Asia, December 9th, 2008
        - Conformant vendor implementations available in Q3 2009
          - One vendor already has a public beta program
          - Others will not be far behind
      - What are the principal reasons that make OpenCL attractive?
        - No reliance on proprietary programming languages
      - Cross-vendor compatibility and interoperability
      - Cross-platform support (Linux, Windows and OS X)
  24. Observations
      - The incentive to support heterogeneous systems has to be a clear business win, so companies who differentiate on innovation are more likely to adopt early
      - Many large ISVs have long development cycles, and if their licensing model is core- or socket-based they will have to revise their charging structures
      - Heterogeneous computing won't really hit the mainstream, multi-application HPC market without ISV support
  25. Petapath: Software development flows on multi-core and heterogeneous systems
  26. Host Software Development Practice (Single Core)
      - Typical host development flow (rinse, profile, repeat):
        - Start with a naïve implementation (e.g. the infamous triple loop)
        - Compile (compiler choice can often be important)
        - Profile/benchmark (use % of peak GFLOPS as a guide)
        - Throw some compiler switches
        - Repeat
      - Some developers don't get very far into this optimisation process
      - Time vs. reward (does it run fast enough yet?)
  27. Host Software Development Practice (Multi-core)
      - Look for more scalable implementations
        - In the multi-core era, look for algorithmically scalable solutions
        - This usually means looking to leverage architectural features
          - e.g. make sure you are cache-friendly and take advantage of vector/SIMD support
        - Compile, profile/benchmark (use % of peak GFLOPS as a guide)
        - Throw compiler switches, but also use compiler directives, e.g. OpenMP, which can require some changes to code
        - The parameter space for these optimisations can be large
        - Challenging even for the experienced
  28. Host Software Development Practice (Pitfalls)
      - The 'memory wall' is probably the biggest hurdle
        - With more cores sharing an already scarce resource in main memory bandwidth, cache hygiene is very important!
        - Once you fall out of cache, adding more cores can actually slow down your application
        - Effective programming is about optimising bandwidth
        - Tools such as Acumem's SlowSpotter™ are particularly useful!
      - We deliberately skip multi-node development, as it's a whole other subject and deserves its own track
  29. Heterogeneous Systems Software Development Practice
      - An implementation tuned for multi-core is a good starting point for porting to an accelerated system
        - This is because available concurrency (via multi-threading and asynchronous operations) and data-parallel operations will likely have been explicitly exposed
          - In all but the most compute-bound applications, effective implementations of data-parallel problems are usually tuned to maximise cache bandwidth
          - And to allow effective loop blocking and strip-mining transformations
        - This set of optimisations provides a good template for developing an algorithm on an accelerated system
  30.
      - First ascertain what the limiting factors on the host are:
      - Bandwidth
        - Is your application bandwidth-limited on the host?
          - In its most cache/memory-friendly implementation, does it scale and exhibit good cache behaviour?
        - GPU-based accelerators have several times the bandwidth to their local memories of even the latest servers
          - However, accelerators typically have less local memory than the host, so large working sets will have to be streamed from the host
          - Any significant and repeated data movement to and from the accelerator can often be a gating factor for overall application acceleration
  31.
      - Compute
        - Is your application compute-limited?
          - Is it single precision?
            - Single precision is still the clear advantage for GPU-based accelerators
          - Does your application require double precision?
            - GPU-based accelerators have less of a delta over x86 hosts in terms of pure DP performance
            - Lower GFLOP/$ and GFLOP/W vs. implementation complexity
          - ClearSpeed has significant advantages in terms of GFLOP/W for applications needing double precision
  32. General comments on using accelerators
      - As for optimal multi-core development:
        - Make sure you are making the most of the architectural features
          - Occupancy vs. latency hiding
          - Shared or local memory accesses
            - Consider using other memories (constant, texture, etc.)
          - Make sure you are maximising external memory bandwidth
            - Correct alignment and granularity are vital
            - Must use coalesced memory accesses
  33. Accelerator Software Development Pitfalls
      - Pay attention to Amdahl's Law
        - Simply put, it describes the limit on potential acceleration of an application due to parallelisation
          - Applies equally to many multi-core implementations
        - As you process the data-parallel kernels faster, the data movement and other serial portions of the application start to dominate the actual runtime
          - At this point the host interface to the accelerator can become a bottleneck
  34. Petapath: The future - developing with OpenCL
  35. OpenCL in use
      - The Khronos Group's conformance requirements for OpenCL will endeavour to ensure correctness of implementation between vendors
      - A real challenge for those using OpenCL could well be managing the varying performance characteristics of different OpenCL-capable platforms
      - Even different products from the same vendor may vary
      - What works well on a multi-core CPU and what runs efficiently on a massively parallel accelerator will likely differ
  36. Will development methods and tools converge?
      - How similar is the heterogeneous development environment to traditional host development?
      - What tools are there to help the development process?
        - Do they all support a similar debug interface?
        - Do they all have similar profiling capabilities?
  37.
      - Debug
        - Hardware gdb support?
          - ClearSpeed supports source-level debug of Cn
          - NVIDIA in CUDA 2.1, Cell
          - Debug for Brook+ and pre-CUDA 2.1 was via host versions of kernels
      - Profiling
        - gprof (supported by ClearSpeed in Cn)
          - Host-API-only support for gprof with NVIDIA
        - Hardware profiling?
          - ClearSpeed has a very sophisticated profiling and debugging environment
          - Other profilers currently report a more limited set of information for kernels running on hardware
  38. What will OpenCL have initially?
      - Will these debug and profile tools support OpenCL out of the gate?
      - With an open development environment now available, it makes sense to develop cross-platform tools that support OpenCL natively and, more importantly, across multiple vendors and operating systems
      - Not having to use vendor-specific tools makes it less likely that developers will have to spend too much time tuning for each platform
  39. Architectures targeted by OpenCL are similar, but different: ClearSpeed CSX700
      (All image rights reserved by original copyright holders)
  40. NVIDIA GT200 (image rights reserved by original copyright holders)
  41. AMD RV770 (image rights reserved by original copyright holders)
  42. Intel Larrabee (image rights reserved by original copyright holders)
  43. Can we look forward to …
      - Additional utilities and development tools available to the host-based developer:
        - Intel® compilers, MKL, IPP, VTune, Threading Building Blocks, Thread Checker (and soon Parallel Studio)
        - AMD partner compilers, CodeAnalyst, ACML
        - Acumem SlowSpotter
        - Allinea tools
      - And a myriad of other third-party tools …
  44. Petapath: Questions?

Editor's notes

  • So with the scene set for our presentation, I'm going to talk a bit about the current state of the art in programming heterogeneous systems (with a summary of what will be used at SARA), as well as take a look at what the development flow for a heterogeneous system really looks like.
  • At SARA the system is based on ClearSpeed Technology hardware and has the full range of development tools and libraries available
  • The level of support offered by the ClearSpeed SDK for debugging, and especially profiling, is still well ahead of the best of the rest (for the moment). The host profiling API allows you to instrument even non-ClearSpeed-specific code and have it displayed in the profiler.
  • So let’s take a look at what makes heterogeneous systems interesting to the user and also some of the issues involved in programming them.
  • If it’s single use it’s much easier to justify the investment in time and money to get the benefits of acceleration If it’s multi-use then the cost benefit analysis is more complicated, but can still be swayed by an obvious imbalance in resource consumption. Are the codes yours, open source or closed source ISV applications? If you have source level access do you have the development expertise and resources?
  • So let’s put closed source applications to one side for a moment. If you have answered yes to “Do you have source access?” and “Do you have the development capabilities?” them, today you will have to decide on one of a number of proprietary development environments.
  • I include OpenCL here because of its similarity to existing languages and its imminent availability.
  • As with MKL, ACML, etc., IHVs will usually (but not always) get the best out of their hardware. The library approach is by far the easiest for the user because it carries the potential to provide acceleration for ISV applications, but there are a number of caveats, such as the requirement for the apps to use standard libraries (BLAS, LAPACK, FFTW, etc.) and dynamic linking (many do not, because static linking reduces the support burden). ClearSpeed has long provided a selection of Level 3 BLAS support and drop-in replacements for many of the most popular LAPACK routines. As you will see, the applicability and effectiveness of this approach is limited by the amount of data that gets moved around versus the compute required (in the case of DGEMM, that's n^3 compute to n^2 data).
  • Ok so we’ve established that proprietary solutions are not ideal for a number of reasons, but even then they have stimulated the interest of the research community and for some cases they still do provide compelling financial advantages to the user. Why do I say ‘inevitably’, well because the pull from both the developers and customers is there. Developers want to innovate, but not all are willing to be locked into single vendor deals for obvious reasons. OpenCL has gained enviable support in a very short period of time and Petapath are members of the Khronos Group and are actively participating on the OpenCL working group.
  • So what, for those of you who are not familiar with it, is OpenCL? It addresses a wide range of systems in a familiar way, very similar to the existing language and library support from a number of IHVs.
  • A very interesting point to note here is that OpenCL can also target multi-core systems. It does this by supporting the SIMD extensions to current x86 cores and exposing this parallelism to the developer in a single open API. It doesn't provide anything that OpenMP doesn't, apart from a single API and programming interface, but that is the huge benefit for developers.
  • Note that there can be multiple OpenCL compute devices in a single system. Initially this is likely to be the host multi-core backend and a single vendor’s accelerator but the potential is there for supporting multiple accelerators and incrementally accelerating your systems.
  • So this all sounds great, but when will I be able to use OpenCL? And since it's a 1.0 spec, shouldn't I wait to see what happens for a little bit?
  • Note that I said earlier that there could be multiple OpenCL-supported devices in a system. Interoperability between different vendors' implementations will be the key to this.
  • So having mapped out what people use today, and what standards we may have in the near future what does development on a heterogeneous system look like today?
  • Well if you’re here then there is probably a financial or scientific imperative to make the application run faster. HVs also provide optimised (BLAS, LAPACK, etc.) so use where you can Many compilers enable support for SSE2+ and auto-parallelisation Does it run fast enough yet? (Where can I go next if it doesn’t?)
  • The list of general and vendor specific tips is too long to go into here.
  • Wake up Tim I’m expecting a heckle here!
  • So will all vendors' hardware behave the same? How will the performance vary on different platforms?
  • (ClearSpeed has supported gdb for about four years.)
  • There are many tools that developers rely on for host development and I think that means there will be space for a thriving ecosystem of third party tools for OpenCL