Conflux: gpgpu for .net (en)

Andrei Varanovich

CONFLUX: GPGPU FOR .NET Eugene Burmako, 2010

Video cards: state of the art
Equipment: tens to hundreds of ALUs clocked at ~1 GHz
Peak performance: 1 SP TFLOPS, > 100 DP GFLOPS
API: random memory access, data structures, pointers, subroutines
API maturity: nearly four years, several generations of graphics processors

Video cards: programmer’s PoV
Modern GPU programming models (CUDA, AMD Stream, OpenCL, DirectCompute):
A parallel algorithm is defined by a pair: 1) a kernel (the loop iteration), 2) the iteration bounds.
The kernel is compiled by the driver.
The iteration bounds are used to create a grid of threads.
Input data is copied to video memory.
Execution gets kicked off.
The result is copied to main memory.
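To make the "kernel plus iteration bounds" pair concrete, here is a minimal CPU-only sketch in C++ that emulates the launch step. The names `ThreadCtx`, `launch1D` and `saxpyOnGrid` are hypothetical and do not come from any GPU API. A real driver would run the logical threads in parallel on the device, and the copies to and from video memory are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for CUDA's blockIdx / blockDim / threadIdx (1D only).
struct ThreadCtx {
    std::size_t blockIdx;
    std::size_t blockDim;
    std::size_t threadIdx;
    std::size_t globalIdx() const { return blockDim * blockIdx + threadIdx; }
};

using Kernel = std::function<void(const ThreadCtx&)>;

// Emulates a launch: the iteration bounds (gridDim x blockDim) become a grid
// of logical threads, and the kernel body runs once per thread.
// A GPU would run these iterations in parallel; here they run sequentially.
inline void launch1D(std::size_t gridDim, std::size_t blockDim, const Kernel& kernel) {
    for (std::size_t b = 0; b < gridDim; ++b)
        for (std::size_t t = 0; t < blockDim; ++t)
            kernel(ThreadCtx{b, blockDim, t});
}

// SAXPY expressed as kernel + bounds: y = a * x + y.
inline void saxpyOnGrid(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const std::size_t blockDim = 256;
    const std::size_t gridDim = (n + blockDim - 1) / blockDim;  // round up
    launch1D(gridDim, blockDim, [&](const ThreadCtx& ctx) {
        const std::size_t i = ctx.globalIdx();
        if (i < n) y[i] = a * x[i] + y[i];  // guard: the last block may overshoot n
    });
}
```

The grid size is rounded up so that every element is covered, which is why the kernel body needs the bounds check.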

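Example: SAXPY via CUDA

    // Kernel: each thread handles one element of Y = a * X + Y.
    __global__ void Saxpy(int n, float a, const float* X, float* Y) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) Y[i] = a * X[i] + Y[i];  // the last block may run past n
    }

    // Host side: X and Y are device buffers, hX and hY are host arrays of N floats.
    cudaMemcpy(X, hX, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(Y, hY, N * sizeof(float), cudaMemcpyHostToDevice);
    Saxpy<<<(N + 255) / 256, 256>>>(N, a, X, Y);  // (N + 255) / 256 blocks of 256 threads
    cudaMemcpy(hY, Y, N * sizeof(float), cudaMemcpyDeviceToHost);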

In fact
Brahma:
  Data structures: data-parallel array.
  Computations: C# expressions, LINQ combinators.
Accelerator v2:
  Data structures: data-parallel array.
  Computations: arithmetic operators, a number of predefined functions.
This does the trick for a lot of algorithms. But what if we’ve got branching or non-regular memory access?
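As an illustration of the kind of kernel this question refers to, consider sparse matrix-vector multiplication in CSR format. Each logical thread runs a loop whose bounds depend on the data, and it reads the input vector through an index array. This is a CPU-only C++ sketch: `CsrMatrix`, `spmvRow` and `spmv` are hypothetical names, and the outer loop stands in for the grid of threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compressed sparse row (CSR) storage; hypothetical helper type.
struct CsrMatrix {
    std::vector<std::size_t> rowStart;  // size rows + 1
    std::vector<std::size_t> col;       // column of each stored value
    std::vector<float> val;             // stored values
    std::size_t rows() const {
        assert(!rowStart.empty());
        return rowStart.size() - 1;
    }
};

// Kernel body for one logical thread (one row of the matrix).
inline float spmvRow(const CsrMatrix& m, const std::vector<float>& x, std::size_t row) {
    float sum = 0.0f;
    // Data-dependent loop bounds: rows hold different numbers of values.
    for (std::size_t k = m.rowStart[row]; k < m.rowStart[row + 1]; ++k) {
        assert(m.col[k] < x.size());
        sum += m.val[k] * x[m.col[k]];  // non-regular (indirect) read of x
    }
    return sum;
}

// Sequential stand-in for launching one thread per row.
inline std::vector<float> spmv(const CsrMatrix& m, const std::vector<float>& x) {
    std::vector<float> y(m.rows());
    for (std::size_t row = 0; row < m.rows(); ++row) y[row] = spmvRow(m, x, row);
    return y;
}
```

Neither the per-row loop nor the indirect read maps naturally onto element-wise operators over whole arrays, which is the gap the slide points at.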

Example: CUDA interop

    // Schematic: arguments are abbreviated compared to the real driver API.
    saxpy = @"__global__ void Saxpy(float a, float* X, float* Y)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        Y[i] = a * X[i] + Y[i];
    }";
    nvcuda.cuModuleLoadDataEx(saxpy);
    nvcuda.cuMemcpyHtoD(X, Y);
    nvcuda.cuParamSeti(a, X, Y);
    nvcuda.cuLaunchGrid(256, (N + 255) / 256);
    nvcuda.cuMemcpyDtoH(Y);

Conflux
Kernels are written in C#: data structures, local variables, branching, loops.

    float a;
    float[] x;
    [Result] float[] y;

    var i = GlobalIdx.X;
    y[i] = a * x[i] + y[i];

Conflux
Avoids explicit interop with unmanaged code, lets the programmer use native .NET data types.

    float[] x, y;
    var cfg = new CudaConfig();
    var kernel = cfg.Configure<Saxpy>();
    y = kernel.Execute(a, x, y);

How does it work?
Front end: decompiles C#.
AST transformer: inlines calls, destructures classes and arrays, maps intrinsics.
Back end: generates PTX (NVIDIA GPU assembler) and/or multicore IL.
Interop: binds to the nvcuda driver, which is capable of executing GPU assembler.

Current progress
http://bitbucket.org/conflux/conflux
Proof of concept. Capable of computing the hello world of parallel computations: matrix multiplication.
If we don’t take into account the [currently] high overhead incurred by JIT compilation, the idea works fine even with a naïve code generator: 1x CPU < 2x CPU << GPU.
Triple license: AGPL, exception for OSS projects, commercial.
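For reference, the matrix multiplication mentioned above can be sketched as a kernel in which each logical thread computes one element of the result. This is a CPU-only C++ sketch with hypothetical names (`matMulElement`, `matMul`), not code from the Conflux repository. Matrices are assumed to be stored row-major in flat arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Kernel body: thread (row, col) computes C[row][col] for A (n x k) times B (k x m).
inline float matMulElement(const std::vector<float>& a, const std::vector<float>& b,
                           std::size_t k, std::size_t m,
                           std::size_t row, std::size_t col) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < k; ++i)
        sum += a[row * k + i] * b[i * m + col];
    return sum;
}

// Sequential stand-in for a 2D grid of n x m threads.
inline std::vector<float> matMul(const std::vector<float>& a, const std::vector<float>& b,
                                 std::size_t n, std::size_t k, std::size_t m) {
    assert(a.size() == n * k && b.size() == k * m);
    std::vector<float> c(n * m);
    for (std::size_t row = 0; row < n; ++row)
        for (std::size_t col = 0; col < m; ++col)
            c[row * m + col] = matMulElement(a, b, k, m, row, col);
    return c;
}
```

Every output element is independent of the others, so the two outer loops are exactly the iteration bounds that a GPU launch would turn into a grid of threads.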

Future work
GPU-specific optimizations (e.g. diagonal stripes for optimizing bandwidth utilization of matrix transposition).
Polyhedral model for loop nest optimization (can be configured to fit specific levels and sizes of the memory hierarchy; there exist GPU-specific linear heuristics that optimize spatial and temporal locality).
Distributed execution (a new level of the memory hierarchy if we use the polyhedral model).
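One reading of "diagonal stripes" is diagonal block reordering for tiled transposition: blocks that are scheduled one after another are remapped to tiles along a diagonal, so that they do not all hit the same memory partition at once. The sketch below is a CPU-only C++ emulation for square matrices whose size is a multiple of the tile size. `kTile`, `diagonalBlock` and `transposeTiled` are hypothetical names, and the code shows only the index mapping, not the performance effect.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

constexpr std::size_t kTile = 4;  // hypothetical tile edge length

// Remaps a Cartesian block index (bx, by) to a diagonal one (square grids only).
// Returns {tileX, tileY}; the mapping is a bijection over the grid.
inline std::pair<std::size_t, std::size_t> diagonalBlock(std::size_t bx, std::size_t by,
                                                         std::size_t gridDim) {
    return {(bx + by) % gridDim, bx};
}

// Transposes a row-major n x n matrix tile by tile, visiting tiles in diagonal order.
inline std::vector<float> transposeTiled(const std::vector<float>& in, std::size_t n) {
    assert(n % kTile == 0 && in.size() == n * n);
    const std::size_t gridDim = n / kTile;
    std::vector<float> out(n * n);
    for (std::size_t by = 0; by < gridDim; ++by)
        for (std::size_t bx = 0; bx < gridDim; ++bx) {
            const std::pair<std::size_t, std::size_t> tile = diagonalBlock(bx, by, gridDim);
            // One logical thread per element of the tile.
            for (std::size_t y = tile.second * kTile; y < (tile.second + 1) * kTile; ++y)
                for (std::size_t x = tile.first * kTile; x < (tile.first + 1) * kTile; ++x)
                    out[x * n + y] = in[y * n + x];
        }
    return out;
}
```

Because the remapping visits every tile exactly once, the result is the same as a plain transposition; only the order in which tiles are processed changes.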

Conclusion
Conflux: GPGPU for .NET
http://bitbucket.org/conflux/conflux
eugene.burmako@confluxhpc.net
