GPUDirect RDMA and Green
Multi-GPU Architectures

GE Intelligent Platforms
Mil/Aero Embedded Computing



Dustin Franklin, GPGPU Applications Engineer
 dustin.franklin@ge.com
 443.310.9812 (Washington, DC)
What this talk is about
 • GPU Autonomy
 • GFLOPS/watt and SWaP
 • Project Denver
 • Exascale
Without GPUDirect
•   In a standard plug & play OS, the two drivers have separate DMA buffers in system memory
•   Three transfers to move data between I/O endpoint and GPU

[Diagram: I/O endpoint, PCIe switch, CPU, system memory (with separate I/O driver and GPU driver buffer spaces), GPU, and GPU memory; transfers 1-3 marked as below]
              1     I/O endpoint DMAs into system memory
              2     CPU copies data from the I/O driver's DMA buffer into the GPU driver's DMA buffer
              3     GPU DMAs from system memory into GPU memory
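
As a rough host-side illustration (a minimal sketch, not code from the deck: endpoint_dma_read() is a hypothetical stand-in for the I/O driver's receive call, while the CUDA calls are standard runtime API), the staged path amounts to:

    /* Minimal sketch of the non-GPUDirect path. Assumes a fixed block size. */
    #include <cuda_runtime.h>
    #include <string.h>

    extern void endpoint_dma_read(void *dst, size_t bytes);   /* hypothetical I/O driver call */

    void ingest_block_staged(void *io_drv_buf, size_t bytes, float *d_gpu_buf)
    {
        static void *cuda_staging = NULL;

        /* 1. I/O endpoint DMAs into its driver's buffer in system memory */
        endpoint_dma_read(io_drv_buf, bytes);

        /* 2. CPU copies from the I/O driver buffer into the CUDA driver's pinned buffer */
        if (!cuda_staging)
            cudaHostAlloc(&cuda_staging, bytes, cudaHostAllocDefault);
        memcpy(cuda_staging, io_drv_buf, bytes);

        /* 3. GPU DMAs from system memory into GPU memory */
        cudaMemcpy(d_gpu_buf, cuda_staging, bytes, cudaMemcpyHostToDevice);
    }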
GPUDirect RDMA
•   I/O endpoint and GPU communicate directly; only one transfer is required
•   Traffic stays on the PCIe switch; no CPU involvement in the DMA
•   An x86 CPU is still required in the system to run the NVIDIA driver




[Diagram: I/O endpoint DMAs through the PCIe switch directly into GPU memory (transfer 1); the CPU and system memory are not in the data path]
              1     I/O endpoint DMAs into GPU memory
Endpoints
•   GPUDirect RDMA is flexible and works with a wide range of existing devices
     –   Built on open PCIe standards
     –   Any I/O device that has a PCIe endpoint and DMA engine can utilize GPUDirect RDMA
     – GPUDirect permeates both the frontend ingest and backend interconnects

                 •   FPGAs
                 •   Ethernet / InfiniBand adapters
                 •   Storage devices
                 •   Video capture cards
                 •   PCIe non-transparent (NT) ports


•   It’s free.
     –   Supported in CUDA 5.0 and Kepler
     –   Users can leverage APIs to implement RDMA with 3rd-party endpoints in their system


•   Practically no integration required
     –   No changes to device HW
     –   No changes to CUDA algorithms
     –   I/O device drivers need to use the GPU's DMA addresses instead of SYSRAM pages (see the sketch below)
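
A minimal user-space sketch of what that registration can look like, assuming a hypothetical ioctl exposed by the 3rd-party endpoint's driver (the struct, ioctl number, and endpoint_fd are made up for illustration; on the kernel side the driver would use NVIDIA's GPUDirect RDMA interface to translate the GPU virtual address into DMA addresses):

    #include <cuda_runtime.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    /* Hypothetical ioctl interface of the 3rd-party endpoint driver */
    struct pin_gpu_buf_args { uint64_t gpu_vaddr; uint64_t bytes; };
    #define MY_IOCTL_PIN_GPU_BUFFER _IOW('g', 1, struct pin_gpu_buf_args)

    int register_gpu_buffer(int endpoint_fd, size_t bytes, void **d_buf_out)
    {
        void *d_buf = NULL;
        if (cudaMalloc(&d_buf, bytes) != cudaSuccess)    /* ordinary CUDA device allocation */
            return -1;

        struct pin_gpu_buf_args args = {
            .gpu_vaddr = (uint64_t)(uintptr_t)d_buf,
            .bytes     = bytes,
        };

        /* Hand the GPU virtual address to the endpoint's driver; the driver pins
           the pages and programs its DMA engine with the resulting PCIe addresses
           instead of system-memory pages. */
        if (ioctl(endpoint_fd, MY_IOCTL_PIN_GPU_BUFFER, &args) != 0)
            return -1;

        *d_buf_out = d_buf;
        return 0;
    }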
Frontend Ingest
[IPN251 block diagram: ICS1572 XMC (Xilinx Virtex-6 FPGA + ADCs, RF signal input), Intel Ivy Bridge quad-core with 8 GB DDR3, 384-core NVIDIA Kepler GPU with 2 GB GDDR5, and Mellanox ConnectX-3 (InfiniBand / 10GigE), all connected through a PCIe switch: FPGA at gen1 x8, GPU at gen3 x16, ConnectX-3 at gen3 x8]

     DMA Size        DMA latency (µs)                        DMA throughput (MB/s)
                  no RDMA       with RDMA        Δ          no RDMA       with RDMA

       16 KB       65.06 µs       4.09 µs     ↓15.9X        125 MB/s     2000 MB/s
       32 KB       77.38 µs       8.19 µs      ↓9.5X        211 MB/s     2000 MB/s
       64 KB      124.03 µs      16.38 µs      ↓7.6X        264 MB/s     2000 MB/s
      128 KB      208.26 µs      32.76 µs      ↓6.4X        314 MB/s     2000 MB/s
      256 KB      373.57 µs      65.53 µs      ↓5.7X        350 MB/s     2000 MB/s
      512 KB      650.52 µs     131.07 µs      ↓5.0X        402 MB/s     2000 MB/s
     1024 KB     1307.90 µs     262.14 µs      ↓4.9X        400 MB/s     2000 MB/s
     2048 KB     2574.33 µs     524.28 µs      ↓4.9X        407 MB/s     2000 MB/s
Pipeline with GPUDirect RDMA
[Timing diagram: successive data blocks stream through four overlapped stages (FPGA ADC, FPGA DMA, GPU CUDA, GPU DMA); each stage works on one block while the preceding stage fills the next]

        FPGA DMA    Transfer block directly to GPU via PCIe switch
        GPU CUDA    CUDA DSP kernels (FIR, FFT, etc.)
        GPU DMA     Transfer results to next processor
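
A minimal sketch of how that per-block overlap might be driven from the host, assuming a hypothetical wait_for_fpga_block() that returns once the FPGA has RDMA'd a block into the given device buffer, and a user-supplied DSP kernel (names and launch geometry are illustrative only):

    #include <cuda_runtime.h>

    #define NBUF 4                                      /* ring of device buffers */

    extern void wait_for_fpga_block(int idx, float *d_in);          /* hypothetical */
    __global__ void dsp_kernel(const float *in, float *out, int n); /* user DSP code (FIR, FFT, etc.) */

    void run_pipeline(float *d_in[NBUF], float *d_out[NBUF],
                      float *h_out[NBUF], int n, int nblocks)
    {
        cudaStream_t stream[NBUF];
        for (int i = 0; i < NBUF; ++i)
            cudaStreamCreate(&stream[i]);

        for (int b = 0; b < nblocks; ++b) {
            int s = b % NBUF;
            cudaStreamSynchronize(stream[s]);            /* buffer s is free again */
            wait_for_fpga_block(b, d_in[s]);             /* FPGA -> GPU memory via RDMA */
            dsp_kernel<<<64, 256, 0, stream[s]>>>(d_in[s], d_out[s], n);
            cudaMemcpyAsync(h_out[s], d_out[s], n * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[s]);     /* ship results onward */
        }
        for (int i = 0; i < NBUF; ++i)
            cudaStreamSynchronize(stream[i]);
    }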
Pipeline without GPUDirect RDMA
[Timing diagram: blocks stream through five stages (FPGA ADC, FPGA DMA to system memory, GPU DMA from system memory, GPU CUDA, GPU DMA out); the extra staging hop adds a pipeline stage and delays every block]

        FPGA DMA    Transfer block to system memory
        GPU DMA     Transfer from system memory to GPU
        GPU CUDA    CUDA DSP kernels (FIR, FFT, etc.)
        GPU DMA     Transfer results to next processor
Backend Interconnects
•   Utilize GPUDirect RDMA across the network for low-latency IPC and system scalability (see the CUDA-aware MPI sketch below)




•   Mellanox OFED integration with GPUDirect RDMA – Q2 2013
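
As one way to picture the backend path, here is a hedged sketch: it assumes a CUDA-aware MPI stack built on the Mellanox OFED + GPUDirect RDMA integration mentioned above (e.g. MVAPICH2-GDR); a plain MPI would require staging through host buffers.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Exchange device-resident buffers with a peer rank. With a CUDA-aware,
       GPUDirect-RDMA-enabled MPI the HCA DMAs straight to/from GPU memory;
       the CPU and system memory stay out of the data path. */
    void exchange_gpu_buffers(float *d_send, float *d_recv, int count, int peer)
    {
        MPI_Sendrecv(d_send, count, MPI_FLOAT, peer, 0,
                     d_recv, count, MPI_FLOAT, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }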
Topologies
 •   GPUDirect RDMA works in many system topologies

      –   Single I/O endpoint and single GPU
      –   Single I/O endpoint and multiple GPUs
      –   Multiple I/O endpoints and single GPU
      –   Multiple I/O endpoints and multiple GPUs
          (each with or without a PCIe switch downstream of the CPU)


[Diagram: CPU and system memory above a PCIe switch; two I/O endpoints each stream directly into their own GPU and GPU memory, forming independent DMA pipelines #1 and #2]
Impacts of GPUDirect
 •   Decreased latency
      –   Eliminate redundant copies over PCIe + added latency from CPU
      –   ~5x reduction, depending on the I/O endpoint
      –   Perform round-trip operations on GPU in microseconds, not milliseconds




                                enables new CUDA applications

 •   Increased PCIe efficiency + bandwidth
      –   bypass system memory, MMU, root complex: limiting factors for GPU DMA transfers

 •   Decrease in CPU utilization
      –   CPU is no longer burning cycles shuffling data around for the GPU
      –   System memory is no longer being thrashed by DMA engines on endpoint + GPU
      –   Go from 100% core utilization per GPU to < 10% utilization per GPU




                                     promotes multi-GPU
Before GPUDirect…
•   Many multi-GPU systems were limited to two GPUs per node
•   Additional GPUs quickly choked system memory and CPU resources

•   Dual-socket CPU designs were very common
•   Dual root complexes prevent P2P across CPUs (QPI/HT is not traversable by PCIe peer-to-peer)


[Diagram: dual-socket system; each CPU has its own system memory and root complex, the I/O endpoint sits under one CPU, and one GPU sits under each CPU]

                                 GFLOPS/watt
                                Xeon E5      K20X     system
                      SGEMM        2.32     12.34       8.45
                      DGEMM        1.14      5.19       3.61
Rise of Multi-GPU
 •   Higher GPU-to-CPU ratios are permitted by the increased GPU autonomy from GPUDirect
 •   PCIe switches are integral to the design, for true CPU bypass and fully-connected P2P (see the peer-access sketch below)



[Diagram: CPU and system memory above a PCIe switch; one I/O endpoint and four GPUs all hang off the switch]

                                 GFLOPS/watt
                       GPU:CPU ratio    1 to 1    4 to 1
                         SGEMM            8.45     10.97
                         DGEMM            3.61      4.64
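
A minimal sketch of enabling that fully-connected P2P with the standard CUDA runtime (peer access is only reported as available when the GPUs can actually reach each other, e.g. under a common PCIe switch or root complex):

    #include <cuda_runtime.h>

    /* Enable GPU-to-GPU peer access so P2P DMA traffic stays on the PCIe switch. */
    void enable_p2p(void)
    {
        int n = 0;
        cudaGetDeviceCount(&n);
        for (int i = 0; i < n; ++i) {
            cudaSetDevice(i);
            for (int j = 0; j < n; ++j) {
                if (i == j) continue;
                int ok = 0;
                cudaDeviceCanAccessPeer(&ok, i, j);    /* topology-dependent */
                if (ok)
                    cudaDeviceEnablePeerAccess(j, 0);  /* flags must be 0 */
            }
        }
    }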
Nested PCIe switches
 •   Nested hierarchies work around the 96-lane limit of current PCIe switches


[Diagram: CPU and system memory above a top-level PCIe switch; two downstream PCIe switches each carry an I/O endpoint and four GPUs]

                                 GFLOPS/watt
                       GPU:CPU ratio    1 to 1    4 to 1    8 to 1
                         SGEMM            8.45     10.97     11.61
                         DGEMM            3.61      4.64      4.91
PCIe over Fiber
      •     SWaP is our new limiting factor
      •     PCIe over fiber-optic can interconnect expansion blades and sustain GPU:CPU ratio growth
      •     Supports PCIe gen3 over at least 100 meters

[Diagram: CPU and system memory above a PCIe switch, fanned out over PCIe-over-fiber links to four expansion blades (1-4); each blade has its own PCIe switch, an I/O endpoint, and four GPUs]

                                 GFLOPS/watt
                       GPU:CPU ratio    1 to 1    4 to 1    8 to 1    16 to 1
                         SGEMM            8.45     10.97     11.61      11.96
                         DGEMM            3.61      4.64      4.91       5.04
Scalability – 10 petaflops
                Processor Power Consumption for 10 petaflops
     GPU:CPU ratio       1 to 1        4 to 1        8 to 1       16 to 1
      SGEMM             1184 kW        911 kW        862 kW        832 kW
      DGEMM             2770 kW       2158 kW       2032 kW       1982 kW

                               Yearly Energy Bill
     GPU:CPU ratio       1 to 1        4 to 1        8 to 1       16 to 1
      SGEMM          $1,050,326      $808,148      $764,680      $738,067
      DGEMM          $2,457,266    $1,914,361    $1,802,586    $1,758,232

                              Efficiency Savings
     GPU:CPU ratio       1 to 1        4 to 1        8 to 1       16 to 1
      SGEMM                --          23.05%        27.19%        29.73%
      DGEMM                --          22.09%        26.64%        28.45%
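
These figures follow directly from the GFLOPS/watt tables on the previous slides. For example, at a 1:1 ratio: 10 PFLOPS / 8.45 GFLOPS/W ≈ 1,183 kW of processor power, and 1,184 kW × 8,760 h/yr ≈ 10.4 GWh, which matches the $1,050,326 bill at an electricity rate of roughly $0.10 per kWh (the rate is inferred, not stated in the deck).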
Road to Exascale

       GPUDirect RDMA      Integration with peripherals & interconnects for system scalability
       Project Denver      Hybrid CPU/GPU for heterogeneous compute
       Project Osprey      Parallel microarchitecture & fab process optimizations for power efficiency
       Software            CUDA, OpenCL, OpenACC; drivers & management layer; hybrid O/S
Project Denver
 •   64-bit ARMv8 architecture
 •   ISA by ARM, chip by NVIDIA
 •   ARM’s RISC-based approach aligns with NVIDIA’s perf/Watt initiative




 •   Unlike the licensed Cortex-A53 and Cortex-A57 cores, NVIDIA's cores are highly customized
 •   Design flexibility required for tight CPU/GPU integration
ceepie-geepie

[Block diagram of a "ceepie-geepie" SoC: four ARM cores with L2 cache, GPU SMs with their own L2 cache (NPN240 block), grid management + scheduler, PCIe root complex / endpoint, memory controllers, and on-chip h.264, crypto, SATA, USB, and DVI blocks]

ceepie-geepie




[Block diagram: two ceepie-geepie SoCs connected through a PCIe switch that also hosts an I/O endpoint; the first SoC exposes a PCIe root complex, the second a PCIe endpoint, each with its own ARM cores, SMs, grid management + scheduler, and memory controllers]
Impacts of Project Denver
 •   Heterogeneous compute
      –   share mixed work efficiently between CPU/GPU
      –   unified memory subsystem


 •   Decreased latency
      –   no PCIe required for synchronization + command queuing


 •   Power efficiency
      –   ARMv8 ISA
      –   on-chip buses & interconnects
      –   “4+1” power scaling


 •   Natively boot & run operating systems

 •   Direct connectivity with peripherals

 •   Maximize multi-GPU
Project Osprey
 •   DARPA PERFECT – 75 GFLOPS/watt
Summary
• System-wide integration of GPUDirect RDMA
• Increase GPU:CPU ratio for better GFLOPS/watt
• Utilize PCIe switches instead of dual-socket CPU/IOH


• Exascale compute – what’s good for HPC is good for embedded/mobile
• Denver & Osprey – what’s good for embedded/mobile is good for HPC



                           questions?

                            booth 201

Weitere ähnliche Inhalte

Was ist angesagt?

DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsAMD Developer Central
 
X13 Pre-Release Update featuring 4th Gen Intel® Xeon® Scalable Processors
X13 Pre-Release Update featuring 4th Gen Intel® Xeon® Scalable Processors X13 Pre-Release Update featuring 4th Gen Intel® Xeon® Scalable Processors
X13 Pre-Release Update featuring 4th Gen Intel® Xeon® Scalable Processors Rebekah Rodriguez
 
"Dynamically Reconfigurable Processor Technology for Vision Processing," a Pr...
"Dynamically Reconfigurable Processor Technology for Vision Processing," a Pr..."Dynamically Reconfigurable Processor Technology for Vision Processing," a Pr...
"Dynamically Reconfigurable Processor Technology for Vision Processing," a Pr...Edge AI and Vision Alliance
 
InfiniBand In-Network Computing Technology and Roadmap
InfiniBand In-Network Computing Technology and RoadmapInfiniBand In-Network Computing Technology and Roadmap
InfiniBand In-Network Computing Technology and Roadmapinside-BigData.com
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
Block I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktraceBlock I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktraceBabak Farrokhi
 
GlusterFs Architecture & Roadmap - LinuxCon EU 2013
GlusterFs Architecture & Roadmap - LinuxCon EU 2013GlusterFs Architecture & Roadmap - LinuxCon EU 2013
GlusterFs Architecture & Roadmap - LinuxCon EU 2013Gluster.org
 
The TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelThe TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelDivye Kapoor
 
Kvm performance optimization for ubuntu
Kvm performance optimization for ubuntuKvm performance optimization for ubuntu
Kvm performance optimization for ubuntuSim Janghoon
 
HPC Market Update and Observations on Big Memory
HPC Market Update and Observations on Big MemoryHPC Market Update and Observations on Big Memory
HPC Market Update and Observations on Big MemoryMemVerge
 
Slideshare - linux crypto
Slideshare - linux cryptoSlideshare - linux crypto
Slideshare - linux cryptoJin Wu
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 
DDN EXA 5 - Innovation at Scale
DDN EXA 5 - Innovation at ScaleDDN EXA 5 - Innovation at Scale
DDN EXA 5 - Innovation at Scaleinside-BigData.com
 
DMA Survival Guide
DMA Survival GuideDMA Survival Guide
DMA Survival GuideKernel TLV
 
GRAPHICS PROCESSING UNIT (GPU)
GRAPHICS PROCESSING UNIT (GPU)GRAPHICS PROCESSING UNIT (GPU)
GRAPHICS PROCESSING UNIT (GPU)self employed
 
Security Monitoring with eBPF
Security Monitoring with eBPFSecurity Monitoring with eBPF
Security Monitoring with eBPFAlex Maestretti
 

Was ist angesagt? (20)

DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
 
X13 Pre-Release Update featuring 4th Gen Intel® Xeon® Scalable Processors
X13 Pre-Release Update featuring 4th Gen Intel® Xeon® Scalable Processors X13 Pre-Release Update featuring 4th Gen Intel® Xeon® Scalable Processors
X13 Pre-Release Update featuring 4th Gen Intel® Xeon® Scalable Processors
 
"Dynamically Reconfigurable Processor Technology for Vision Processing," a Pr...
"Dynamically Reconfigurable Processor Technology for Vision Processing," a Pr..."Dynamically Reconfigurable Processor Technology for Vision Processing," a Pr...
"Dynamically Reconfigurable Processor Technology for Vision Processing," a Pr...
 
InfiniBand In-Network Computing Technology and Roadmap
InfiniBand In-Network Computing Technology and RoadmapInfiniBand In-Network Computing Technology and Roadmap
InfiniBand In-Network Computing Technology and Roadmap
 
Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
Block I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktraceBlock I/O Layer Tracing: blktrace
Block I/O Layer Tracing: blktrace
 
GlusterFs Architecture & Roadmap - LinuxCon EU 2013
GlusterFs Architecture & Roadmap - LinuxCon EU 2013GlusterFs Architecture & Roadmap - LinuxCon EU 2013
GlusterFs Architecture & Roadmap - LinuxCon EU 2013
 
Gpu
GpuGpu
Gpu
 
The TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux KernelThe TCP/IP Stack in the Linux Kernel
The TCP/IP Stack in the Linux Kernel
 
Kvm performance optimization for ubuntu
Kvm performance optimization for ubuntuKvm performance optimization for ubuntu
Kvm performance optimization for ubuntu
 
eBPF maps 101
eBPF maps 101eBPF maps 101
eBPF maps 101
 
HPC Market Update and Observations on Big Memory
HPC Market Update and Observations on Big MemoryHPC Market Update and Observations on Big Memory
HPC Market Update and Observations on Big Memory
 
Cuda
CudaCuda
Cuda
 
Slideshare - linux crypto
Slideshare - linux cryptoSlideshare - linux crypto
Slideshare - linux crypto
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
RDMA on ARM
RDMA on ARMRDMA on ARM
RDMA on ARM
 
DDN EXA 5 - Innovation at Scale
DDN EXA 5 - Innovation at ScaleDDN EXA 5 - Innovation at Scale
DDN EXA 5 - Innovation at Scale
 
DMA Survival Guide
DMA Survival GuideDMA Survival Guide
DMA Survival Guide
 
GRAPHICS PROCESSING UNIT (GPU)
GRAPHICS PROCESSING UNIT (GPU)GRAPHICS PROCESSING UNIT (GPU)
GRAPHICS PROCESSING UNIT (GPU)
 
Security Monitoring with eBPF
Security Monitoring with eBPFSecurity Monitoring with eBPF
Security Monitoring with eBPF
 

Andere mochten auch

Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
General Programming on the GPU - Confoo
General Programming on the GPU - ConfooGeneral Programming on the GPU - Confoo
General Programming on the GPU - ConfooSirKetchup
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeOfer Rosenberg
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...npinto
 
CSTalks - GPGPU - 19 Jan
CSTalks  -  GPGPU - 19 JanCSTalks  -  GPGPU - 19 Jan
CSTalks - GPGPU - 19 Jancstalks
 
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...AMD Developer Central
 
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...Storti Mario
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLinaro
 
Nervana and the Future of Computing
Nervana and the Future of ComputingNervana and the Future of Computing
Nervana and the Future of ComputingIntel Nervana
 
Open CL For Haifa Linux Club
Open CL For Haifa Linux ClubOpen CL For Haifa Linux Club
Open CL For Haifa Linux ClubOfer Rosenberg
 
GPU Technology Conference 2014 Keynote
GPU Technology Conference 2014 KeynoteGPU Technology Conference 2014 Keynote
GPU Technology Conference 2014 KeynoteNVIDIA
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Rob Gillen
 
Introduction to Deep Learning with Will Constable
Introduction to Deep Learning with Will ConstableIntroduction to Deep Learning with Will Constable
Introduction to Deep Learning with Will ConstableIntel Nervana
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Angela Mendoza M.
 
E-Learning: Introduction to GPGPU
E-Learning: Introduction to GPGPUE-Learning: Introduction to GPGPU
E-Learning: Introduction to GPGPUNur Ahmadi
 
Introduction to gpu architecture
Introduction to gpu architectureIntroduction to gpu architecture
Introduction to gpu architectureCHIHTE LU
 

Andere mochten auch (20)

Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
General Programming on the GPU - Confoo
General Programming on the GPU - ConfooGeneral Programming on the GPU - Confoo
General Programming on the GPU - Confoo
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
 
Cliff sugerman
Cliff sugermanCliff sugerman
Cliff sugerman
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
 
CSTalks - GPGPU - 19 Jan
CSTalks  -  GPGPU - 19 JanCSTalks  -  GPGPU - 19 Jan
CSTalks - GPGPU - 19 Jan
 
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
 
Gpgpu intro
Gpgpu introGpgpu intro
Gpgpu intro
 
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F...
 
Gpgpu
GpgpuGpgpu
Gpgpu
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience Report
 
Nervana and the Future of Computing
Nervana and the Future of ComputingNervana and the Future of Computing
Nervana and the Future of Computing
 
Open CL For Haifa Linux Club
Open CL For Haifa Linux ClubOpen CL For Haifa Linux Club
Open CL For Haifa Linux Club
 
GPU Technology Conference 2014 Keynote
GPU Technology Conference 2014 KeynoteGPU Technology Conference 2014 Keynote
GPU Technology Conference 2014 Keynote
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
 
Introduction to Deep Learning with Will Constable
Introduction to Deep Learning with Will ConstableIntroduction to Deep Learning with Will Constable
Introduction to Deep Learning with Will Constable
 
Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08Nvidia cuda tutorial_no_nda_apr08
Nvidia cuda tutorial_no_nda_apr08
 
E-Learning: Introduction to GPGPU
E-Learning: Introduction to GPGPUE-Learning: Introduction to GPGPU
E-Learning: Introduction to GPGPU
 
Introduction to gpu architecture
Introduction to gpu architectureIntroduction to gpu architecture
Introduction to gpu architecture
 
GPU Programming with Java
GPU Programming with JavaGPU Programming with Java
GPU Programming with Java
 

Ähnlich wie GPUDirect RDMA and Green Multi-GPU Architectures

Case Study: Porting Qt for Embedded Linux on Embedded Processors
Case Study: Porting Qt for Embedded Linux on Embedded ProcessorsCase Study: Porting Qt for Embedded Linux on Embedded Processors
Case Study: Porting Qt for Embedded Linux on Embedded Processorsaccount inactive
 
Power Mac G5 ( Late 2005) Technical Specifications
Power  Mac  G5 ( Late 2005)    Technical  SpecificationsPower  Mac  G5 ( Late 2005)    Technical  Specifications
Power Mac G5 ( Late 2005) Technical SpecificationsSocial Media Marketing
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrKohei KaiGai
 
ArcGIS Server a Brief Synopsis
ArcGIS Server a Brief SynopsisArcGIS Server a Brief Synopsis
ArcGIS Server a Brief Synopsisewug
 
Introducing the ADSP BF609 Blackfin Processors
Introducing the ADSP BF609 Blackfin ProcessorsIntroducing the ADSP BF609 Blackfin Processors
Introducing the ADSP BF609 Blackfin ProcessorsAnalog Devices, Inc.
 
Graphics Processing Unit by Saurabh
Graphics Processing Unit by SaurabhGraphics Processing Unit by Saurabh
Graphics Processing Unit by SaurabhSaurabh Kumar
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUsShree Kumar
 
Presentation power point (Advertising Upgrade))
Presentation power point (Advertising Upgrade))Presentation power point (Advertising Upgrade))
Presentation power point (Advertising Upgrade))andrew maybir
 
Advertising System Upgrade
Advertising System UpgradeAdvertising System Upgrade
Advertising System Upgradeandrew maybir
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Jeff Larkin
 
Sun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentationSun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentationxKinAnx
 
Exaflop In 2018 Hardware
Exaflop In 2018   HardwareExaflop In 2018   Hardware
Exaflop In 2018 HardwareJacob Wu
 
Stream Processing
Stream ProcessingStream Processing
Stream Processingarnamoy10
 
VMworld 2013: How Good is PCoIP - A Remoting Protocol Shootout
VMworld 2013: How Good is PCoIP - A Remoting Protocol ShootoutVMworld 2013: How Good is PCoIP - A Remoting Protocol Shootout
VMworld 2013: How Good is PCoIP - A Remoting Protocol ShootoutVMworld
 
Amd accelerated computing -ufrj
Amd   accelerated computing -ufrjAmd   accelerated computing -ufrj
Amd accelerated computing -ufrjRoberto Brandao
 
In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)Naoto MATSUMOTO
 

Ähnlich wie GPUDirect RDMA and Green Multi-GPU Architectures (20)

PG-Strom
PG-StromPG-Strom
PG-Strom
 
Case Study: Porting Qt for Embedded Linux on Embedded Processors
Case Study: Porting Qt for Embedded Linux on Embedded ProcessorsCase Study: Porting Qt for Embedded Linux on Embedded Processors
Case Study: Porting Qt for Embedded Linux on Embedded Processors
 
Power Mac G5 ( Late 2005) Technical Specifications
Power  Mac  G5 ( Late 2005)    Technical  SpecificationsPower  Mac  G5 ( Late 2005)    Technical  Specifications
Power Mac G5 ( Late 2005) Technical Specifications
 
Power Mac G5 ( Late 2005) Technical Specifications
Power  Mac  G5 ( Late 2005)    Technical  SpecificationsPower  Mac  G5 ( Late 2005)    Technical  Specifications
Power Mac G5 ( Late 2005) Technical Specifications
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
 
ArcGIS Server a Brief Synopsis
ArcGIS Server a Brief SynopsisArcGIS Server a Brief Synopsis
ArcGIS Server a Brief Synopsis
 
Introducing the ADSP BF609 Blackfin Processors
Introducing the ADSP BF609 Blackfin ProcessorsIntroducing the ADSP BF609 Blackfin Processors
Introducing the ADSP BF609 Blackfin Processors
 
Graphics Processing Unit by Saurabh
Graphics Processing Unit by SaurabhGraphics Processing Unit by Saurabh
Graphics Processing Unit by Saurabh
 
Beagle board
Beagle boardBeagle board
Beagle board
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUs
 
Presentation power point (Advertising Upgrade))
Presentation power point (Advertising Upgrade))Presentation power point (Advertising Upgrade))
Presentation power point (Advertising Upgrade))
 
Advertising System Upgrade
Advertising System UpgradeAdvertising System Upgrade
Advertising System Upgrade
 
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
Maximizing Application Performance on Cray XT6 and XE6 Supercomputers DOD-MOD...
 
Ibm power7
Ibm power7Ibm power7
Ibm power7
 
Sun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentationSun sparc enterprise t5440 server technical presentation
Sun sparc enterprise t5440 server technical presentation
 
Exaflop In 2018 Hardware
Exaflop In 2018   HardwareExaflop In 2018   Hardware
Exaflop In 2018 Hardware
 
Stream Processing
Stream ProcessingStream Processing
Stream Processing
 
VMworld 2013: How Good is PCoIP - A Remoting Protocol Shootout
VMworld 2013: How Good is PCoIP - A Remoting Protocol ShootoutVMworld 2013: How Good is PCoIP - A Remoting Protocol Shootout
VMworld 2013: How Good is PCoIP - A Remoting Protocol Shootout
 
Amd accelerated computing -ufrj
Amd   accelerated computing -ufrjAmd   accelerated computing -ufrj
Amd accelerated computing -ufrj
 
In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)In-Network Acceleration with FPGA (MEMO)
In-Network Acceleration with FPGA (MEMO)
 

Mehr von inside-BigData.com

Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networksinside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoringinside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecastsinside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Updateinside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuninginside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODinside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Accelerationinside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficientlyinside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Erainside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computinginside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Clusterinside-BigData.com
 

Mehr von inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Kürzlich hochgeladen

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
GPUDirect RDMA and Green Multi-GPU Architectures

  • 1. GPUDirect RDMA and Green Multi-GPU Architectures
       GE Intelligent Platforms, Mil/Aero Embedded Computing
       Dustin Franklin, GPGPU Applications Engineer
       dustin.franklin@ge.com, 443.310.9812 (Washington, DC)
  • 2. What this talk is about
       • GPU Autonomy
       • GFLOPS/watt and SWAP
       • Project Denver
       • Exascale
  • 3. Without GPUDirect
       • In a standard plug & play OS, the two drivers have separate DMA buffers in system memory
       • Three transfers are needed to move data between the I/O endpoint and the GPU
       [diagram: I/O endpoint and GPU behind a PCIe switch; data detours through I/O driver space and GPU driver space in system memory]
       1. I/O endpoint DMAs into system memory
       2. CPU copies data from the I/O driver DMA buffer into the GPU DMA buffer
       3. GPU DMAs from system memory into GPU memory
  • 4. GPUDirect RDMA
       • I/O endpoint and GPU communicate directly; only one transfer is required
       • Traffic stays on the PCIe switch, with no CPU involvement in the DMA
       • An x86 CPU is still necessary in the system, to run the NVIDIA driver
       [diagram: I/O endpoint DMAs through the PCIe switch straight into GPU memory, bypassing CPU and system memory]
       1. I/O endpoint DMAs into GPU memory
  • 5. Endpoints
       • GPUDirect RDMA is flexible and works with a wide range of existing devices
         – Built on open PCIe standards
         – Any I/O device with a PCIe endpoint and a DMA engine can utilize GPUDirect RDMA
         – GPUDirect permeates both the frontend ingest and backend interconnects
           • FPGAs
           • Ethernet / InfiniBand adapters
           • Storage devices
           • Video capture cards
           • PCIe non-transparent (NT) ports
       • It’s free
         – Supported in CUDA 5.0 and Kepler
         – Users can leverage APIs to implement RDMA with 3rd-party endpoints in their system
       • Practically no integration required
         – No changes to device HW
         – No changes to CUDA algorithms
         – I/O device drivers need to use DMA addresses of the GPU instead of SYSRAM pages (a driver-side sketch follows below)
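       In practice the endpoint driver pins the CUDA buffer through the NVIDIA kernel interface (nv-p2p.h) and then feeds the returned GPU bus addresses to its own DMA engine, exactly as it would a system-memory scatter/gather list. A minimal kernel-side sketch follows; the (0, 0) token pair and the program_dma_descriptor() helper are illustrative assumptions, not part of the deck.

           /* Sketch: pin a CUDA allocation for third-party DMA (GPUDirect RDMA,
            * CUDA 5.0 / Kepler era kernel interface from nv-p2p.h). */
           #include "nv-p2p.h"

           #define GPU_PAGE_SIZE 0x10000              /* GPU pages are 64 KB */

           static struct nvidia_p2p_page_table *page_table;

           static void release_cb(void *data)
           {
               /* Invoked by the driver if the GPU buffer is torn down underneath us. */
               nvidia_p2p_free_page_table(page_table);
           }

           int map_gpu_buffer_for_dma(uint64_t gpu_va, uint64_t len)
           {
               int i, ret;

               /* Pin the GPU virtual range and obtain its bus-addressable pages. */
               ret = nvidia_p2p_get_pages(0, 0, gpu_va, len,
                                          &page_table, release_cb, NULL);
               if (ret)
                   return ret;

               /* Hand each GPU page to the endpoint's DMA engine in place of
                * SYSRAM pages. program_dma_descriptor() is hypothetical. */
               for (i = 0; i < page_table->entries; i++)
                   program_dma_descriptor(page_table->pages[i]->physical_address,
                                          GPU_PAGE_SIZE);
               return 0;
           }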
  • 6–7. Frontend Ingest (IPN251)
       [board diagram: ICS1572 XMC (Xilinx Virtex-6 FPGA + ADCs, RF signal ingest) on PCIe gen1 x8 into a PCIe switch; NVIDIA Kepler 384-core GPU with 2 GB GDDR5 on gen3 x16; Mellanox ConnectX-3 (InfiniBand / 10GigE) on gen3 x8; Intel Ivybridge quad-core with 2x 8 GB DDR3]

       DMA Size | latency, no RDMA | latency, with RDMA | Δ      | throughput, no RDMA | throughput, with RDMA
       16 KB    | 65.06 µs         | 4.09 µs            | ↓15.9X | 125 MB/s            | 2000 MB/s
       32 KB    | 77.38 µs         | 8.19 µs            | ↓9.5X  | 211 MB/s            | 2000 MB/s
       64 KB    | 124.03 µs        | 16.38 µs           | ↓7.6X  | 264 MB/s            | 2000 MB/s
       128 KB   | 208.26 µs        | 32.76 µs           | ↓6.4X  | 314 MB/s            | 2000 MB/s
       256 KB   | 373.57 µs        | 65.53 µs           | ↓5.7X  | 350 MB/s            | 2000 MB/s
       512 KB   | 650.52 µs        | 131.07 µs          | ↓5.0X  | 402 MB/s            | 2000 MB/s
       1024 KB  | 1307.90 µs       | 262.14 µs          | ↓4.9X  | 400 MB/s            | 2000 MB/s
       2048 KB  | 2574.33 µs       | 524.28 µs          | ↓4.9X  | 407 MB/s            | 2000 MB/s
  • 8. Pipeline with GPUDirect RDMA
       [timeline diagram: ADC/FPGA acquisition, FPGA DMA, CUDA processing, and GPU DMA overlap block-by-block]
       – FPGA DMA: transfer each block directly to the GPU via the PCIe switch
       – GPU CUDA: DSP kernels (FIR, FFT, etc.)
       – GPU DMA: transfer results to the next processor
  • 9. Pipeline without GPUDirect RDMA
       [timeline diagram: the extra hop through system memory adds a stage and delays each block]
       – FPGA DMA: transfer each block to system memory
       – GPU DMA: transfer from system memory to the GPU
       – GPU CUDA: DSP kernels (FIR, FFT, etc.)
       – GPU DMA: transfer results to the next processor
       (a pipelined streams sketch for the RDMA case follows below)
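       The block-by-block overlap above maps naturally onto CUDA streams: each block's ingest, DSP kernels, and egress are queued on one stream so the endpoint DMA and the SMs stay busy at the same time. A minimal host-side sketch, assuming a hypothetical fpga_dma_write_async() that starts the endpoint's DMA into GPU memory and respects the given stream's ordering, plus a placeholder dsp_kernels() CUDA kernel:

           // Sketch: pipelined ingest + DSP with GPUDirect RDMA and CUDA streams.
           #include <cuda_runtime.h>

           #define NUM_STREAMS 4
           #define BLOCK_BYTES (1 << 20)                      // 1 MB blocks (assumption)

           extern void fpga_dma_write_async(void *gpu_dst, size_t bytes, cudaStream_t s);
           __global__ void dsp_kernels(float *block, int n);  // FIR, FFT, etc.

           void run_pipeline(int num_blocks)
           {
               cudaStream_t stream[NUM_STREAMS];
               float *buf[NUM_STREAMS];

               for (int i = 0; i < NUM_STREAMS; i++) {
                   cudaStreamCreate(&stream[i]);
                   cudaMalloc(&buf[i], BLOCK_BYTES);          // RDMA target in GPU memory
               }

               for (int b = 0; b < num_blocks; b++) {
                   int s = b % NUM_STREAMS;
                   // 1. Endpoint DMAs the block straight into GPU memory (no SYSRAM hop)
                   fpga_dma_write_async(buf[s], BLOCK_BYTES, stream[s]);
                   // 2. DSP kernels start on that block as soon as it lands
                   dsp_kernels<<<256, 256, 0, stream[s]>>>(buf[s],
                                                           BLOCK_BYTES / sizeof(float));
                   // 3. Results would be forwarded to the next processor here
               }

               for (int i = 0; i < NUM_STREAMS; i++) {
                   cudaStreamSynchronize(stream[i]);
                   cudaFree(buf[i]);
                   cudaStreamDestroy(stream[i]);
               }
           }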
  • 10. Backend Interconnects
        • Utilize GPUDirect RDMA across the network for low-latency IPC and system scalability
        • Mellanox OFED integration with GPUDirect RDMA – Q2 2013 (a verbs sketch follows below)
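        With a GPUDirect-aware OFED stack such as the Mellanox integration referenced above, the application-level change is small: a device pointer from cudaMalloc() becomes a legal buffer for verbs registration, and the HCA then DMAs directly to and from GPU memory. A minimal sketch under that assumption (queue-pair setup and connection management omitted):

            /* Sketch: register GPU memory with InfiniBand verbs so the HCA can
             * RDMA into it directly (requires a GPUDirect-RDMA-enabled OFED). */
            #include <infiniband/verbs.h>
            #include <cuda_runtime.h>

            struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t bytes,
                                               void **gpu_buf)
            {
                cudaMalloc(gpu_buf, bytes);        /* buffer lives in GPU memory */

                /* The device pointer is registered like any host buffer; the HCA
                 * reaches it over PCIe peer-to-peer instead of system memory. */
                return ibv_reg_mr(pd, *gpu_buf, bytes,
                                  IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
            }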
  • 11. Topologies
        • GPUDirect RDMA works in many system topologies
          – Single I/O endpoint and single GPU
          – Single I/O endpoint and multiple GPUs, with or without a PCIe switch downstream of the CPU
          – Multiple I/O endpoints and single GPU
          – Multiple I/O endpoints and multiple GPUs
        [diagram: two independent DMA pipelines behind one PCIe switch, each pairing an I/O endpoint with a GPU, its GPU memory, and its own stream]
        (a per-pipeline sketch follows below)
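        Each DMA pipeline in the diagram is simply an endpoint paired with one GPU, so the host code selects the device per pipeline and gives it its own stream. A small sketch of that bookkeeping (device numbering and buffer size are assumptions):

            // Sketch: one CUDA device + stream per endpoint/GPU DMA pipeline.
            #include <cuda_runtime.h>

            struct Pipeline {
                int          device;   // GPU serving this endpoint
                cudaStream_t stream;   // per-pipeline work queue
                void        *buf;      // RDMA target in that GPU's memory
            };

            void init_pipelines(struct Pipeline *p, int count, size_t bytes)
            {
                for (int i = 0; i < count; i++) {
                    p[i].device = i;                 // assume pipeline i drives GPU i
                    cudaSetDevice(p[i].device);
                    cudaStreamCreate(&p[i].stream);
                    cudaMalloc(&p[i].buf, bytes);    // endpoint i DMAs into this buffer
                }
            }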
  • 12. Impacts of GPUDirect
        • Decreased latency
          – Eliminates the redundant copies over PCIe and the added latency from the CPU
          – ~5x reduction, depending on the I/O endpoint
          – Round-trip operations on the GPU complete in microseconds, not milliseconds → enables new CUDA applications
        • Increased PCIe efficiency and bandwidth
          – Bypasses system memory, MMU, and root complex: the limiting factors for GPU DMA transfers
        • Decreased CPU utilization
          – The CPU is no longer burning cycles shuffling data around for the GPU
          – System memory is no longer thrashed by the DMA engines on the endpoint and GPU
          – Go from 100% core utilization per GPU to < 10% per GPU → promotes multi-GPU
  • 13. Before GPUDirect…
        • Many multi-GPU systems had 2-GPU nodes
        • Additional GPUs quickly choked system memory and CPU resources
        • Dual-socket CPU designs were very common
        • Dual root complexes prevent P2P across CPUs (QPI/HT is untraversable)
        [diagram: dual-socket node, each CPU with its own system memory; one I/O endpoint and one GPU per socket]

        GFLOPS/watt | Xeon E5 | K20X  | system
        SGEMM       | 2.32    | 12.34 | 8.45
        DGEMM       | 1.14    | 5.19  | 3.61
  • 14. Rise of Multi-GPU
        • Higher GPU-to-CPU ratios are permitted by the increased GPU autonomy from GPUDirect
        • PCIe switches are integral to the design, for true CPU bypass and fully-connected P2P (see the peer-access sketch below)
        [diagram: single CPU with system memory; one PCIe switch fanning out to an I/O endpoint and four GPUs]

        GFLOPS/watt by GPU:CPU ratio | 1 to 1 | 4 to 1
        SGEMM                        | 8.45   | 10.97
        DGEMM                        | 3.61   | 4.64
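        Fully-connected P2P under a single switch is what CUDA exposes as peer access; once enabled, GPU-to-GPU copies stay on the switch and never touch system memory. A minimal sketch (the device count is an assumption):

            // Sketch: enable peer-to-peer access among GPUs behind one PCIe switch.
            #include <cuda_runtime.h>

            void enable_p2p(int num_gpus)
            {
                for (int a = 0; a < num_gpus; a++) {
                    cudaSetDevice(a);
                    for (int b = 0; b < num_gpus; b++) {
                        if (a == b) continue;
                        int ok = 0;
                        cudaDeviceCanAccessPeer(&ok, a, b);    // fails across dual root complexes
                        if (ok)
                            cudaDeviceEnablePeerAccess(b, 0);  // P2P copies now bypass SYSRAM
                    }
                }
            }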
  • 15. Nested PCIe switches
        • Nested hierarchies avert the 96-lane limit on current PCIe switches
        [diagram: top-level PCIe switch under the CPU, with two sub-switches each serving an I/O endpoint and four GPUs]

        GFLOPS/watt by GPU:CPU ratio | 1 to 1 | 4 to 1 | 8 to 1
        SGEMM                        | 8.45   | 10.97  | 11.61
        DGEMM                        | 3.61   | 4.64   | 4.91
  • 16. PCIe over Fiber
        • SWaP is our new limiting factor
        • PCIe over fiber-optic can interconnect expansion blades and assure GPU:CPU growth
        • Supports PCIe gen3 over at least 100 meters
        [diagram: host CPU and PCIe switch fanning out over fiber to four expansion blades, each with an I/O endpoint, a PCIe switch, and four GPUs, for 16 GPUs total]

        GFLOPS/watt by GPU:CPU ratio | 1 to 1 | 4 to 1 | 8 to 1 | 16 to 1
        SGEMM                        | 8.45   | 10.97  | 11.61  | 11.96
        DGEMM                        | 3.61   | 4.64   | 4.91   | 5.04
  • 17. Scalability – 10 petaflops
        Processor power consumption for 10 petaflops, by GPU:CPU ratio
              | 1 to 1     | 4 to 1     | 8 to 1     | 16 to 1
        SGEMM | 1184 kW    | 911 kW     | 862 kW     | 832 kW
        DGEMM | 2770 kW    | 2158 kW    | 2032 kW    | 1982 kW

        Yearly energy bill
              | 1 to 1     | 4 to 1     | 8 to 1     | 16 to 1
        SGEMM | $1,050,326 | $808,148   | $764,680   | $738,067
        DGEMM | $2,457,266 | $1,914,361 | $1,802,586 | $1,758,232

        Efficiency savings
              | 1 to 1     | 4 to 1     | 8 to 1     | 16 to 1
        SGEMM | --         | 23.05%     | 27.19%     | 29.73%
        DGEMM | --         | 22.09%     | 26.64%     | 28.45%
        (a worked check of these figures follows below)
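        The power rows follow directly from the GFLOPS/watt figures on the preceding slides. As a worked check of the 1-to-1 SGEMM column (year-round operation and an electricity rate of roughly $0.10/kWh are assumptions; the deck does not state them):

            P = \frac{10 \times 10^{6}\ \mathrm{GFLOPS}}{8.45\ \mathrm{GFLOPS/W}} \approx 1.18 \times 10^{6}\ \mathrm{W} \approx 1184\ \mathrm{kW}

            E_{\mathrm{yr}} = 1184\ \mathrm{kW} \times 8760\ \mathrm{h} \approx 1.04 \times 10^{7}\ \mathrm{kWh}

            \mathrm{Cost}_{\mathrm{yr}} \approx 1.04 \times 10^{7}\ \mathrm{kWh} \times \$0.10/\mathrm{kWh} \approx \$1.0\ \mathrm{M}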
  • 18. Road to Exascale
        • GPUDirect RDMA – integration with peripherals & interconnects for system scalability
        • Project Denver – hybrid CPU/GPU for heterogeneous compute
        • Project Osprey – parallel microarchitecture & fab process optimizations for power efficiency
        • Software – CUDA, OpenCL, OpenACC; drivers & management layer; hybrid O/S
  • 19. Project Denver
        • 64-bit ARMv8 architecture
        • ISA by ARM, chip by NVIDIA
        • ARM’s RISC-based approach aligns with NVIDIA’s perf/watt initiative
        • Unlike licensed Cortex-A53 and -A57 cores, NVIDIA’s cores are highly customized
        • Design flexibility required for tight CPU/GPU integration
  • 20. ceepie-geepie
        [SoC block diagram: four ARM cores with L2 cache alongside GPU SMs with L2 cache and a grid management + scheduler block; memory controllers; h.264 and crypto engines; SATA, USB, and DVI outputs (NPN240); PCIe root complex / endpoint]
  • 21. ceepie-geepie
        [diagram: two such CPU/GPU SoCs joined through a PCIe switch together with an I/O endpoint; one SoC exposes a PCIe root complex, the other a PCIe endpoint]
  • 22. Impacts of Project Denver
        • Heterogeneous compute
          – Share mixed work efficiently between CPU and GPU
          – Unified memory subsystem
        • Decreased latency
          – No PCIe required for synchronization and command queuing
        • Power efficiency
          – ARMv8 ISA
          – On-chip buses & interconnects
          – “4+1” power scaling
        • Natively boot & run operating systems
        • Direct connectivity with peripherals
        • Maximize multi-GPU
  • 23. Project Osprey
        • DARPA PERFECT – 75 GFLOPS/watt
  • 24. Summary
        • System-wide integration of GPUDirect RDMA
        • Increase the GPU:CPU ratio for better GFLOPS/watt
        • Utilize PCIe switches instead of dual-socket CPU/IOH designs
        • Exascale compute – what’s good for HPC is good for embedded/mobile
        • Denver & Osprey – what’s good for embedded/mobile is good for HPC
        questions? booth 201