SlideShare a Scribd company logo
1 of 21
Download to read offline
1
CUDA 6.0
Performance Report
April 2014
2
CUDA 6 Performance Report
CUDART CUDA Runtime Library
cuFFT Fast Fourier Transforms Library
cuBLAS Complete BLAS Library
cuSPARSE Sparse Matrix Library
cuRAND Random Number Generation (RNG) Library
NPP Performance Primitives for Image & Video Processing
Thrust Templated Parallel Algorithms & Data Structures
math.h C99 floating-point Library
Included in the CUDA Toolkit (free download):
developer.nvidia.com/cuda-toolkit
For more information on CUDA libraries:
developer.nvidia.com/gpu-accelerated-libraries
3
0
5
10
15
20
25
30
35
40
CUDA 5 CUDA 6 CUDA 5 CUDA 6
Back to Back Launches Launch and Synchronize
usec
1.5x
Faster
2.0x
Faster
CUDA 6: 2x Faster GPU Kernel Launches
Performance may vary based on OS version and motherboard configuration • CUDA 5.0 and CUDA 6.0 on Tesla K20
a<<<...>>>;
b<<<...>>>;
a<<<...>>>;
cudaDeviceSynchronize();
Dynamic Parallel Kernel Launches
4
cuFFT: Multi-dimensional FFTs
Real and complex
Single- and double-precision data types
1D, 2D and 3D batched transforms
Flexible input and output data layouts
New in
CUDA 6
XT interface supports dual-GPU cards
(Tesla K10, GeForce GTX690, …)
5
cuFFT: up to 700 GFLOPS
Performance may vary based on OS version and motherboard configuration
0
100
200
300
400
500
600
700
800
1 3 5 7 9 11 13 15 17 19 21 23 25
GFLOPS
log2(transform_size)
Single Precision
0
50
100
150
200
250
300
1 3 5 7 9 11 13 15 17 19 21 23 25
GFLOPS
log2(transform_size)
Double Precision
• cuFFT 6.0 on K40c, ECC ON, 32M elements, input and output data on device
1D Complex, Batched FFTs
Used in Audio Processing and as a Foundation for 2D and 3D FFTs
6
cuFFT: Consistently High Performance
0
100
200
300
400
500
600
700
800
1 100 10,000 1,000,000 100,000,000
GFLOPS
Transform Size
Single Precision
0
50
100
150
200
250
300
1 100 10,000 1,000,000 100,000,000
GFLOPS
Transform Size
Double Precision
• cuFFT 6.0 on K40c, ECC ON, 28M-33M elements, input and output data on devicePerformance may vary based on OS version and motherboard configuration
1D Complex, Batched FFTs
Used in Audio Processing and as a Foundation for 2D and 3D FFTs
7
0
20
40
60
80
100
120
140
cuFFT cuFFT-XT
ExecutionTime(ms)
0
1
2
3
4
5
6
7
8
9
10
cuFFT cuFFT-XT
ExecutionTime(ms)
cuFFT-XT: Boosts Performance on K10
• cuFFT 6.0 on K10, ECC ON, input and output data on devicePerformance may vary based on OS version and motherboard configuration
25%
Faster
65%
Faster
New in
CUDA 6
256x256x256 512x512x512
8
cuBLAS: Dense Linear Algebra on GPUs
Complete BLAS implementation plus useful extensions
Supports all 152 standard routines for single, double,
complex, and double complex
Host and device-callable interface
New in
CUDA 6
XT Interface for Level 3 BLAS
Distributed computations across multiple GPUs
Out-of-core streaming to GPU, no upper limit on matrix size
“Drop-in” BLAS intercepts CPU BLAS calls, streams to GPU
9
>3 TFLOPS single-precision
>1 TFLOPS double-precision
Performance may vary based on OS version and motherboard configuration
• cuBLAS 6.0 on K40m, ECC ON, input and output data on device
• m=n=k=4096, transpose=no, side=right, fill=lower
cuBLAS:
0
500
1000
1500
2000
2500
3000
3500
SGEMM
SSYMM
STRSM
SSYRK
CGEMM
CSYMM
CTRSM
CSYRK
DGEMM
DSYMM
DTRSM
DSYRK
ZGEMM
ZSYMM
ZTRSM
ZSYRK
Single Single Complex Double Double Complex
GFLOPS
10
cuBLAS: ZGEMM 5x Faster than MKL
• cuBLAS 6.0 on K40m, ECC ON, input and output data on device
• MKL 11.0.4 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHzPerformance may vary based on OS version and motherboard configuration
0
200
400
600
800
1000
1200
0 500 1000 1500 2000 2500 3000 3500 4000
GFLOPS
Matrix Dimension (m=n=k)
cuBLAS
MKL
11
cuBLAS-XT: Multi-GPU Performance Scaling
• cuBLAS-XT 6.0 on K10, ECC ON, input and output data on hostPerformance may vary based on OS version and motherboard configuration
2.2 TFLOPS
4.2 TFLOPS
6.0 TFLOPS
7.9 TFLOPS
0
1
2
3
4
5
6
7
8
1 x K10 2 x K10 3 x K10 4 x K10
New in
CUDA 6
16K x 16K SGEMM on Tesla K10
12
cuSPARSE: Sparse linear algebra routines
Optimized sparse linear algebra BLAS routines – matrix-
vector, matrix-matrix, triangular solve
Support for variety of formats (CSR, COO, block variants)
1.0
2.0
3.0
4.0
y1
y2
y3
y4
alpha + beta
1.0
6.0
4.0
7.0
3.02.0
5.0
y1
y2
y3
y4
New in
CUDA 6
Many improvements to triangular solvers,
Incomplete-LU, and Cholesky preconditioners
13
cuSPARSE: 5x Faster than MKL
• Average of s/c/d/z routines
• cuSPARSE 6.0 on K40m, ECC ON, input and output data on device
• MKL 11.0.4 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
• Matrices obtained from: http://www.cise.ufl.edu/research/sparse/matrices/Performance may vary based on OS version and motherboard configuration
0x
1x
2x
3x
4x
5x
6x
SpeedupoverMKL
Sparse Matrix x Dense Vector (SpMV)
14
cuRAND: Random Number Generation
Generating high quality random numbers in parallel is hard
Don’t do it yourself, use a library!
Pseudo- and Quasi-RNGs
Supports several output distributions
Statistical test results in documentation
New in
CUDA 6 Mersenne Twister 19937
15
cuRAND: Up to 75x Faster vs. Intel MKL
• cuRAND 6.0 on K40c, ECC ON, double-precision input and output data on device
• MKL 11.0.1 on Intel SandyBridge 6-core E5-2620 @ 2.0 GHzPerformance may vary based on OS version and motherboard configuration
0
2
4
6
8
10
12
14
16
Sobol32 MRG32k3a Sobol32 MRG32k3a Sobol32 MRG32k3a
Uniform Distribution Normal Distribution Log-Normal Distribution
Gsamples/sec
cuRAND
MKL
16
cuRAND: High Performance RNGs
Performance may vary based on OS version and motherboard configuration • cuRAND 6.0 on K40m, ECC ON, double precision input and output data on device
0
2
4
6
8
10
12
14
16
18
XORWOW Philox MRG32k3a MTGP32 Sobol32
Scrambled
Sobol64
Scrambled
Pseudo-random Quasi-random
Gsamples/sec
Uniform Distribution Normal Distribution Log-Normal Distribution
17
NPP: NVIDIA Performance Primitives
Over 5000 image and signal processing routines:
color transforms, geometric transforms, move operations, linear
filters, image & signal statistics, image & signal arithmetic, JPEG
building blocks, image segmentation
New in
CUDA 6
Over 500 new routines, including:
median filter, BGR/YUV conversion, 3D LUT color conversion,
improvements to JPEG primitives, plus many more
18
NPP Speedup vs. Intel IPP
Performance may vary based on OS version and motherboard configuration
• NPP 6.0 on K40m, input and output data on device
• IPP 7.0 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
5.7x
14.4x
17.8x
6.3x
12.9x
28.5x
0x
5x
10x
15x
20x
25x
30x
Image Set
(8-bit RGB)
Image Set Channel
(8-bit RGB)
Image Resize
(8-bit RGB)
Image Gaussian Filter
(32-bit float)
Color Conversion
8-bit YUV422 to 8-bit
RGB
JPEG 8x8 Forward
DCT
19
CUDA C++ Template Library
Template library for CUDA C++
Host and Device Containers that mimic the C++ STL
Optimized Algorithms for sort, reduce, scan, etc.
OpenMP Backend for portability
Also available on github: thrust.github.com
Allows applications and prototypes to be built quickly
20
Thrust Performance vs. Intel TBB
Performance may vary based on OS version and motherboard configuration
• Thrust v1.7.1 on K40m, ECC ON, input and output data on device
• TBB 4.2 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz
0x
2x
4x
6x
8x
10x
12x
14x
reduce transform scan sort
Speedup
Thrust vs. TBB on 32M integers
43.7x
24.8x
13.3x
5.2x
15.0x
5.8x
0x
10x
20x
30x
40x
50x
char short int long float double
Speedup
Thrust Sort vs. TBB on 32M samples
21
math.h: C99 floating-point library + extras
CUDA math.h is industry proven, high performance, accurate
•Basic: +, *, /, 1/, sqrt, FMA (all IEEE-754 accurate for float, double, all rounding modes)
•Exponentials: exp, exp2, log, log2, log10, ...
•Trigonometry: sin, cos, tan, asin, acos, atan2, sinh, cosh, asinh, acosh, ...
•Special functions: lgamma, tgamma, erf, erfc
•Utility: fmod, remquo, modf, trunc, round, ceil, floor, fabs, ...
•Extras: rsqrt, rcbrt, exp10, sinpi, sincos[pi], cospi, erfinv, erfcinv, normcdf[inv],...
New in
CUDA 6
• Over 80 new SIMD instructions
• Useful for video processing: _v*2, _v*4
• Cylindrical bessel: cyl_i{0,1}
• 1/hypotenuse: rhypot

More Related Content

What's hot

SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)Kohei KaiGai
 
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Ural-PDC
 
Let's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwLet's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwJan Holčapek
 
Delivering a new level of visual performance in an SoC AMD "Raven Ridge" APU
Delivering a new level of visual performance in an SoC AMD "Raven Ridge" APUDelivering a new level of visual performance in an SoC AMD "Raven Ridge" APU
Delivering a new level of visual performance in an SoC AMD "Raven Ridge" APUAMD
 
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~Kohei KaiGai
 
Extreme Linux Performance Monitoring and Tuning
Extreme Linux Performance Monitoring and TuningExtreme Linux Performance Monitoring and Tuning
Extreme Linux Performance Monitoring and TuningMilind Koyande
 
20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStrom20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStromKohei KaiGai
 
PG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU devicePG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU deviceKohei KaiGai
 
計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?Shinnosuke Furuya
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 
R&D work on pre exascale HPC systems
R&D work on pre exascale HPC systemsR&D work on pre exascale HPC systems
R&D work on pre exascale HPC systemsJoshua Mora
 
bcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challengesbcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challengesIO Visor Project
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrKohei KaiGai
 
dCUDA: Distributed GPU Computing with Hardware Overlap
 dCUDA: Distributed GPU Computing with Hardware Overlap dCUDA: Distributed GPU Computing with Hardware Overlap
dCUDA: Distributed GPU Computing with Hardware Overlapinside-BigData.com
 
Evolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO VisorEvolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO VisorLarry Lang
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storageKohei KaiGai
 

What's hot (20)

PG-Strom
PG-StromPG-Strom
PG-Strom
 
SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)
 
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
 
PROSE
PROSEPROSE
PROSE
 
Latest HPC News from NVIDIA
Latest HPC News from NVIDIALatest HPC News from NVIDIA
Latest HPC News from NVIDIA
 
Cuda
CudaCuda
Cuda
 
Let's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwLet's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdw
 
Delivering a new level of visual performance in an SoC AMD "Raven Ridge" APU
Delivering a new level of visual performance in an SoC AMD "Raven Ridge" APUDelivering a new level of visual performance in an SoC AMD "Raven Ridge" APU
Delivering a new level of visual performance in an SoC AMD "Raven Ridge" APU
 
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
 
Extreme Linux Performance Monitoring and Tuning
Extreme Linux Performance Monitoring and TuningExtreme Linux Performance Monitoring and Tuning
Extreme Linux Performance Monitoring and Tuning
 
20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStrom20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStrom
 
PG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU devicePG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU device
 
計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
R&D work on pre exascale HPC systems
R&D work on pre exascale HPC systemsR&D work on pre exascale HPC systems
R&D work on pre exascale HPC systems
 
bcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challengesbcc/BPF tools - Strategy, current tools, future challenges
bcc/BPF tools - Strategy, current tools, future challenges
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
 
dCUDA: Distributed GPU Computing with Hardware Overlap
 dCUDA: Distributed GPU Computing with Hardware Overlap dCUDA: Distributed GPU Computing with Hardware Overlap
dCUDA: Distributed GPU Computing with Hardware Overlap
 
Evolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO VisorEvolving Virtual Networking with IO Visor
Evolving Virtual Networking with IO Visor
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage
 

Viewers also liked

Q con shanghai2013-[刘海锋]-[京东文件系统简介]
Q con shanghai2013-[刘海锋]-[京东文件系统简介]Q con shanghai2013-[刘海锋]-[京东文件系统简介]
Q con shanghai2013-[刘海锋]-[京东文件系统简介]Michael Zhang
 
2014 Hpocon 姚仁捷 唯品会 - data driven ops
2014 Hpocon 姚仁捷   唯品会 - data driven ops2014 Hpocon 姚仁捷   唯品会 - data driven ops
2014 Hpocon 姚仁捷 唯品会 - data driven opsMichael Zhang
 
2014 GITC 帶上數據去創業 talkingdata—高铎
 2014 GITC 帶上數據去創業 talkingdata—高铎 2014 GITC 帶上數據去創業 talkingdata—高铎
2014 GITC 帶上數據去創業 talkingdata—高铎Michael Zhang
 
2014 Hpocon 吴磊 ucloud - 由点到面 提升公有云服务可用性
2014 Hpocon 吴磊   ucloud - 由点到面 提升公有云服务可用性2014 Hpocon 吴磊   ucloud - 由点到面 提升公有云服务可用性
2014 Hpocon 吴磊 ucloud - 由点到面 提升公有云服务可用性Michael Zhang
 
Hadoop Hardware @Twitter: Size does matter.
Hadoop Hardware @Twitter: Size does matter.Hadoop Hardware @Twitter: Size does matter.
Hadoop Hardware @Twitter: Size does matter.Michael Zhang
 
HKIX Upgrade to 100Gbps-Based Two-Tier Architecture
HKIX Upgrade to 100Gbps-Based Two-Tier ArchitectureHKIX Upgrade to 100Gbps-Based Two-Tier Architecture
HKIX Upgrade to 100Gbps-Based Two-Tier ArchitectureMichael Zhang
 
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]Q con shanghai2013-[韩军]-[超大型电商系统架构解密]
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]Michael Zhang
 
Lvs在大规模网络环境下的应用pukong
Lvs在大规模网络环境下的应用pukongLvs在大规模网络环境下的应用pukong
Lvs在大规模网络环境下的应用pukongMichael Zhang
 
Erlang分布式系统的的领域语言
Erlang分布式系统的的领域语言Erlang分布式系统的的领域语言
Erlang分布式系统的的领域语言Feng Yu
 
廣告系統在Docker/Mesos上的可靠性實踐
廣告系統在Docker/Mesos上的可靠性實踐廣告系統在Docker/Mesos上的可靠性實踐
廣告系統在Docker/Mesos上的可靠性實踐Michael Zhang
 

Viewers also liked (10)

Q con shanghai2013-[刘海锋]-[京东文件系统简介]
Q con shanghai2013-[刘海锋]-[京东文件系统简介]Q con shanghai2013-[刘海锋]-[京东文件系统简介]
Q con shanghai2013-[刘海锋]-[京东文件系统简介]
 
2014 Hpocon 姚仁捷 唯品会 - data driven ops
2014 Hpocon 姚仁捷   唯品会 - data driven ops2014 Hpocon 姚仁捷   唯品会 - data driven ops
2014 Hpocon 姚仁捷 唯品会 - data driven ops
 
2014 GITC 帶上數據去創業 talkingdata—高铎
 2014 GITC 帶上數據去創業 talkingdata—高铎 2014 GITC 帶上數據去創業 talkingdata—高铎
2014 GITC 帶上數據去創業 talkingdata—高铎
 
2014 Hpocon 吴磊 ucloud - 由点到面 提升公有云服务可用性
2014 Hpocon 吴磊   ucloud - 由点到面 提升公有云服务可用性2014 Hpocon 吴磊   ucloud - 由点到面 提升公有云服务可用性
2014 Hpocon 吴磊 ucloud - 由点到面 提升公有云服务可用性
 
Hadoop Hardware @Twitter: Size does matter.
Hadoop Hardware @Twitter: Size does matter.Hadoop Hardware @Twitter: Size does matter.
Hadoop Hardware @Twitter: Size does matter.
 
HKIX Upgrade to 100Gbps-Based Two-Tier Architecture
HKIX Upgrade to 100Gbps-Based Two-Tier ArchitectureHKIX Upgrade to 100Gbps-Based Two-Tier Architecture
HKIX Upgrade to 100Gbps-Based Two-Tier Architecture
 
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]Q con shanghai2013-[韩军]-[超大型电商系统架构解密]
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]
 
Lvs在大规模网络环境下的应用pukong
Lvs在大规模网络环境下的应用pukongLvs在大规模网络环境下的应用pukong
Lvs在大规模网络环境下的应用pukong
 
Erlang分布式系统的的领域语言
Erlang分布式系统的的领域语言Erlang分布式系统的的领域语言
Erlang分布式系统的的领域语言
 
廣告系統在Docker/Mesos上的可靠性實踐
廣告系統在Docker/Mesos上的可靠性實踐廣告系統在Docker/Mesos上的可靠性實踐
廣告系統在Docker/Mesos上的可靠性實踐
 

Similar to Cuda 6 performance_report

Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLinside-BigData.com
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersCastLabKAIST
 
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptxQ1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptxMemory Fabric Forum
 
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with UnivaNVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univainside-BigData.com
 
Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9inside-BigData.com
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Akihiro Hayashi
 
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdf
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdfNVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdf
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdfMuhammadAbdullah311866
 
Jetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesJetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesDustin Franklin
 
GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)Fatima Qayyum
 
POLYTEDA PowerDRC/LVS overview
POLYTEDA PowerDRC/LVS overviewPOLYTEDA PowerDRC/LVS overview
POLYTEDA PowerDRC/LVS overviewAlexander Grudanov
 
PLNOG16: Obsługa 100M pps na platformie PC , Przemysław Frasunek, Paweł Mała...
PLNOG16: Obsługa 100M pps na platformie PC, Przemysław Frasunek, Paweł Mała...PLNOG16: Obsługa 100M pps na platformie PC, Przemysław Frasunek, Paweł Mała...
PLNOG16: Obsługa 100M pps na platformie PC , Przemysław Frasunek, Paweł Mała...PROIDEA
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architectureDhaval Kaneria
 
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」PC Cluster Consortium
 
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mãoWebinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mãoEmbarcados
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda enKohei KaiGai
 
Steen_Dissertation_March5
Steen_Dissertation_March5Steen_Dissertation_March5
Steen_Dissertation_March5Steen Larsen
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to AcceleratorsDilum Bandara
 
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAMaking the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAFacultad de Informática UCM
 
MT58 High performance graphics for VDI: A technical discussion
MT58 High performance graphics for VDI: A technical discussionMT58 High performance graphics for VDI: A technical discussion
MT58 High performance graphics for VDI: A technical discussionDell EMC World
 

Similar to Cuda 6 performance_report (20)

Hardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and MLHardware & Software Platforms for HPC, AI and ML
Hardware & Software Platforms for HPC, AI and ML
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud Datacenters
 
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptxQ1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
Q1 Memory Fabric Forum: Using CXL with AI Applications - Steve Scargall.pptx
 
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with UnivaNVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
NVIDIA GPUs Power HPC & AI Workloads in Cloud with Univa
 
Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdf
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdfNVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdf
NVIDIA DGX User Group 1st Meet Up_30 Apr 2021.pdf
 
Jetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous MachinesJetson AGX Xavier and the New Era of Autonomous Machines
Jetson AGX Xavier and the New Era of Autonomous Machines
 
GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)GPU Architecture NVIDIA (GTX GeForce 480)
GPU Architecture NVIDIA (GTX GeForce 480)
 
uCluster
uClusteruCluster
uCluster
 
POLYTEDA PowerDRC/LVS overview
POLYTEDA PowerDRC/LVS overviewPOLYTEDA PowerDRC/LVS overview
POLYTEDA PowerDRC/LVS overview
 
PLNOG16: Obsługa 100M pps na platformie PC , Przemysław Frasunek, Paweł Mała...
PLNOG16: Obsługa 100M pps na platformie PC, Przemysław Frasunek, Paweł Mała...PLNOG16: Obsługa 100M pps na platformie PC, Przemysław Frasunek, Paweł Mała...
PLNOG16: Obsługa 100M pps na platformie PC , Przemysław Frasunek, Paweł Mała...
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
 
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
PCCC23:筑波大学計算科学研究センター テーマ1「スーパーコンピュータCygnus / Pegasus」
 
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mãoWebinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
Webinar: NVIDIA JETSON – A Inteligência Artificial na palma de sua mão
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
Steen_Dissertation_March5
Steen_Dissertation_March5Steen_Dissertation_March5
Steen_Dissertation_March5
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
 
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAMaking the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
 
MT58 High performance graphics for VDI: A technical discussion
MT58 High performance graphics for VDI: A technical discussionMT58 High performance graphics for VDI: A technical discussion
MT58 High performance graphics for VDI: A technical discussion
 

More from Michael Zhang

Fastsocket Linxiaofeng
Fastsocket LinxiaofengFastsocket Linxiaofeng
Fastsocket LinxiaofengMichael Zhang
 
2014 Hpocon 李志刚 1号店 - puppet在1号店的实践
2014 Hpocon 李志刚   1号店 - puppet在1号店的实践2014 Hpocon 李志刚   1号店 - puppet在1号店的实践
2014 Hpocon 李志刚 1号店 - puppet在1号店的实践Michael Zhang
 
2014 Hpocon 高驰涛 云智慧 - apm在高性能架构中的应用
2014 Hpocon 高驰涛   云智慧 - apm在高性能架构中的应用2014 Hpocon 高驰涛   云智慧 - apm在高性能架构中的应用
2014 Hpocon 高驰涛 云智慧 - apm在高性能架构中的应用Michael Zhang
 
2014 Hpocon 黄慧攀 upyun - 平台架构的服务监控
2014 Hpocon 黄慧攀   upyun - 平台架构的服务监控2014 Hpocon 黄慧攀   upyun - 平台架构的服务监控
2014 Hpocon 黄慧攀 upyun - 平台架构的服务监控Michael Zhang
 
2014 Hpocon 周辉 大众点评 - 大众点评混合开发模式下的加速尝试
2014 Hpocon 周辉   大众点评 - 大众点评混合开发模式下的加速尝试2014 Hpocon 周辉   大众点评 - 大众点评混合开发模式下的加速尝试
2014 Hpocon 周辉 大众点评 - 大众点评混合开发模式下的加速尝试Michael Zhang
 
The Data Center and Hadoop
The Data Center and HadoopThe Data Center and Hadoop
The Data Center and HadoopMichael Zhang
 
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]Michael Zhang
 
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]Michael Zhang
 
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]Q con shanghai2013-[黄舒泉]-[intel it openstack practice]
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]Michael Zhang
 
Q con shanghai2013-罗婷-performance methodology
Q con shanghai2013-罗婷-performance methodologyQ con shanghai2013-罗婷-performance methodology
Q con shanghai2013-罗婷-performance methodologyMichael Zhang
 
Q con shanghai2013-赵永明-ats与cdn实践
Q con shanghai2013-赵永明-ats与cdn实践Q con shanghai2013-赵永明-ats与cdn实践
Q con shanghai2013-赵永明-ats与cdn实践Michael Zhang
 
Q con shanghai2013- 荣先乾-qzone_touch跨终端优化_v2.0
Q con shanghai2013- 荣先乾-qzone_touch跨终端优化_v2.0Q con shanghai2013- 荣先乾-qzone_touch跨终端优化_v2.0
Q con shanghai2013- 荣先乾-qzone_touch跨终端优化_v2.0Michael Zhang
 
Q con shanghai2013-黄慧攀-又拍云cdn技术探秘
Q con shanghai2013-黄慧攀-又拍云cdn技术探秘Q con shanghai2013-黄慧攀-又拍云cdn技术探秘
Q con shanghai2013-黄慧攀-又拍云cdn技术探秘Michael Zhang
 
Jedex stec DRAM Module Market Overview
Jedex stec DRAM Module Market  OverviewJedex stec DRAM Module Market  Overview
Jedex stec DRAM Module Market OverviewMichael Zhang
 
Percona live linux filesystems and my sql
Percona live   linux filesystems and my sqlPercona live   linux filesystems and my sql
Percona live linux filesystems and my sqlMichael Zhang
 
Velocity china2012kit life on edge —— 如何使用 esi 完成任务
Velocity china2012kit life on edge —— 如何使用 esi 完成任务Velocity china2012kit life on edge —— 如何使用 esi 完成任务
Velocity china2012kit life on edge —— 如何使用 esi 完成任务Michael Zhang
 
加快互联网核心协议,提高Web速度yuchungcheng
加快互联网核心协议,提高Web速度yuchungcheng加快互联网核心协议,提高Web速度yuchungcheng
加快互联网核心协议,提高Web速度yuchungchengMichael Zhang
 
前瞻性Web性能优化pwpo
前瞻性Web性能优化pwpo前瞻性Web性能优化pwpo
前瞻性Web性能优化pwpoMichael Zhang
 

More from Michael Zhang (20)

Fastsocket Linxiaofeng
Fastsocket LinxiaofengFastsocket Linxiaofeng
Fastsocket Linxiaofeng
 
Spark sql meetup
Spark sql meetupSpark sql meetup
Spark sql meetup
 
2014 Hpocon 李志刚 1号店 - puppet在1号店的实践
2014 Hpocon 李志刚   1号店 - puppet在1号店的实践2014 Hpocon 李志刚   1号店 - puppet在1号店的实践
2014 Hpocon 李志刚 1号店 - puppet在1号店的实践
 
2014 Hpocon 高驰涛 云智慧 - apm在高性能架构中的应用
2014 Hpocon 高驰涛   云智慧 - apm在高性能架构中的应用2014 Hpocon 高驰涛   云智慧 - apm在高性能架构中的应用
2014 Hpocon 高驰涛 云智慧 - apm在高性能架构中的应用
 
2014 Hpocon 黄慧攀 upyun - 平台架构的服务监控
2014 Hpocon 黄慧攀   upyun - 平台架构的服务监控2014 Hpocon 黄慧攀   upyun - 平台架构的服务监控
2014 Hpocon 黄慧攀 upyun - 平台架构的服务监控
 
2014 Hpocon 周辉 大众点评 - 大众点评混合开发模式下的加速尝试
2014 Hpocon 周辉   大众点评 - 大众点评混合开发模式下的加速尝试2014 Hpocon 周辉   大众点评 - 大众点评混合开发模式下的加速尝试
2014 Hpocon 周辉 大众点评 - 大众点评混合开发模式下的加速尝试
 
The Data Center and Hadoop
The Data Center and HadoopThe Data Center and Hadoop
The Data Center and Hadoop
 
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]
 
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]
 
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]Q con shanghai2013-[黄舒泉]-[intel it openstack practice]
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]
 
Q con shanghai2013-罗婷-performance methodology
Q con shanghai2013-罗婷-performance methodologyQ con shanghai2013-罗婷-performance methodology
Q con shanghai2013-罗婷-performance methodology
 
Q con shanghai2013-赵永明-ats与cdn实践
Q con shanghai2013-赵永明-ats与cdn实践Q con shanghai2013-赵永明-ats与cdn实践
Q con shanghai2013-赵永明-ats与cdn实践
 
Q con shanghai2013- 荣先乾-qzone_touch跨终端优化_v2.0
Q con shanghai2013- 荣先乾-qzone_touch跨终端优化_v2.0Q con shanghai2013- 荣先乾-qzone_touch跨终端优化_v2.0
Q con shanghai2013- 荣先乾-qzone_touch跨终端优化_v2.0
 
Q con shanghai2013-黄慧攀-又拍云cdn技术探秘
Q con shanghai2013-黄慧攀-又拍云cdn技术探秘Q con shanghai2013-黄慧攀-又拍云cdn技术探秘
Q con shanghai2013-黄慧攀-又拍云cdn技术探秘
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
Jedex stec DRAM Module Market Overview
Jedex stec DRAM Module Market  OverviewJedex stec DRAM Module Market  Overview
Jedex stec DRAM Module Market Overview
 
Percona live linux filesystems and my sql
Percona live   linux filesystems and my sqlPercona live   linux filesystems and my sql
Percona live linux filesystems and my sql
 
Velocity china2012kit life on edge —— 如何使用 esi 完成任务
Velocity china2012kit life on edge —— 如何使用 esi 完成任务Velocity china2012kit life on edge —— 如何使用 esi 完成任务
Velocity china2012kit life on edge —— 如何使用 esi 完成任务
 
加快互联网核心协议,提高Web速度yuchungcheng
加快互联网核心协议,提高Web速度yuchungcheng加快互联网核心协议,提高Web速度yuchungcheng
加快互联网核心协议,提高Web速度yuchungcheng
 
前瞻性Web性能优化pwpo
前瞻性Web性能优化pwpo前瞻性Web性能优化pwpo
前瞻性Web性能优化pwpo
 

Recently uploaded

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 

Recently uploaded (20)

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 

Cuda 6 performance_report

  • 2. 2 CUDA 6 Performance Report CUDART CUDA Runtime Library cuFFT Fast Fourier Transforms Library cuBLAS Complete BLAS Library cuSPARSE Sparse Matrix Library cuRAND Random Number Generation (RNG) Library NPP Performance Primitives for Image & Video Processing Thrust Templated Parallel Algorithms & Data Structures math.h C99 floating-point Library Included in the CUDA Toolkit (free download): developer.nvidia.com/cuda-toolkit For more information on CUDA libraries: developer.nvidia.com/gpu-accelerated-libraries
  • 3. 3 0 5 10 15 20 25 30 35 40 CUDA 5 CUDA 6 CUDA 5 CUDA 6 Back to Back Launches Launch and Synchronize usec 1.5x Faster 2.0x Faster CUDA 6: 2x Faster GPU Kernel Launches Performance may vary based on OS version and motherboard configuration • CUDA 5.0 and CUDA 6.0 on Tesla K20 a<<<...>>>; b<<<...>>>; a<<<...>>>; cudaDeviceSynchronize(); Dynamic Parallel Kernel Launches
  • 4. 4 cuFFT: Multi-dimensional FFTs Real and complex Single- and double-precision data types 1D, 2D and 3D batched transforms Flexible input and output data layouts New in CUDA 6 XT interface supports dual-GPU cards (Tesla K10, GeForce GTX690, …)
  • 5. 5 cuFFT: up to 700 GFLOPS Performance may vary based on OS version and motherboard configuration 0 100 200 300 400 500 600 700 800 1 3 5 7 9 11 13 15 17 19 21 23 25 GFLOPS log2(transform_size) Single Precision 0 50 100 150 200 250 300 1 3 5 7 9 11 13 15 17 19 21 23 25 GFLOPS log2(transform_size) Double Precision • cuFFT 6.0 on K40c, ECC ON, 32M elements, input and output data on device 1D Complex, Batched FFTs Used in Audio Processing and as a Foundation for 2D and 3D FFTs
  • 6. 6 cuFFT: Consistently High Performance 0 100 200 300 400 500 600 700 800 1 100 10,000 1,000,000 100,000,000 GFLOPS Transform Size Single Precision 0 50 100 150 200 250 300 1 100 10,000 1,000,000 100,000,000 GFLOPS Transform Size Double Precision • cuFFT 6.0 on K40c, ECC ON, 28M-33M elements, input and output data on devicePerformance may vary based on OS version and motherboard configuration 1D Complex, Batched FFTs Used in Audio Processing and as a Foundation for 2D and 3D FFTs
  • 7. 7 0 20 40 60 80 100 120 140 cuFFT cuFFT-XT ExecutionTime(ms) 0 1 2 3 4 5 6 7 8 9 10 cuFFT cuFFT-XT ExecutionTime(ms) cuFFT-XT: Boosts Performance on K10 • cuFFT 6.0 on K10, ECC ON, input and output data on devicePerformance may vary based on OS version and motherboard configuration 25% Faster 65% Faster New in CUDA 6 256x256x256 512x512x512
  • 8. 8 cuBLAS: Dense Linear Algebra on GPUs Complete BLAS implementation plus useful extensions Supports all 152 standard routines for single, double, complex, and double complex Host and device-callable interface New in CUDA 6 XT Interface for Level 3 BLAS Distributed computations across multiple GPUs Out-of-core streaming to GPU, no upper limit on matrix size “Drop-in” BLAS intercepts CPU BLAS calls, streams to GPU
  • 9. 9 >3 TFLOPS single-precision >1 TFLOPS double-precision Performance may vary based on OS version and motherboard configuration • cuBLAS 6.0 on K40m, ECC ON, input and output data on device • m=n=k=4096, transpose=no, side=right, fill=lower cuBLAS: 0 500 1000 1500 2000 2500 3000 3500 SGEMM SSYMM STRSM SSYRK CGEMM CSYMM CTRSM CSYRK DGEMM DSYMM DTRSM DSYRK ZGEMM ZSYMM ZTRSM ZSYRK Single Single Complex Double Double Complex GFLOPS
  • 10. 10 cuBLAS: ZGEMM 5x Faster than MKL • cuBLAS 6.0 on K40m, ECC ON, input and output data on device • MKL 11.0.4 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHzPerformance may vary based on OS version and motherboard configuration 0 200 400 600 800 1000 1200 0 500 1000 1500 2000 2500 3000 3500 4000 GFLOPS Matrix Dimension (m=n=k) cuBLAS MKL
  • 11. 11 cuBLAS-XT: Multi-GPU Performance Scaling • cuBLAS-XT 6.0 on K10, ECC ON, input and output data on hostPerformance may vary based on OS version and motherboard configuration 2.2 TFLOPS 4.2 TFLOPS 6.0 TFLOPS 7.9 TFLOPS 0 1 2 3 4 5 6 7 8 1 x K10 2 x K10 3 x K10 4 x K10 New in CUDA 6 16K x 16K SGEMM on Tesla K10
  • 12. 12 cuSPARSE: Sparse linear algebra routines Optimized sparse linear algebra BLAS routines – matrix- vector, matrix-matrix, triangular solve Support for variety of formats (CSR, COO, block variants) 1.0 2.0 3.0 4.0 y1 y2 y3 y4 alpha + beta 1.0 6.0 4.0 7.0 3.02.0 5.0 y1 y2 y3 y4 New in CUDA 6 Many improvements to triangular solvers, Incomplete-LU, and Cholesky preconditioners
  • 13. 13 cuSPARSE: 5x Faster than MKL • Average of s/c/d/z routines • cuSPARSE 6.0 on K40m, ECC ON, input and output data on device • MKL 11.0.4 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz • Matrices obtained from: http://www.cise.ufl.edu/research/sparse/matrices/Performance may vary based on OS version and motherboard configuration 0x 1x 2x 3x 4x 5x 6x SpeedupoverMKL Sparse Matrix x Dense Vector (SpMV)
  • 14. 14 cuRAND: Random Number Generation Generating high quality random numbers in parallel is hard Don’t do it yourself, use a library! Pseudo- and Quasi-RNGs Supports several output distributions Statistical test results in documentation New in CUDA 6 Mersenne Twister 19937
  • 15. 15 cuRAND: Up to 75x Faster vs. Intel MKL • cuRAND 6.0 on K40c, ECC ON, double-precision input and output data on device • MKL 11.0.1 on Intel SandyBridge 6-core E5-2620 @ 2.0 GHzPerformance may vary based on OS version and motherboard configuration 0 2 4 6 8 10 12 14 16 Sobol32 MRG32k3a Sobol32 MRG32k3a Sobol32 MRG32k3a Uniform Distribution Normal Distribution Log-Normal Distribution Gsamples/sec cuRAND MKL
  • 16. 16 cuRAND: High Performance RNGs Performance may vary based on OS version and motherboard configuration • cuRAND 6.0 on K40m, ECC ON, double precision input and output data on device 0 2 4 6 8 10 12 14 16 18 XORWOW Philox MRG32k3a MTGP32 Sobol32 Scrambled Sobol64 Scrambled Pseudo-random Quasi-random Gsamples/sec Uniform Distribution Normal Distribution Log-Normal Distribution
  • 17. 17 NPP: NVIDIA Performance Primitives Over 5000 image and signal processing routines: color transforms, geometric transforms, move operations, linear filters, image & signal statistics, image & signal arithmetic, JPEG building blocks, image segmentation New in CUDA 6 Over 500 new routines, including: median filter, BGR/YUV conversion, 3D LUT color conversion, improvements to JPEG primitives, plus many more
  • 18. 18 NPP Speedup vs. Intel IPP Performance may vary based on OS version and motherboard configuration • NPP 6.0 on K40m, input and output data on device • IPP 7.0 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz 5.7x 14.4x 17.8x 6.3x 12.9x 28.5x 0x 5x 10x 15x 20x 25x 30x Image Set (8-bit RGB) Image Set Channel (8-bit RGB) Image Resize (8-bit RGB) Image Gaussian Filter (32-bit float) Color Conversion 8-bit YUV422 to 8-bit RGB JPEG 8x8 Forward DCT
  • 19. 19 CUDA C++ Template Library Template library for CUDA C++ Host and Device Containers that mimic the C++ STL Optimized Algorithms for sort, reduce, scan, etc. OpenMP Backend for portability Also available on github: thrust.github.com Allows applications and prototypes to be built quickly
  • 20. 20 Thrust Performance vs. Intel TBB Performance may vary based on OS version and motherboard configuration • Thrust v1.7.1 on K40m, ECC ON, input and output data on device • TBB 4.2 on Intel IvyBridge 12-core E5-2697 v2 @ 2.70GHz 0x 2x 4x 6x 8x 10x 12x 14x reduce transform scan sort Speedup Thrust vs. TBB on 32M integers 43.7x 24.8x 13.3x 5.2x 15.0x 5.8x 0x 10x 20x 30x 40x 50x char short int long float double Speedup Thrust Sort vs. TBB on 32M samples
  • 21. 21 math.h: C99 floating-point library + extras CUDA math.h is industry proven, high performance, accurate •Basic: +, *, /, 1/, sqrt, FMA (all IEEE-754 accurate for float, double, all rounding modes) •Exponentials: exp, exp2, log, log2, log10, ... •Trigonometry: sin, cos, tan, asin, acos, atan2, sinh, cosh, asinh, acosh, ... •Special functions: lgamma, tgamma, erf, erfc •Utility: fmod, remquo, modf, trunc, round, ceil, floor, fabs, ... •Extras: rsqrt, rcbrt, exp10, sinpi, sincos[pi], cospi, erfinv, erfcinv, normcdf[inv],... New in CUDA 6 • Over 80 new SIMD instructions • Useful for video processing: _v*2, _v*4 • Cylindrical bessel: cyl_i{0,1} • 1/hypotenuse: rhypot