SlideShare ist ein Scribd-Unternehmen logo
1 von 25
Downloaden Sie, um offline zu lesen
GTC 2016
Kazuaki Ishizaki (kiszk@acm.org) +, Gita Koblents -,
Alon Shalev Housfater -, Jimmy Kwa -, Marcel Mitran –,
Akihiro Hayashi *, Vivek Sarkar *
+ IBM Research – Tokyo
- IBM Canada
* Rice University
Easy and High Performance
GPU Programming for Java Programmers
1
Java Program Runs on GPU with IBM Java 8
2 Easy and High Performance GPU Programming for Java Programmers
http://www-01.ibm.com/support/docview.wss?uid=swg21696670 https://devblogs.nvidia.com/parallelforall/
next-wave-enterprise-performance-java-power-systems-nvidia-gpus/
Java Meets GPUs
3 Easy and High Performance GPU Programming for Java Programmers
What You Will Learn from this Talk
 How to program GPUs in pure Java
– using standard parallel stream APIs
 How IBM Java 8 runtime executes the parallel program on
GPUs
– with optimizations without annotations
 GPU read-only cache exploitation
 data copy reductions between CPU and GPU
 exception check eliminations for Java
 Achieve good performance results using one K40 card with
– 58.9x over 1-CPU-thread sequential execution on POWER8
– 3.7x over 160-CPU-thread parallel execution on POWER8
4 Easy and High Performance GPU Programming for Java Programmers
Outline
 Goal
 Motivation
 How to Write a Parallel Program in Java
 Overview of IBM Java 8 Runtime
 Performance Evaluation
 Conclusion
5 Easy and High Performance GPU Programming for Java Programmers
Why We Want to Use Java for GPU Programming
 High productivity
– Safety and flexibility
– Good program portability among different machines
 “write once, run anywhere”
– Ease of writing a program
 Hard to use CUDA and OpenCL for non-expert programmers
 Many computation-intensive applications in non-HPC area
– Data analytics and data science (Hadoop, Spark, etc.)
– Security analysis (events in log files)
– Natural language processing (messages in social network system)
6 Easy and High Performance GPU Programming for Java Programmers From https://www.flickr.com/photos/dlato/5530553658
Programmability of CUDA vs. Java for GPUs
 CUDA requires programmers to explicitly write operations for
– managing device memories
– copying data
between CPU and GPU
– expressing parallelism
 Java 8 enables programmers
to just focus on
– expressing parallelism
7 Easy and High Performance GPU Programming for Java Programmers
// code for GPU
__global__ void GPU(float* d_a, float* d_b, int n) {
int i = threadIdx.x;
if (n <= i) return;
d_b[i] = d_a[i] * 2.0;
}
void fooJava(float A[], float B[], int n) {
// similar to for (idx = 0; i < n; i++)
IntStream.range(0, N).parallel().forEach(i -> {
b[i] = a[i] * 2.0;
});
}
void fooCUDA(N, float *A, float *B, int N) {
int sizeN = N * sizeof(float);
cudaMalloc(&d_A, sizeN); cudaMalloc(&d_B, sizeN);
cudaMemcpy(d_A, A, sizeN, HostToDevice);
GPU<<<N, 1>>>(d_A, d_B, N);
cudaMemcpy(B, d_B, sizeN, DeviceToHost);
cudaFree(d_B); cudaFree(d_A);
}
Safety and Flexibility in Java
 Automatic memory management
– No memory leak
 Object-oriented
 Exception checks
– No unsafe
memory accesses
8 Easy and High Performance GPU Programming for Java Programmers
float[] a = new float[N], b = new float[N]
new Par().foo(a, b, N)
// unnecessary to explicitly free a[] and b[]
class Par {
void foo(float[] a, float[] b, int n) {
// similar to for (idx = 0; i < n; i++)
IntStream.range(0, N).parallel().forEach(i -> {
// throw an exception if
// a[] == null, b[] = null
// i < 0, a.length <= i, b.length <= i
b[i] = a[i] * 2.0;
});
}
}
Portability among Different Hardware
 How a Java program works
– ‘javac’ command creates machine-independent Java bytecode
– ‘java’ command launches Java runtime with Java bytecode
 An interpreter executes a program by processing each Java bytecode
 A just-in-time compiler generates native instructions for a target machine
from Java bytecode of a hotspot method
9 Easy and High Performance GPU Programming for Java Programmers
Java
program
(.java)
Java
bytecode
(.class,
.jar)
Java runtime
Target machine
Interpreter
just-in-time
compiler> javac Seq.java > java Seq
Outline
 Goal
 Motivation
 How to Write a Parallel Program in Java
 Overview of IBM Java 8 Runtime
 Performance Evaluation
 Conclusion
10 Easy and High Performance GPU Programming for Java Programmers
How to Write a Parallel Loop in Java 8
 Express parallelism by using parallel stream APIs
among iterations of a lambda expression (index variable: i)
11 Easy and High Performance GPU Programming for Java Programmers
IntStream.range(0, 5).parallel().
forEach(i -> { System.out.println(i);}); 0
3
2
4
1
Example
Reference implementation of Java 8 can execute this
on multiple CPU threads
println(0) on thread 0
println(3) on thread 1
println(2) on thread 2
println(4) on thread 3
println(1) on thread 0
time
Outline
 Goal
 Motivation
 How to Write and Execute a Parallel Program in Java
 Overview of IBM Java 8 Runtime
 Performance Evaluation
 Conclusion
12 Easy and High Performance GPU Programming for Java Programmers
Portability among Different Hardware (including GPUs)
 A just-in-time compiler in IBM Java 8 runtime generates
native instructions
– for a target machine including GPUs from Java bytecode
– for GPU which exploit device-specific capabilities more easily than
OpenCL
13 Easy and High Performance GPU Programming for Java Programmers
Java
program
(.java)
Java
bytecode
(.class,
.jar)
IBM Java 8 runtime
Target machine
Interpreter
just-in-time
compiler
> javac Par.java > java Par for GPU
IntStream.range(0, n)
.parallel().forEach(i -> {
...
});
IBM Java 8 Can Execute the Code on CPU or GPU
 Generate code for GPU execution from a parallel loop
– GPU instructions for code in blue
– CPU instructions for GPU memory manage and data copy
 Execute this loop on CPU or GPU base on cost model
– e.g., execute this on CPU if ‘n’ is very small
14 Easy and High Performance GPU Programming for Java Programmers
class Par {
void foo(float[] a, float[] b, float[] c, int n) {
IntStream.range(0, n).parallel().forEach(i -> {
b[i] = a[i] * 2.0;
c[i] = a[i] * 3.0;
});
}
}
Note: GPU support in current version is limited to lambdas with one-dimensional arrays and primitive types
Optimizations for GPUs in IBM Just-In-Time Compiler
 Using read-only cache
– reduce # of memory transactions to a GPU global memory
 Optimizing data copy between CPU and GPU
– reduce amount of data copy
 Eliminating redundant exception checks for Java on GPU
– reduce # of instructions in GPU binary
15 Easy and High Performance GPU Programming for Java Programmers
Using Read-Only Cache
 Automatically detect a read-only array and access it thru read-
only cache
– read-only cache is faster than other memories in GPU
16 Easy and High Performance GPU Programming for Java Programmers
float[] A = new float[N], B = new float[N], C = new float[N];
foo(A, B, C, N);
void foo(float[] a, float[] b, float[] c, int n) {
IntStream.range(0, n).parallel().forEach(i -> {
b[i] = a[i] * 2.0;
c[i] = a[i] * 3.0;
});
}
Equivalent to CUDA code
__device__ foo(*a, *b, *c, N)
b[i] = __ldg(&a[i]) * 2.0;
c[i] = __ldg(&a[i]) * 3.0;
}
Optimizing Data Copy between CPU and GPU
 Eliminate data copy from GPU to CPU
– if an array (e.g., a[]) is not written on GPU
 Eliminate data copy from CPU to GPU
– if an array (e.g., b[] and c[]) is not read on GPU
17 Easy and High Performance GPU Programming for Java Programmers
void foo(float[] a, float[] b, float[] c, int n) {
// Data copy for a[] from CPU to GPU
// No data copy for b[] and c[]
IntStream.range(0, n).parallel().forEach(i -> {
b[i] = a[i] * 2.0;
c[i] = a[i] * 3.0;
});
// Data copy for b[] and c[] from GPU to CPU
// No data copy for a[]
}
Optimizing Data Copy between CPU and GPU
 Eliminate data copy between CPU and GPU
– if an array (e.g., a[] and b[]), which was accessed on GPU, is not
accessed on CPU
18 Easy and High Performance GPU Programming for Java Programmers
// Data copy for a[] from CPU to GPU
for (int t = 0; t < T; t++) {
IntStream.range(0, N*N).parallel().forEach(idx -> {
b[idx] = a[...];
});
// No data copy for b[] between GPU and CPU
IntStream.range(0, N*N).parallel().forEach(idx -> {
a[idx] = b[...];
}
// No data copy for a[] between GPU and CPU
}
// Data copy for a[] and b[] from GPU to CPU
How to Support Exception Checks on GPUs
 IBM just-in-time compiler inserts exception checks in GPU
kernel
19 Easy and High Performance GPU Programming for Java Programmers
// code for CPU
{
...
launch GPUkernel(...)
if (exception) {
goto handle_exception;
}
...
}
__device__ GPUkernel(…) {
int i = ...;
if ((a == NULL) || i < 0 || a.length <= i) {
exception = true; return; }
if ((b == NULL) || b.length <= i) {
exception = true; return; }
b[i] = a[i] * 2.0;
if ((c == NULL) || c.length <= i) {
exception = true; return; }
c[i] = a[i] * 3.0;
}
// Java program
IntStream.range(0,n).parallel().
forEach(i -> {
b[i] = a[i] * 2.0;
c[i] = a[i] * 3.0;
});
Eliminating Redundant Exception Checks
 Speculatively perform exception checks on CPU if the form of
an array index is simple (xi + y)
20 Easy and High Performance GPU Programming for Java Programmers
// code for CPU
if (
// check conditions for null pointer
a != null && b != null && c != null &&
// check conditions for out of bounds of array index
0 <= a.length && a.length < n &&
0 <= b.length && b.length < n &&
0 <= c.length && c.length < n) {
...
launch GPUkernel(...)
...
} else {
// execute this loop on CPU to produce an exception
}
__device__ GPUkernel(…) {
// no exception check is
// required
i = ...;
b[i] = a[i] * 2.0;
c[i] = a[i] * 3.0;
}
IntStream.range(0,n).parallel().
forEach(i -> {
b[i] = a[i] * 2.0;
c[i] = a[i] * 3.0;
});
Outline
 Goal
 Motivation
 How to Write and Execute a Parallel Program in Java
 Overview of IBM Java 8 Runtime
 Performance Evaluation
 Conclusion
21 Easy and High Performance GPU Programming for Java Programmers
Performance Evaluation Methodology
 Measured performance improvement by GPU using four programs (on
next slide) over
– 1-CPU-thread sequential execution
– 160-CPU-thread parallel execution
 Experimental environment used
– IBM Java 8 Service Release 2 for PowerPC Little Endian
 Download for free at http://www.ibm.com/java/jdk/
– Two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256GB
memory (160 hardware threads in total)
 With one NVIDIA Kepler K40m GPU (2880 CUDA cores in total) at 876 MHz
with 12GB global memory (ECC off)
– Ubuntu 14.10, CUDA 5.5
22 Easy and High Performance GPU Programming for Java Programmers
Benchmark Programs
 Prepare sequential and parallel stream API versions in Java
23 Easy and High Performance GPU Programming for Java Programmers
Name Summary Data size Type
MM A dense matrix multiplication: C = A.B 1,024 × 1,024 double
SpMM A sparse matrix multiplication: C = A.B 500,000×
500,000
double
Jacobi2D Solve an equation using the Jacobi method 8,192 × 8,192 double
LifeGame Conway’s game of life. Iterate 10,000 times 512 × 512 byte
Performance Improvements of GPU Version over
Sequential and Parallel CPU Versions
 Achieve 58.9x on geomean and 317.0x for Jacobi2D over 1 CPU thread
 Achieve 3.7x on geomean and 14.8x for Jacobi2D over 160 CPU threads
 Degrade performance for SpMM against 160 CPU threads
Easy and High Performance GPU Programming for Java Programmers24
Conclusion
 Program GPUs using pure Java with standard parallel stream
APIs
 Compile a Java program without annotations for GPUs by IBM
Java 8 runtime with optimizations
– read-only cache exploitation
– data copy optimizations between CPU and GPU
– exception check eliminations
 Offer performance improvements using GPUs by
–58.9x over sequential execution
–3.7x over 160-CPU-thread parallel execution
25 Easy and High Performance GPU Programming for Java Programmers
Details are in our paper “Compiling and Optimizing Java 8 Programs for GPU Execution” (PACT2015)

Weitere ähnliche Inhalte

Was ist angesagt?

PG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU devicePG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU deviceKohei KaiGai
 
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...AMD Developer Central
 
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...AMD Developer Central
 
CUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce ClusterCUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce Clusterairbots
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrKohei KaiGai
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practicesLior Sidi
 
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...AMD Developer Central
 
SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)Kohei KaiGai
 
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~Kohei KaiGai
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosAMD Developer Central
 
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerAMD Developer Central
 
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...AMD Developer Central
 
20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStrom20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStromKohei KaiGai
 
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd MostakLeveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd MostakDatabricks
 
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)Kohei KaiGai
 
PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil HenningPL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil HenningAMD Developer Central
 
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael MantorGS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael MantorAMD Developer Central
 

Was ist angesagt? (20)

Exploiting GPUs in Spark
Exploiting GPUs in SparkExploiting GPUs in Spark
Exploiting GPUs in Spark
 
PG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU devicePG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU device
 
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
 
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
 
Hadoop + GPU
Hadoop + GPUHadoop + GPU
Hadoop + GPU
 
CUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce ClusterCUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce Cluster
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated Asyncr
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
 
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
 
SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)
 
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
 
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor MillerPL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
PL-4043, Accelerating OpenVL for Heterogeneous Platforms, by Gregor Miller
 
PG-Strom
PG-StromPG-Strom
PG-Strom
 
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
 
20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStrom20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStrom
 
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd MostakLeveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
 
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)
 
PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil HenningPL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
PL-4048, Adapting languages for parallel processing on GPUs, by Neil Henning
 
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael MantorGS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
GS-4152, AMD’s Radeon R9-290X, One Big dGPU, by Michael Mantor
 

Ähnlich wie Easy and High Performance GPU Programming for Java Programmers

PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...AMD Developer Central
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionAkihiro Hayashi
 
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Databricks
 
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Akihiro Hayashi
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarSpark Summit
 
JVM Mechanics: When Does the JVM JIT & Deoptimize?
JVM Mechanics: When Does the JVM JIT & Deoptimize?JVM Mechanics: When Does the JVM JIT & Deoptimize?
JVM Mechanics: When Does the JVM JIT & Deoptimize?Doug Hawkins
 
HKG15-300: Art's Quick Compiler: An unofficial overview
HKG15-300: Art's Quick Compiler: An unofficial overviewHKG15-300: Art's Quick Compiler: An unofficial overview
HKG15-300: Art's Quick Compiler: An unofficial overviewLinaro
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsKohei KaiGai
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to AcceleratorsDilum Bandara
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda enKohei KaiGai
 
Optimizing NN inference performance on Arm NEON and Vulkan
Optimizing NN inference performance on Arm NEON and VulkanOptimizing NN inference performance on Arm NEON and Vulkan
Optimizing NN inference performance on Arm NEON and Vulkanax inc.
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUsfcassier
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Rob Gillen
 

Ähnlich wie Easy and High Performance GPU Programming for Java Programmers (20)

PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
PL-4044, OpenACC on AMD APUs and GPUs with the PGI Accelerator Compilers, by ...
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
 
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
 
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
 
JVM Mechanics: When Does the JVM JIT & Deoptimize?
JVM Mechanics: When Does the JVM JIT & Deoptimize?JVM Mechanics: When Does the JVM JIT & Deoptimize?
JVM Mechanics: When Does the JVM JIT & Deoptimize?
 
HKG15-300: Art's Quick Compiler: An unofficial overview
HKG15-300: Art's Quick Compiler: An unofficial overviewHKG15-300: Art's Quick Compiler: An unofficial overview
HKG15-300: Art's Quick Compiler: An unofficial overview
 
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database AnalyticsPL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
PL/CUDA - Fusion of HPC Grade Power with In-Database Analytics
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
Optimizing NN inference performance on Arm NEON and Vulkan
Optimizing NN inference performance on Arm NEON and VulkanOptimizing NN inference performance on Arm NEON and Vulkan
Optimizing NN inference performance on Arm NEON and Vulkan
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Deep Learning Edge
Deep Learning Edge Deep Learning Edge
Deep Learning Edge
 
Cuda intro
Cuda introCuda intro
Cuda intro
 
Defense_Presentation
Defense_PresentationDefense_Presentation
Defense_Presentation
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
 

Mehr von Kazuaki Ishizaki

20230105_TITECH_lecture_ishizaki_public.pdf
20230105_TITECH_lecture_ishizaki_public.pdf20230105_TITECH_lecture_ishizaki_public.pdf
20230105_TITECH_lecture_ishizaki_public.pdfKazuaki Ishizaki
 
20221226_TITECH_lecture_ishizaki_public.pdf
20221226_TITECH_lecture_ishizaki_public.pdf20221226_TITECH_lecture_ishizaki_public.pdf
20221226_TITECH_lecture_ishizaki_public.pdfKazuaki Ishizaki
 
Make AI ecosystem more interoperable
Make AI ecosystem more interoperableMake AI ecosystem more interoperable
Make AI ecosystem more interoperableKazuaki Ishizaki
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkKazuaki Ishizaki
 
Introduction new features in Spark 3.0
Introduction new features in Spark 3.0Introduction new features in Spark 3.0
Introduction new features in Spark 3.0Kazuaki Ishizaki
 
SQL Performance Improvements At a Glance in Apache Spark 3.0
SQL Performance Improvements At a Glance in Apache Spark 3.0SQL Performance Improvements At a Glance in Apache Spark 3.0
SQL Performance Improvements At a Glance in Apache Spark 3.0Kazuaki Ishizaki
 
In-Memory Evolution in Apache Spark
In-Memory Evolution in Apache SparkIn-Memory Evolution in Apache Spark
In-Memory Evolution in Apache SparkKazuaki Ishizaki
 
Looking back at Spark 2.x and forward to 3.0
Looking back at Spark 2.x and forward to 3.0Looking back at Spark 2.x and forward to 3.0
Looking back at Spark 2.x and forward to 3.0Kazuaki Ishizaki
 
20180109 titech lecture_ishizaki_public
20180109 titech lecture_ishizaki_public20180109 titech lecture_ishizaki_public
20180109 titech lecture_ishizaki_publicKazuaki Ishizaki
 
20171212 titech lecture_ishizaki_public
20171212 titech lecture_ishizaki_public20171212 titech lecture_ishizaki_public
20171212 titech lecture_ishizaki_publicKazuaki Ishizaki
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and DatasetKazuaki Ishizaki
 
20160906 pplss ishizaki public
20160906 pplss ishizaki public20160906 pplss ishizaki public
20160906 pplss ishizaki publicKazuaki Ishizaki
 
20151112 kutech lecture_ishizaki_public
20151112 kutech lecture_ishizaki_public20151112 kutech lecture_ishizaki_public
20151112 kutech lecture_ishizaki_publicKazuaki Ishizaki
 
20141224 titech lecture_ishizaki_public
20141224 titech lecture_ishizaki_public20141224 titech lecture_ishizaki_public
20141224 titech lecture_ishizaki_publicKazuaki Ishizaki
 
Java Just-In-Timeコンパイラ
Java Just-In-TimeコンパイラJava Just-In-Timeコンパイラ
Java Just-In-TimeコンパイラKazuaki Ishizaki
 
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化Kazuaki Ishizaki
 

Mehr von Kazuaki Ishizaki (20)

20230105_TITECH_lecture_ishizaki_public.pdf
20230105_TITECH_lecture_ishizaki_public.pdf20230105_TITECH_lecture_ishizaki_public.pdf
20230105_TITECH_lecture_ishizaki_public.pdf
 
20221226_TITECH_lecture_ishizaki_public.pdf
20221226_TITECH_lecture_ishizaki_public.pdf20221226_TITECH_lecture_ishizaki_public.pdf
20221226_TITECH_lecture_ishizaki_public.pdf
 
Make AI ecosystem more interoperable
Make AI ecosystem more interoperableMake AI ecosystem more interoperable
Make AI ecosystem more interoperable
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Introduction new features in Spark 3.0
Introduction new features in Spark 3.0Introduction new features in Spark 3.0
Introduction new features in Spark 3.0
 
SQL Performance Improvements At a Glance in Apache Spark 3.0
SQL Performance Improvements At a Glance in Apache Spark 3.0SQL Performance Improvements At a Glance in Apache Spark 3.0
SQL Performance Improvements At a Glance in Apache Spark 3.0
 
SparkTokyo2019NovIshizaki
SparkTokyo2019NovIshizakiSparkTokyo2019NovIshizaki
SparkTokyo2019NovIshizaki
 
SparkTokyo2019
SparkTokyo2019SparkTokyo2019
SparkTokyo2019
 
In-Memory Evolution in Apache Spark
In-Memory Evolution in Apache SparkIn-Memory Evolution in Apache Spark
In-Memory Evolution in Apache Spark
 
icpe2019_ishizaki_public
icpe2019_ishizaki_publicicpe2019_ishizaki_public
icpe2019_ishizaki_public
 
hscj2019_ishizaki_public
hscj2019_ishizaki_publichscj2019_ishizaki_public
hscj2019_ishizaki_public
 
Looking back at Spark 2.x and forward to 3.0
Looking back at Spark 2.x and forward to 3.0Looking back at Spark 2.x and forward to 3.0
Looking back at Spark 2.x and forward to 3.0
 
20180109 titech lecture_ishizaki_public
20180109 titech lecture_ishizaki_public20180109 titech lecture_ishizaki_public
20180109 titech lecture_ishizaki_public
 
20171212 titech lecture_ishizaki_public
20171212 titech lecture_ishizaki_public20171212 titech lecture_ishizaki_public
20171212 titech lecture_ishizaki_public
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
 
20160906 pplss ishizaki public
20160906 pplss ishizaki public20160906 pplss ishizaki public
20160906 pplss ishizaki public
 
20151112 kutech lecture_ishizaki_public
20151112 kutech lecture_ishizaki_public20151112 kutech lecture_ishizaki_public
20151112 kutech lecture_ishizaki_public
 
20141224 titech lecture_ishizaki_public
20141224 titech lecture_ishizaki_public20141224 titech lecture_ishizaki_public
20141224 titech lecture_ishizaki_public
 
Java Just-In-Timeコンパイラ
Java Just-In-TimeコンパイラJava Just-In-Timeコンパイラ
Java Just-In-Timeコンパイラ
 
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
静的型付き言語用Just-In-Timeコンパイラの再利用による、動的型付き言語用コンパイラの実装と最適化
 

Kürzlich hochgeladen

SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfYashikaSharma391629
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 

Kürzlich hochgeladen (20)

SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 

Easy and High Performance GPU Programming for Java Programmers

  • 1. GTC 2016 Kazuaki Ishizaki (kiszk@acm.org) +, Gita Koblents -, Alon Shalev Housfater -, Jimmy Kwa -, Marcel Mitran –, Akihiro Hayashi *, Vivek Sarkar * + IBM Research – Tokyo - IBM Canada * Rice University Easy and High Performance GPU Programming for Java Programmers 1
  • 2. Java Program Runs on GPU with IBM Java 8 2 Easy and High Performance GPU Programming for Java Programmers http://www-01.ibm.com/support/docview.wss?uid=swg21696670 https://devblogs.nvidia.com/parallelforall/ next-wave-enterprise-performance-java-power-systems-nvidia-gpus/
  • 3. Java Meets GPUs 3 Easy and High Performance GPU Programming for Java Programmers
  • 4. What You Will Learn from this Talk  How to program GPUs in pure Java – using standard parallel stream APIs  How IBM Java 8 runtime executes the parallel program on GPUs – with optimizations without annotations  GPU read-only cache exploitation  data copy reductions between CPU and GPU  exception check eliminations for Java  Achieve good performance results using one K40 card with – 58.9x over 1-CPU-thread sequential execution on POWER8 – 3.7x over 160-CPU-thread parallel execution on POWER8 4 Easy and High Performance GPU Programming for Java Programmers
  • 5. Outline  Goal  Motivation  How to Write a Parallel Program in Java  Overview of IBM Java 8 Runtime  Performance Evaluation  Conclusion 5 Easy and High Performance GPU Programming for Java Programmers
  • 6. Why We Want to Use Java for GPU Programming  High productivity – Safety and flexibility – Good program portability among different machines  “write once, run anywhere” – Ease of writing a program  Hard to use CUDA and OpenCL for non-expert programmers  Many computation-intensive applications in non-HPC area – Data analytics and data science (Hadoop, Spark, etc.) – Security analysis (events in log files) – Natural language processing (messages in social network system) 6 Easy and High Performance GPU Programming for Java Programmers From https://www.flickr.com/photos/dlato/5530553658
  • 7. Programmability of CUDA vs. Java for GPUs  CUDA requires programmers to explicitly write operations for – managing device memories – copying data between CPU and GPU – expressing parallelism  Java 8 enables programmers to just focus on – expressing parallelism 7 Easy and High Performance GPU Programming for Java Programmers // code for GPU __global__ void GPU(float* d_a, float* d_b, int n) { int i = threadIdx.x; if (n <= i) return; d_b[i] = d_a[i] * 2.0; } void fooJava(float A[], float B[], int n) { // similar to for (idx = 0; i < n; i++) IntStream.range(0, N).parallel().forEach(i -> { b[i] = a[i] * 2.0; }); } void fooCUDA(N, float *A, float *B, int N) { int sizeN = N * sizeof(float); cudaMalloc(&d_A, sizeN); cudaMalloc(&d_B, sizeN); cudaMemcpy(d_A, A, sizeN, HostToDevice); GPU<<<N, 1>>>(d_A, d_B, N); cudaMemcpy(B, d_B, sizeN, DeviceToHost); cudaFree(d_B); cudaFree(d_A); }
  • 8. Safety and Flexibility in Java  Automatic memory management – No memory leak  Object-oriented  Exception checks – No unsafe memory accesses 8 Easy and High Performance GPU Programming for Java Programmers float[] a = new float[N], b = new float[N] new Par().foo(a, b, N) // unnecessary to explicitly free a[] and b[] class Par { void foo(float[] a, float[] b, int n) { // similar to for (idx = 0; i < n; i++) IntStream.range(0, N).parallel().forEach(i -> { // throw an exception if // a[] == null, b[] = null // i < 0, a.length <= i, b.length <= i b[i] = a[i] * 2.0; }); } }
  • 9. Portability among Different Hardware  How a Java program works – ‘javac’ command creates machine-independent Java bytecode – ‘java’ command launches Java runtime with Java bytecode  An interpreter executes a program by processing each Java bytecode  A just-in-time compiler generates native instructions for a target machine from Java bytecode of a hotspot method 9 Easy and High Performance GPU Programming for Java Programmers Java program (.java) Java bytecode (.class, .jar) Java runtime Target machine Interpreter just-in-time compiler> javac Seq.java > java Seq
  • 10. Outline  Goal  Motivation  How to Write a Parallel Program in Java  Overview of IBM Java 8 Runtime  Performance Evaluation  Conclusion 10 Easy and High Performance GPU Programming for Java Programmers
  • 11. How to Write a Parallel Loop in Java 8  Express parallelism by using parallel stream APIs among iterations of a lambda expression (index variable: i) 11 Easy and High Performance GPU Programming for Java Programmers IntStream.range(0, 5).parallel(). forEach(i -> { System.out.println(i);}); 0 3 2 4 1 Example Reference implementation of Java 8 can execute this on multiple CPU threads println(0) on thread 0 println(3) on thread 1 println(2) on thread 2 println(4) on thread 3 println(1) on thread 0 time
  • 12. Outline  Goal  Motivation  How to Write and Execute a Parallel Program in Java  Overview of IBM Java 8 Runtime  Performance Evaluation  Conclusion 12 Easy and High Performance GPU Programming for Java Programmers
  • 13. Portability among Different Hardware (including GPUs)  A just-in-time compiler in IBM Java 8 runtime generates native instructions – for a target machine including GPUs from Java bytecode – for GPU which exploit device-specific capabilities more easily than OpenCL 13 Easy and High Performance GPU Programming for Java Programmers Java program (.java) Java bytecode (.class, .jar) IBM Java 8 runtime Target machine Interpreter just-in-time compiler > javac Par.java > java Par for GPU IntStream.range(0, n) .parallel().forEach(i -> { ... });
  • 14. IBM Java 8 Can Execute the Code on CPU or GPU  Generate code for GPU execution from a parallel loop – GPU instructions for code in blue – CPU instructions for GPU memory manage and data copy  Execute this loop on CPU or GPU base on cost model – e.g., execute this on CPU if ‘n’ is very small 14 Easy and High Performance GPU Programming for Java Programmers class Par { void foo(float[] a, float[] b, float[] c, int n) { IntStream.range(0, n).parallel().forEach(i -> { b[i] = a[i] * 2.0; c[i] = a[i] * 3.0; }); } } Note: GPU support in current version is limited to lambdas with one-dimensional arrays and primitive types
  • 15. Optimizations for GPUs in IBM Just-In-Time Compiler  Using read-only cache – reduce # of memory transactions to a GPU global memory  Optimizing data copy between CPU and GPU – reduce amount of data copy  Eliminating redundant exception checks for Java on GPU – reduce # of instructions in GPU binary 15 Easy and High Performance GPU Programming for Java Programmers
  • 16. Using Read-Only Cache  Automatically detect a read-only array and access it thru read- only cache – read-only cache is faster than other memories in GPU 16 Easy and High Performance GPU Programming for Java Programmers float[] A = new float[N], B = new float[N], C = new float[N]; foo(A, B, C, N); void foo(float[] a, float[] b, float[] c, int n) { IntStream.range(0, n).parallel().forEach(i -> { b[i] = a[i] * 2.0; c[i] = a[i] * 3.0; }); } Equivalent to CUDA code __device__ foo(*a, *b, *c, N) b[i] = __ldg(&a[i]) * 2.0; c[i] = __ldg(&a[i]) * 3.0; }
  • 17. Optimizing Data Copy between CPU and GPU  Eliminate data copy from GPU to CPU – if an array (e.g., a[]) is not written on GPU  Eliminate data copy from CPU to GPU – if an array (e.g., b[] and c[]) is not read on GPU 17 Easy and High Performance GPU Programming for Java Programmers void foo(float[] a, float[] b, float[] c, int n) { // Data copy for a[] from CPU to GPU // No data copy for b[] and c[] IntStream.range(0, n).parallel().forEach(i -> { b[i] = a[i] * 2.0; c[i] = a[i] * 3.0; }); // Data copy for b[] and c[] from GPU to CPU // No data copy for a[] }
  • 18. Optimizing Data Copy between CPU and GPU  Eliminate data copy between CPU and GPU – if an array (e.g., a[] and b[]), which was accessed on GPU, is not accessed on CPU 18 Easy and High Performance GPU Programming for Java Programmers // Data copy for a[] from CPU to GPU for (int t = 0; t < T; t++) { IntStream.range(0, N*N).parallel().forEach(idx -> { b[idx] = a[...]; }); // No data copy for b[] between GPU and CPU IntStream.range(0, N*N).parallel().forEach(idx -> { a[idx] = b[...]; } // No data copy for a[] between GPU and CPU } // Data copy for a[] and b[] from GPU to CPU
  • 19. How to Support Exception Checks on GPUs  IBM just-in-time compiler inserts exception checks in GPU kernel 19 Easy and High Performance GPU Programming for Java Programmers // code for CPU { ... launch GPUkernel(...) if (exception) { goto handle_exception; } ... } __device__ GPUkernel(…) { int i = ...; if ((a == NULL) || i < 0 || a.length <= i) { exception = true; return; } if ((b == NULL) || b.length <= i) { exception = true; return; } b[i] = a[i] * 2.0; if ((c == NULL) || c.length <= i) { exception = true; return; } c[i] = a[i] * 3.0; } // Java program IntStream.range(0,n).parallel(). forEach(i -> { b[i] = a[i] * 2.0; c[i] = a[i] * 3.0; });
  • 20. Eliminating Redundant Exception Checks  Speculatively perform exception checks on CPU if the form of an array index is simple (xi + y) 20 Easy and High Performance GPU Programming for Java Programmers // code for CPU if ( // check conditions for null pointer a != null && b != null && c != null && // check conditions for out of bounds of array index 0 <= a.length && a.length < n && 0 <= b.length && b.length < n && 0 <= c.length && c.length < n) { ... launch GPUkernel(...) ... } else { // execute this loop on CPU to produce an exception } __device__ GPUkernel(…) { // no exception check is // required i = ...; b[i] = a[i] * 2.0; c[i] = a[i] * 3.0; } IntStream.range(0,n).parallel(). forEach(i -> { b[i] = a[i] * 2.0; c[i] = a[i] * 3.0; });
  • 21. Outline  Goal  Motivation  How to Write and Execute a Parallel Program in Java  Overview of IBM Java 8 Runtime  Performance Evaluation  Conclusion 21 Easy and High Performance GPU Programming for Java Programmers
  • 22. Performance Evaluation Methodology  Measured performance improvement by GPU using four programs (on next slide) over – 1-CPU-thread sequential execution – 160-CPU-thread parallel execution  Experimental environment used – IBM Java 8 Service Release 2 for PowerPC Little Endian  Download for free at http://www.ibm.com/java/jdk/ – Two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256GB memory (160 hardware threads in total)  With one NVIDIA Kepler K40m GPU (2880 CUDA cores in total) at 876 MHz with 12GB global memory (ECC off) – Ubuntu 14.10, CUDA 5.5 22 Easy and High Performance GPU Programming for Java Programmers
  • 23. Benchmark Programs  Prepare sequential and parallel stream API versions in Java 23 Easy and High Performance GPU Programming for Java Programmers Name Summary Data size Type MM A dense matrix multiplication: C = A.B 1,024 × 1,024 double SpMM A sparse matrix multiplication: C = A.B 500,000× 500,000 double Jacobi2D Solve an equation using the Jacobi method 8,192 × 8,192 double LifeGame Conway’s game of life. Iterate 10,000 times 512 × 512 byte
  • 24. Performance Improvements of GPU Version over Sequential and Parallel CPU Versions  Achieve 58.9x on geomean and 317.0x for Jacobi2D over 1 CPU thread  Achieve 3.7x on geomean and 14.8x for Jacobi2D over 160 CPU threads  Degrade performance for SpMM against 160 CPU threads Easy and High Performance GPU Programming for Java Programmers24
  • 25. Conclusion  Program GPUs using pure Java with standard parallel stream APIs  Compile a Java program without annotations for GPUs by IBM Java 8 runtime with optimizations – read-only cache exploitation – data copy optimizations between CPU and GPU – exception check eliminations  Offer performance improvements using GPUs by –58.9x over sequential execution –3.7x over 160-CPU-thread parallel execution 25 Easy and High Performance GPU Programming for Java Programmers Details are in our paper “Compiling and Optimizing Java 8 Programs for GPU Execution” (PACT2015)