Invited talk at 14th Asian Symposium on Programming Languages and
Systems (APLAS 2016)
Kazuaki Ishizaki
IBM Research – Tokyo
Making Hardware Accelerator Easier to Use
Hanoi in 1996
 My first visit to Hanoi
– I joined our research project “Java just-in-time compiler” in 1996
– I worked on a parallel Fortran compiler until 1995
Hanoi in 1996 and 2016
 Drastically changed over twenty years
1996 2016
What Has Happened in the Computation-Intensive Area over 20 Years
 Programs are becoming simpler
 Hardware is becoming more complicated
                  | 1996 | 2016
Hardware          | Fast scalar processors | Commodity processors with hardware accelerators
Applications      | Weather, wind, fluid, and physics simulations; chemical synthesis | Machine learning and deep learning with big data
Program           | Complicated and hardware-dependent code | Simple and clean code (e.g. MapReduce)
Users             | Limited to programmers who are well educated in HPC | Data scientists who are not familiar with hardware
Hardware examples | PowerPC | GPU
Bad news:
The gap between hardware and software is growing
Good news:
Programs can be analyzed easily
My Recent Interest
 How a system can generate hardware accelerator code from a program written with high-level abstractions
– Expected (practical) result
 People can execute a program without knowing how the hardware accelerator is used
– Challenge
 How to optimize code for a particular hardware accelerator without accelerator-specific information
– Ongoing research
 GPU exploitation from Java programs
 GPU exploitation in Apache Spark
Joint work with Akihiro Hayashi *, Alon Shalev Housfater -, Hiroshi Inoue +, Madhusudanan Kandasamy #, Gita Koblents -, Moriyoshi Ohara +, Vivek Sarkar *, and Jan Wroblewski (intern) +
+ IBM Research – Tokyo, - IBM Canada, # IBM India, * Rice University
GPU Exploitation from Java Program
Why Java for GPU Programming?
 High productivity
– Safety and flexibility (compared to C/C++)
– Good program portability among different machines
 “write once, run anywhere”
– One of the most popular programming languages
 Hard to use CUDA and OpenCL for non-expert programmers
 Many computation-intensive applications in non-HPC area
– Data analytics and data science (Hadoop, Spark, etc.)
– Security analysis
– Natural language processing
From https://www.flickr.com/photos/dlato/5530553658
CUDA is a programming language
for GPU offered by NVIDIA
How We Write GPU Program
 Five steps
1. Allocate GPU device memory
2. Copy data from CPU main memory to GPU device memory
3. Launch a GPU kernel to be executed in parallel on cores
4. Copy data back from GPU device memory to CPU main memory
5. Free GPU device memory
[Figure: a CPU (a dozen cores per socket, main memory up to 1 TB per socket) and a GPU (thousands of cores, device memory up to 16 GB); data is copied between them over PCIe or NVLink]
How We Optimize GPU Program
 Reduce data copy
– Data is copied between CPU main memory and GPU device memory over PCIe or NVLink
 Exploit faster memory
– Read-only cache (read only)
– Shared memory (SMEM)
[Figure from a GTC presentation by NVIDIA]
Less Code Makes GPU Programming Easier
 The current programming model requires programmers to explicitly write operations for
– managing device memories
– copying data between CPU and GPU
– expressing parallelism
– exploiting faster memory
 Java 8 enables programmers to focus only on
– expressing parallelism
#include <cuda_runtime.h>
// code for GPU: one thread per array element
__global__ void GPU(const float *d_A, float *d_B, int N) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (N <= i) return;
  d_B[i] = __ldg(&d_A[i]) * 2.0f; // __ldg() for read-only cache
}
// code for CPU
void fooCUDA(float *A, float *B, int N) {
  float *d_A, *d_B;
  size_t sizeN = N * sizeof(float);
  cudaMalloc(&d_A, sizeN); cudaMalloc(&d_B, sizeN);   // allocate device memory
  cudaMemcpy(d_A, A, sizeN, cudaMemcpyHostToDevice);  // copy input to GPU
  GPU<<<(N + 255) / 256, 256>>>(d_A, d_B, N);         // launch enough blocks to cover N
  cudaMemcpy(B, d_B, sizeN, cudaMemcpyDeviceToHost);  // copy result back to CPU
  cudaFree(d_B); cudaFree(d_A);                       // free device memory
}
// requires: import java.util.stream.IntStream;
void fooJava(float[] A, float[] B, int N) {
  // similar to for (int i = 0; i < N; i++)
  IntStream.range(0, N).parallel().forEach(i -> {
    B[i] = A[i] * 2.0f;
  });
}
Goal
 Build a Java just-in-time (JIT) compiler to generate high-performance GPU code from a parallel loop construct
Accomplishments
 Implementing four performance optimizations
 Offering performance evaluations on POWER8 with a GPU
 Supporting Java language features (see [PACT2015])
 Predicting performance on CPU and GPU [PPPJ2015]
 Available in IBM Java 8 ppc64le and x86_64
– https://www.ibm.com/developerworks/java/jdk/java8/
Parallel Programming in Java 8
 Express parallelism among iterations of a lambda expression (index variable: i) by using the parallel stream API
class Example {
  void foo(float[] a, float[] b, float[] c, int n) {
    java.util.stream.IntStream.range(0, n).parallel().forEach(i -> {
      b[i] = a[i] * 2.0f;
      c[i] = a[i] * 3.0f;
    });
  }
}
Note: Current version supports one-dimensional arrays with primitive types in a lambda expression
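As a point of comparison, the following self-contained sketch runs the same kind of element-wise loop both as a plain for-loop and as a parallel stream on the CPU. The class and method names are hypothetical and only illustrate the programming model; nothing here is offloaded to a GPU.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class ParallelStreamSketch {
    // Sequential reference: a plain for-loop over the index range.
    static float[] scaleSequential(float[] a, float factor) {
        float[] b = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            b[i] = a[i] * factor;
        }
        return b;
    }

    // Parallel version: each index i is an independent iteration of the lambda.
    static float[] scaleParallel(float[] a, float factor) {
        float[] b = new float[a.length];
        IntStream.range(0, a.length).parallel().forEach(i -> b[i] = a[i] * factor);
        return b;
    }

    public static void main(String[] args) {
        float[] a = {1.0f, 2.0f, 3.0f, 4.0f};
        // Both versions write disjoint elements, so the results are identical.
        System.out.println(Arrays.equals(scaleSequential(a, 2.0f), scaleParallel(a, 2.0f)));
    }
}
```

Because every iteration writes a different element of the output array, the iterations are independent, which is the property a compiler needs in order to map them to parallel threads.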
Overview of Our JIT Compiler
 Java bytecode sequence is divided into two intermediate representation (IR) parts
– Lambda expression: generate GPU code using NVIDIA tool chain (right hand side)
– Others: generate CPU code using conventional JIT compiler (left hand side)
// Parallel stream code
IntStream.range(0, n).parallel()
  .forEach(i -> { ...c[i] = a[i]...});
[Figure: compilation flow. Java bytecode enters the conventional Java JIT compiler, where parallel stream API detection splits it into IR for CPUs and IR for GPUs (the lambda body, ... c[i] = a[i] ...). Left hand side: the CPU native code generator emits the CPU binary for managing device memory, copying data, and launching the GPU binary. Right hand side (additional modules for GPU): GPU optimizations, then the GPU native code generator (by NVIDIA), which emits the NVIDIA GPU binary for the lambda expression]
Optimizations for GPU in Our JIT Compiler
 Optimizing alignment of Java arrays on GPUs
– Reduce # of memory transactions to a GPU global memory
 Using read-only cache
– Reduce # of memory transactions to a GPU global memory
 Optimizing data copy between CPU and GPU
– Reduce amount of data copy
 Eliminating redundant exception checks
– Reduce # of instructions in GPU binary
Reducing # of Memory Transactions to GPU Global Memory
 Align the starting address of an array body in GPU global memory with a memory transaction boundary
IntStream.range(0, n).parallel().forEach(i -> {
  ... = a[i]...; // a[] : float
  ...;
});
[Figure: memory layouts against 128-byte memory transaction boundaries (addresses 0, 128, 256, 384). Naive alignment strategy: the object header starts on a boundary, so a[0]-a[31] straddles a boundary and a[0:31] takes two memory transactions. Our alignment strategy: the array body starts on a boundary, so a[0]-a[31], a[32]-a[63], and a[64]-a[95] each fit in one 128-byte segment and a[0:31] takes one memory transaction]
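The effect of alignment can be checked with a little arithmetic. The sketch below counts how many 128-byte segments a byte range touches; the 16-byte object header is a hypothetical value chosen only for illustration, and the class and method names are not from the talk.

```java
final class TransactionCountSketch {
    static final int TRANSACTION_BYTES = 128; // transaction boundary size from the slide

    // Number of 128-byte segments touched by the byte range [start, start + length).
    static long transactions(long start, long length) {
        if (length <= 0) {
            return 0;
        }
        long first = start / TRANSACTION_BYTES;
        long last = (start + length - 1) / TRANSACTION_BYTES;
        return last - first + 1;
    }

    // Smallest multiple of 128 that is greater than or equal to address.
    static long alignUp(long address) {
        return (address + TRANSACTION_BYTES - 1) / TRANSACTION_BYTES * TRANSACTION_BYTES;
    }

    public static void main(String[] args) {
        long headerBytes = 16;                 // hypothetical object header size
        long warpBytes = 32L * Float.BYTES;    // a[0:31] as float = 128 bytes
        // Naive: the header sits on the boundary, so the array body starts at byte 16.
        System.out.println(transactions(headerBytes, warpBytes));
        // Aligned: the array body itself starts on the next boundary.
        System.out.println(transactions(alignUp(headerBytes), warpBytes));
    }
}
```

With these assumed sizes the naive layout needs two transactions for a[0:31] and the aligned layout needs one, matching the figure.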
Reducing # of Memory Transactions to GPU Global Memory (Read-Only Cache)
 Must keep only a read-only array in a read-only cache
– Lexically different variables (e.g. a[] and b[]) may point to the same array, which may be updated
 Perform alias analysis to identify a read-only array by
– Static analysis in JIT compiler
 identifies lexically read-only arrays and lexically written arrays
– Dynamic alias analysis in generated code
 checks whether a lexically read-only array aliases any lexically written array
 executes code with a read-only cache if not aliased
// Original code
IntStream.range(0, n).parallel().forEach(i -> {
  b[i] = a[i] * 2.0;
  c[i] = a[i] * 3.0;
});
// Generated code (pseudocode); ROa[i] denotes a load of a[i] through the read-only cache
if (!(a[] aliases with b[]) && !(a[] aliases with c[])) {
  IntStream.range(0, n).parallel().forEach(i -> {
    b[i] = ROa[i] * 2.0; // use RO cache for a[]
    c[i] = ROa[i] * 3.0; // use RO cache for a[]
  });
} else {
  // execute code w/o a read-only cache
}
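The dynamic alias check can be sketched in plain Java. For one-dimensional Java arrays, two variables alias only when they refer to the same array object, so a reference comparison is enough. The class and method names below are hypothetical, and the returned label only stands in for the choice between the two generated code versions.

```java
final class AliasCheckSketch {
    // True if the lexically read-only array is distinct from every lexically written array.
    static boolean canUseReadOnlyCache(float[] readOnly, float[]... written) {
        for (float[] w : written) {
            if (w == readOnly) {
                return false; // same array object: it may be updated, so it is not read-only
            }
        }
        return true;
    }

    // Selects which version of the generated code would run (labels are illustrative).
    static String selectVersion(float[] a, float[] b, float[] c) {
        return canUseReadOnlyCache(a, b, c)
                ? "with read-only cache"
                : "without read-only cache";
    }

    public static void main(String[] args) {
        float[] a = new float[4];
        float[] b = new float[4];
        float[] c = new float[4];
        System.out.println(selectVersion(a, b, c)); // three distinct arrays
        System.out.println(selectVersion(a, a, c)); // b aliases a
    }
}
```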
Reducing Amount of Data Copy between CPU and GPU
 Eliminate data copy from GPU if an array (e.g. a[]) is not updated in GPU binary [Jablin11][Pai12]
 Copy only a read or write set if an array index form is ‘i + constant’ (the set is contiguous)
// pseudocode; H2D = host to device, D2H = device to host
sz = (n - 0) * sizeof(float);
cudaMemcpy(d_a, &a[0], sz, H2D); // copy only a read set
// cudaMemcpy(d_b, &b[0], sz, H2D); // eliminated: b[] is only written on GPU
// cudaMemcpy(d_c, &c[0], sz, H2D); // eliminated: c[] is only written on GPU
IntStream.range(0, n).parallel().forEach(i -> {
  b[i] = a[i]...;
  c[i] = a[i]...;
});
// cudaMemcpy(&a[0], d_a, sz, D2H); // eliminated: a[] is not updated on GPU
cudaMemcpy(&b[0], d_b, sz, D2H); // copy only a write set
cudaMemcpy(&c[0], d_c, sz, D2H); // copy only a write set
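When the array index has the form i + constant, the set of accessed elements is one contiguous range, so its byte offset and length follow directly from the loop bounds. The sketch below works this out; the class, the ByteRange type, and the parameter names are hypothetical and only illustrate the arithmetic.

```java
final class CopyRangeSketch {
    // Contiguous byte range [offset, offset + length) inside an array body.
    static final class ByteRange {
        final long offset;
        final long length;

        ByteRange(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    // Range touched by a loop over i in [lo, hi) that accesses arr[i + c],
    // where each element occupies elemBytes bytes.
    static ByteRange accessedRange(int lo, int hi, int c, int elemBytes) {
        if (hi <= lo) {
            return new ByteRange(0, 0); // empty loop: nothing to copy
        }
        long firstIndex = (long) lo + c;
        long count = (long) hi - lo;
        return new ByteRange(firstIndex * elemBytes, count * elemBytes);
    }

    public static void main(String[] args) {
        // IntStream.range(0, n) with index form a[i] and n = 1000 floats
        ByteRange r = accessedRange(0, 1000, 0, Float.BYTES);
        System.out.println(r.offset + " " + r.length);
    }
}
```

For the range (0, n) and index form a[i], the length is (n - 0) * sizeof(float), which is the sz value used in the pseudocode.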
Eliminating Redundant Exception Checks
 Generate GPU code without exception checks by using
– loop versioning [Artigas00], which guarantees a safe region by using pre-condition checks on CPU
// Original code
IntStream.range(0, n).parallel().forEach(i -> {
  b[i] = a[i]...;
  c[i] = a[i]...;
});
// Generated CPU code (pseudocode)
if (
    // check cond. for NullPointerException
    a != null && b != null && c != null &&
    // check cond. for ArrayIndexOutOfBoundsException
    n <= a.length && n <= b.length && n <= c.length) {
  ...
  <<<...>>> GPUbinary(...)
  ...
} else {
  // execute this construct on CPU
  // to produce an exception
  // under the original exception semantics
}
// Generated GPU code (pseudocode)
GPU binary for {
  // safe region:
  // no exception check is required
  i = ...;
  b[i] = a[i] * 2.0;
  c[i] = a[i] * 3.0;
}
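The shape of loop versioning can be shown in plain Java. In this sketch, a pre-condition check decides between a fast version, standing in for the GPU binary that needs no per-iteration checks, and a fallback that runs the original loop so that any exception is raised as usual. The class and method names are hypothetical, and in real Java both versions are still checked by the JVM; only the control structure is illustrated.

```java
import java.util.stream.IntStream;

final class LoopVersioningSketch {
    // Pre-condition under which no iteration i in [0, n) can throw
    // NullPointerException or ArrayIndexOutOfBoundsException.
    static boolean isSafe(float[] a, float[] b, float[] c, int n) {
        return a != null && b != null && c != null
                && n <= a.length && n <= b.length && n <= c.length;
    }

    // Returns which version ran; the label is for illustration only.
    static String run(float[] a, float[] b, float[] c, int n) {
        if (isSafe(a, b, c, n)) {
            // Safe region: stand-in for the GPU binary without exception checks.
            IntStream.range(0, n).parallel().forEach(i -> {
                b[i] = a[i] * 2.0f;
                c[i] = a[i] * 3.0f;
            });
            return "fast";
        }
        // Fallback: a sequential stand-in for the original CPU code; it raises the exception.
        for (int i = 0; i < n; i++) {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
        }
        return "fallback";
    }
}
```

Calling run with three arrays of length 4 and n = 4 takes the fast version; with n = 5 the pre-condition fails and the fallback raises ArrayIndexOutOfBoundsException at i = 4, after the first four elements have been written.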
Automatically Optimized for CPU and GPU
 CPU code
– handles GPU device memory management and data copying
– checks whether optimized CPU and GPU code can be executed
 GPU code is optimized
– Using read-only cache
– Eliminating exception checks
// CPU (pseudocode)
if (a != null && b != null && c != null &&
    n <= a.length && n <= b.length && n <= c.length &&
    !(a[] aliases with b[]) && !(a[] aliases with c[])) {
  cudaMalloc(&d_a, a.length*sizeof(float)+128);
  if (b!=a) cudaMalloc(&d_b, b.length*sizeof(float)+128);
  if (c!=a && c!=b) cudaMalloc(&d_c, c.length*sizeof(float)+128);
  int sz = (n - 0) * sizeof(float), szh = sz + Jhdrsz;
  cudaMemcpy(d_a + align - Jhdrsz, a, szh, H2D);
  <<<...>>> GPU(d_a, d_b, d_c, n) // launch GPU
  cudaMemcpy(b + Jhdrsz, d_b + align, sz, D2H);
  cudaMemcpy(c + Jhdrsz, d_c + align, sz, D2H);
  cudaFree(d_a);
  if (b!=a) cudaFree(d_b);
  if (c!=a && c!=b) cudaFree(d_c);
} else {
  // execute CPU binary
}
// GPU (pseudocode)
__global__ void GPU(float *a,
    float *b, float *c, int n) {
  // no exception checks
  i = ...
  b[i] = ROa[i] * 2.0;
  c[i] = ROa[i] * 3.0;
}
Benchmark Programs
 Prepare sequential and parallel stream API versions in Java
Name | Summary | Data size | Type
Blackscholes | Financial application that calculates the price of put and call options | 4,194,304 virtual options | double
MM | A standard dense matrix multiplication: C = A.B | 1,024 x 1,024 | double
Crypt | Cryptographic application [Java Grande Benchmarks] | N = 50,000,000 | byte
Series | The first N Fourier coefficients of the function [Java Grande Benchmarks] | N = 1,000,000 | double
SpMM | Sparse matrix multiplication [Java Grande Benchmarks] | N = 500,000 | double
MRIQ | 3D image benchmark for MRI [Parboil benchmarks] | 64x64x64 | float
Gemm | Matrix multiplication: C = α.A.B + β.C [PolyBench] | 1,024 x 1,024 | int
Gesummv | Scalar, vector, and matrix multiplication [PolyBench] | 4,096 x 4,096 | int
Performance Improvements of GPU Version
Over Sequential and Parallel CPU Versions
 Achieve 127.9x on geomean and 2067.7x for Series over 1 CPU thread
 Achieve 3.3x on geomean and 32.8x for Series over 160 CPU threads
 Degrade performance for SpMM and Gesummv against 160 CPU threads
Two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256GB memory
with one NVIDIA Kepler K40m GPU at 876 MHz with 12-GB global memory (ECC off)
Ubuntu 14.10, CUDA 5.5
Modified IBM Java 8 runtime for PowerPC
Performance Comparison with Hand-Coded CUDA
 Achieve 0.83x on geomean over CUDA
 Crypt, Gemm, and Gesummv: usage of a read-only cache
 BlackScholes: usage of a larger number of CUDA threads per block (1024 vs. 128)
 SpMM: overhead of exception checks
 MRIQ: the ‘-use_fast_math’ compile option is not used
 MM: lack of usage of shared memory with loop tiling
Speedup relative to CUDA (higher is better):
BlackScholes 0.85 | MM 0.45 | Crypt 1.51 | Series 0.92 | SpMM 0.74 | MRIQ 0.11 | Gemm 1.19 | Gesummv 3.47
GPU Version is slower Than Parallel CPU Version
 Can we choose an appropriate device (CPU or GPU) to avoid
performance degradation?
– Want to make sure to achieve equal or better performance
Machine-learning-based Performance Heuristics
 Construct a binary prediction model offline by supervised
machine learning with support vector machines (SVMs)
– Features
 Loop range
 Dynamic number of instructions (memory access, arithmetic operation, …)
 Dynamic number of array accesses (a[i], a[i + c], a[c * i], a[idx[i]])
 Data transfer size (CPU to GPU, GPU to CPU)
[Figure: offline, features are extracted from the bytecode of training applications run with several data sets (App A with data1 and data2, App B with data3) and passed to LIBSVM to build the prediction model; at run time, the Java runtime uses the prediction model to choose CPU or GPU]
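To make the idea of a binary prediction model concrete, here is a minimal sketch of the decision function of a linear classifier, sign(w . x + bias). It assumes a linear kernel for simplicity; the talk does not state which kernel is used, and the feature values, weights, and names below are hypothetical rather than taken from the trained model. It does not call LIBSVM.

```java
final class DeviceSelectionSketch {
    enum Device { CPU, GPU }

    // Linear decision function: choose GPU when w . x + bias is positive.
    static Device predict(double[] weights, double bias, double[] features) {
        if (weights.length != features.length) {
            throw new IllegalArgumentException("weights and features must have the same length");
        }
        double score = bias;
        for (int i = 0; i < weights.length; i++) {
            score += weights[i] * features[i];
        }
        return score > 0 ? Device.GPU : Device.CPU;
    }

    public static void main(String[] args) {
        // Hypothetical two-feature model: arithmetic work favors the GPU,
        // data transfer size favors the CPU.
        double[] weights = {1.0, -2.0};
        System.out.println(predict(weights, 0.0, new double[]{3.0, 1.0}));
        System.out.println(predict(weights, 0.0, new double[]{1.0, 1.0}));
    }
}
```

In such a model, the features listed above (loop range, dynamic instruction counts, array access patterns, data transfer size) would form the vector x, and the weights would come from offline training.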
Most Predictions are Correct
 Use 291 cases to build the model
 Succeeded in predicting cases of performance degradation on GPU
 Failed to predict BlackScholes
[Figure: prediction result for each benchmark; annotations: 1.8->1.0, 0.8->1.0, 0.4->1.0]
Related Work
 Our research enables memory and communication
optimizations with machine-learning-based device selection
Work | Language | Exception support | JIT compiler | How to write GPU kernel | Data copy optimization | GPU memory optimization | Device selection
JCUDA | Java | × | × | CUDA | Manual | Manual | GPU only
JaBEE | Java | × | √ | Override run method | × | × | GPU only
Aparapi | Java | × | √ | Override run method/Lambda | × | × | Static
Hadoop-CL | Java | × | √ | Override map/reduce method | × | × | Static
Rootbeer | Java | × | √ | Override run method | Not described | × | Not described
[PPPJ09] | Java | √ | √ | Java for-loop | Not described | × | Dynamic with regression
HJ-OpenCL | Habanero-Java | √ | √ | Forall constructs | √ | × | Static
Our work | Java | √ | √ | Standard parallel stream API | √ | ROCache / alignment | Dynamic with machine learning
Future work
 Exploiting shared memory
– Like private memory shared by 64 - 192 cores
 Supporting additional Java operations
GPU Exploitation in Apache Spark
What is Apache Spark?
 A framework for distributed computing that transforms distributed immutable in-memory structures using a set of parallel operations
 e.g. map(), filter(), reduce(), …
– Distributed immutable in-memory structures
 RDD (Resilient Distributed Dataset), DataFrame, Dataset
– Scala is the primary language for programming on Spark
 Provides domain specific libraries
[Figure: the domain specific libraries Spark Streaming (real-time), GraphX (graph), SparkSQL (SQL), and MLlib (machine learning) sit on top of the Spark runtime (written in Java and Scala), which runs on the Java Virtual Machine. A driver sends tasks to executors and receives results; each executor holds data loaded from a data source (HDFS, DB, file, etc.)]
Open source: http://spark.apache.org/
Latest version is 2.0.2, released in November 2016
How Program Works on Apache Spark
 Parallel operations can be executed among partitions
 In a partition, data can be processed sequentially
// assumes spark.implicits._ is imported for toDS and map
case class Pt(x: Int, y: Int)
val ds1: Dataset[Pt] = sc.parallelize(Seq(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS
val ds2: Dataset[Pt] = ds1.map(p => Pt(p.x+1, p.y*2))
// sum of x over ds2 (shown as p1.x + p2.x in the figure)
val cnt: Int = ds2.map(p => p.x).reduce((x1, x2) => x1 + x2)
[Figure: ds1 holds Pt(1, 5) and Pt(2, 6) in one partition and Pt(3, 7) and Pt(4, 8) in the other. map (p.x+1, p.y*2) produces ds2 with Pt(2, 10) and Pt(3, 12) in one partition and Pt(4, 14) and Pt(5, 16) in the other. reduce adds x within each partition (2 + 3 = 5 and 4 + 5 = 9) and then across partitions, giving cnt = 14]
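The execution model, sequential processing inside a partition and parallel processing among partitions, can be imitated on a single machine with Java streams. The sketch below uses the same four points split into two partitions; the class and method names are hypothetical, and this is not Spark code.

```java
import java.util.Arrays;
import java.util.List;

final class PartitionedReduceSketch {
    static final class Pt {
        final int x;
        final int y;

        Pt(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // The map step applied to each element: Pt(x + 1, y * 2).
    static Pt mapPt(Pt p) {
        return new Pt(p.x + 1, p.y * 2);
    }

    // One partition is processed sequentially: map each point, then add up x.
    static int partialSum(List<Pt> partition) {
        int sum = 0;
        for (Pt p : partition) {
            sum += mapPt(p).x;
        }
        return sum;
    }

    // Partitions are processed in parallel and the partial sums are combined.
    static int sumX(List<List<Pt>> partitions) {
        return partitions.parallelStream()
                .mapToInt(PartitionedReduceSketch::partialSum)
                .sum();
    }

    public static void main(String[] args) {
        List<Pt> p0 = Arrays.asList(new Pt(1, 5), new Pt(2, 6));
        List<Pt> p1 = Arrays.asList(new Pt(3, 7), new Pt(4, 8));
        System.out.println(sumX(Arrays.asList(p0, p1))); // partial sums 5 and 9, total 14
    }
}
```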
How We Can Run Program Faster on GPU
 Assign many parallel computations to cores
 Make memory accesses coalesced
– Column-oriented layout results in better performance
 [Che2011] reports about 3x performance improvement of GPU kernel execution of kmeans with column-oriented layout over row-oriented layout
[Figure: Pt(x: Int, y: Int) with data Pt(1,5), Pt(2,6), Pt(3,7), Pt(4,8), assuming that 4 consecutive data elements can be coalesced by the GPU hardware. Column-oriented layout (x1 x2 x3 x4 y1 y2 y3 y4 = 1 2 3 4 5 6 7 8): the cores load four Pt.x in one access and four Pt.y in another, 2 memory accesses to GPU device memory. Row-oriented layout (x1 y1 x2 y2 x3 y3 x4 y4 = 1 5 2 6 3 7 4 8): loading Pt.x and Pt.y takes 4 memory accesses to GPU device memory]
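The 2 versus 4 count can be reproduced with a small model. The sketch below builds both layouts for the four points and counts how many groups of 4 consecutive elements are touched when every core loads one field. The class and method names are hypothetical, and the coalescing rule is the simplified assumption stated on the slide, not a model of real GPU hardware.

```java
import java.util.Arrays;

final class ColumnLayoutSketch {
    // Row-oriented: x1 y1 x2 y2 ... (xs and ys are assumed to have the same length).
    static int[] toRowOriented(int[] xs, int[] ys) {
        int[] out = new int[xs.length * 2];
        for (int i = 0; i < xs.length; i++) {
            out[2 * i] = xs[i];
            out[2 * i + 1] = ys[i];
        }
        return out;
    }

    // Column-oriented: x1 x2 ... followed by y1 y2 ...
    static int[] toColumnOriented(int[] xs, int[] ys) {
        int[] out = new int[xs.length * 2];
        System.arraycopy(xs, 0, out, 0, xs.length);
        System.arraycopy(ys, 0, out, xs.length, ys.length);
        return out;
    }

    // Accesses needed when n cores load elements at offset, offset + stride, ...
    // and `width` consecutive elements can be served by one access.
    static int coalescedAccesses(int n, int offset, int stride, int width) {
        int count = 0;
        int lastGroup = -1;
        for (int k = 0; k < n; k++) {
            int group = (offset + k * stride) / width;
            if (group != lastGroup) {
                count++;
                lastGroup = group;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] xs = {1, 2, 3, 4};
        int[] ys = {5, 6, 7, 8};
        System.out.println(Arrays.toString(toRowOriented(xs, ys)));
        System.out.println(Arrays.toString(toColumnOriented(xs, ys)));
        // Column-oriented: x at offset 0, y at offset 4, both with stride 1.
        System.out.println(coalescedAccesses(4, 0, 1, 4) + coalescedAccesses(4, 4, 1, 4));
        // Row-oriented: x at offset 0, y at offset 1, both with stride 2.
        System.out.println(coalescedAccesses(4, 0, 2, 4) + coalescedAccesses(4, 1, 2, 4));
    }
}
```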
Idea to Transparently Exploit GPUs on Apache Spark
 Generate GPU code from a set of parallel operations
– Already done in our other research
 Physically put distributed immutable in-memory structures (e.g. Dataset) in column-oriented representation
– Dataset is statically typed, but its physical layout is not specified in the program
Overview of GPU Exploitation on Apache Spark
 Efficient
– Reduce data copy overhead between CPU and GPU
– Make memory accesses efficient on GPU
 Transparent
– Map parallelism in program into GPU native code
 GPU can exploit parallelism both among partitions in Dataset and within a partition of Dataset
User’s Spark Program
case class Pt(x: Int, y: Int)
ds1 = sc.parallelize(Seq(
  Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2)
  .toDS
ds2 = ds1.map(p => Pt(p.x+1, p.y*2))
cnt = ds2.map(p => p.x).reduce((x1, x2) => x1 + x2)
[Figure: on the CPU, ds1 is kept in columnar storage, with the x values (1, 2, 3, 4) and the y values (5, 6, 7, 8) of each partition at consecutive memory addresses. The GPU manager transfers the columns to the GPU and drives the GPU native code. The GPU kernel computes +1 on x and *2 on y, and the results (x: 2, 3, 4, 5; y: 10, 12, 14, 16) are transferred back into the columnar storage of ds2]
How We Write Program and What Is Executed
 Write an operation using a lambda expression for RDD. Then, the corresponding Java bytecode for the expression is executed.
 Write a program using a relational operation for DataFrame or a lambda expression for Dataset. Catalyst performs optimization and code generation for the program. Then, the corresponding Java bytecode for the generated Java code is executed.
data = Seq(Pt(1, 5), Pt(2, 6))
// DataFrame (v1.3-)
df1 = data.toDF(…)
df2 = df1.selectExpr("x+1")
df2.agg(sum())
// Dataset (v1.6-)
ds1 = data.toDS()
ds2 = ds1.map(p => p.x+1)
ds2.reduce((a,b) => a+b)
// RDD (v0.5-)
rdd1 = sc.parallelize(data)
rdd2 = rdd1.map(p => p.x+1)
rdd2.reduce((a,b) => a+b)
Frontend API | DataFrame (v1.3-) and Dataset (v1.6-) | RDD (v0.5-)
Backend computation | Java code generated by Catalyst, compiled to Java bytecode | Scalac-generated Java bytecode
Data | Row-oriented (1 5 2 6) on Java heap | Row-oriented (1 5 2 6) on Java heap
Our Two Implementations for GPU Exploitation
 GPUEnabler is designed for a Ninja programmer who writes domain specific libraries
– Transparent exploitation by calling a method in the library
 Enhanced Catalyst is designed for a general programmer who writes applications
– Transparent exploitation by automatic code generation
Code / Columnar storage | DataFrame/Dataset | RDD
Hand-written code | | GPUEnabler
Automatic code generation | Enhanced Catalyst |
[Figure: a Spark user program runs on the Spark runtime, which is extended with columnar storage, a GPU manager, generated GPU/SIMD code, and pre-compiled GPU/SIMD code]
How Program is Executed on GPU
 For RDD, a programmer provides pre-compiled GPU code. GPUEnabler handles data transfer between GPU and CPU and launches the GPU code on GPU.
 For DataFrame and Dataset, enhanced Catalyst generates Java code optimized for GPU. A just-in-time compiler in the Java virtual machine can then generate GPU code.
data = Seq(Pt(1, 5), Pt(2, 6))
// DataFrame (v1.3-)
df1 = data.toDF(…)
df2 = df1.selectExpr("x+1")
df2.agg(sum())
// Dataset (v1.6-)
ds1 = data.toDS()
ds2 = ds1.map(p => p.x+1)
ds2.reduce((a,b) => a+b)
// RDD (v0.5-); gpu stands for the pre-compiled GPU code passed as an argument
rdd1 = sc.parallelize(data)
rdd2 = rdd1.map(p => p.x+1, gpu)
rdd2.reduce((a,b) => a+b, gpu)
Frontend API | DataFrame (v1.3-) and Dataset (v1.6-) | RDD (v0.5-)
Backend computation | Optimized Java code generated by enhanced Catalyst, then automatically generated GPU code | Pre-compiled GPU code launched by GPUEnabler
Data | Column-oriented (1 2 5 6) in GPU device memory | Column-oriented (1 2 5 6) in GPU device memory
GPUEnabler (https://github.com/IBMSparkGPU/GPUEnabler)
 Use columnar storage for RDD
 Support map & reduce operations to drive GPU code
 Pass GPU code provided by programmer
to an argument of map()/reduce()
 Implemented as a plug-in
# bin/spark-shell --class your.gpu.application yours.jar --packages com.ibm:gpu-enabler_2.10:1.0.0
// Load a kernel function from the GPU kernel binary
val ptxURL = SparkGPULR.getClass.getResource("/GpuEnablerExamples.ptx")
val mapFunction = new CUDAFunction("multiply2", Array("this"), Array("this"), ptxURL)
val reduceFunction = new CUDAFunction("sum", Array("this"), Array("this"), ptxURL)
val rdd = sc.parallelize(1 to n)
val output = rdd
  .mapExtFunc((x: Int) => x * 2, mapFunction)
  .reduceExtFunc((x: Int, y: Int) => x + y, reduceFunction)
// GPU code
__global__ void multiply2(int *inX, int *outX, long *size) {
  long ix = threadIdx.x + blockIdx.x * blockDim.x;
  if (*size <= ix) return;
  outX[ix] = inX[ix] * 2;
}
PTX is a common instruction set among different NVIDIA GPUs, defined by NVIDIA
Pseudo Java Code by Current Catalyst
 Perform optimization that merges multiple parallel operations (selectExpr() and agg(sum())) into one loop
// DataFrame program for Spark
val df1 = (-1 to 1).toDF("x")
val df2 = df1.selectExpr("x + 1")
df2.agg(sum())
// Pseudo Java code generated by Catalyst; it corresponds to selectExpr() and local sum()
int sum = 0;
while (rowIterator.hasNext()) {
  Row row = rowIterator.next(); // for df1
  int x = row.getInteger(0);
  // selectExpr(x + 1)
  int x_new = x + 1; // for df2
  sum += x_new;
}
[Figure: df1 is stored row-oriented (x = -1, 0, 1) and is read sequentially; x_new = 0, 1, 2; sum = 3]
Pseudo Java Code by Enhanced Catalyst
 Get column0 from column-oriented storage
 For-loop can be executed in a reduction manner
// Pseudo Java code generated by enhanced Catalyst
Column column0 = df1.getColumn(0); // df1
int sum = 0;
for (int i = 0; i < column0.numRows; i++) {
  int x = column0.getInteger(i);
  // selectExpr(x + 1)
  int x_new = x + 1; // for df2
  sum += x_new;
}
[Figure: df1 is stored column-oriented (x = -1, 0, 1); x_new = 0, 1, 2; sum = 3]
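The difference between the two generated loops can be tried out in plain Java. In this sketch, rows are modeled as int[] values read through an iterator and the column as a primitive int[]; these types, like the class and method names, are stand-ins chosen for illustration and are not Spark's internal Row or Column classes. The third method expresses the counted loop as a parallel map followed by a sum, which is one way to read "executed in a reduction manner".

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.stream.IntStream;

final class CatalystLoopSketch {
    // Row-oriented: rows are visited one by one through an iterator.
    static int sumRowOriented(List<int[]> rows) {
        int sum = 0;
        Iterator<int[]> rowIterator = rows.iterator();
        while (rowIterator.hasNext()) {
            int x = rowIterator.next()[0]; // column 0 of the row
            sum += x + 1;                  // selectExpr(x + 1), then local sum
        }
        return sum;
    }

    // Column-oriented: a counted loop over one primitive array.
    static int sumColumnOriented(int[] column0) {
        int sum = 0;
        for (int i = 0; i < column0.length; i++) {
            sum += column0[i] + 1;
        }
        return sum;
    }

    // The counted loop expressed as an independent map step plus a reduction.
    static int sumColumnParallel(int[] column0) {
        return IntStream.range(0, column0.length)
                .parallel()
                .map(i -> column0[i] + 1)
                .sum();
    }

    public static void main(String[] args) {
        List<int[]> rows = Arrays.asList(new int[]{-1}, new int[]{0}, new int[]{1});
        int[] column0 = {-1, 0, 1};
        System.out.println(sumRowOriented(rows));
        System.out.println(sumColumnOriented(column0));
        System.out.println(sumColumnParallel(column0));
    }
}
```

For x = -1, 0, 1 all three versions give 3. The iterator-based loop has no index to parallelize over, while the counted loop over an array has the same form as the parallel stream loops discussed earlier.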
Generate GPU Code Transparently from Spark Program
 Copy column-oriented storage into GPU
 Execute add and reduction in one GPU kernel
// DataFrame program
val df1 = (-1 to 1).toDF("x")
val df2 = df1.selectExpr("x + 1")
df2.agg(sum())
// Generated CPU code (pseudocode)
Column column0 = df1.getColumn(0);
int nRows = column0.numRows;
cudaMalloc(&d_c0, nRows*4);
cudaMemcpy(d_c0, column0, nRows*4, H2D);
int sum = 0;
cudaMalloc(&d_sum, 4);
cudaMemcpy(d_sum, &sum, 4, H2D);
<<<...>>> GPU(d_c0, d_sum, nRows) // launch GPU
cudaMemcpy(&sum, d_sum, 4, D2H);
cudaFree(d_sum); cudaFree(d_c0);
// GPU code (pseudocode)
__global__ void GPU(
    int *d_c0, int *d_sum, long size) {
  long ix = … // 0, 1, 2
  if (size <= ix) return;
  int x = d_c0[ix];
  int x_new = x + 1;
  reduction(d_sum, x_new);
}
[Figure: d_c0 holds x = -1, 0, 1; x_new = 0, 1, 2 is computed in parallel; d_sum = 3]
Many Engineering Efforts are Required
 Make DataFrame and Dataset use column-oriented storage
 Generate simpler optimized code in the while-loop
Very Complicated Java Code by Current Catalyst
 Overhead exists in Java code
– Data representation
– Data conversions
– Complicated code
// source program
val x = Array(1.0, 2.0), y = Array(3.0, 4.0)
val ds = sparkContext.parallelize(Seq(x, y), 1).toDS
ds.map(a => a)
Data conversions in the generated code:
a. sparse array to java.lang.Double[]
b. java.lang.Double[] to double[]
c. double[] to java.lang.Double[]
d. java.lang.Double[] to sparse array
Pretty Simple Java Code by Enhanced Catalyst
 Most data conversions are eliminated
 Use data representations suitable for GPU
// source program
val x = Array(1.0, 2.0), y = Array(3.0, 4.0)
val ds = sparkContext.parallelize(Seq(x, y), 1).toDS
ds.map(a => a)
Data conversions in the generated code:
a. Dense array to double[]
b. double[] to dense array
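The cost of the intermediate java.lang.Double[] conversions can be seen in a small Java sketch. It contrasts an identity operation that goes through boxed arrays with one that stays on double[]; the class and method names are hypothetical, and the code only illustrates the boxing round trip, not the code Catalyst actually generates.

```java
import java.util.Arrays;

final class ConversionSketch {
    // double[] to java.lang.Double[]: allocates one object per element.
    static Double[] box(double[] values) {
        Double[] boxed = new Double[values.length];
        for (int i = 0; i < values.length; i++) {
            boxed[i] = values[i];
        }
        return boxed;
    }

    // java.lang.Double[] to double[]: reads every object back into a primitive.
    static double[] unbox(Double[] values) {
        double[] primitive = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            primitive[i] = values[i];
        }
        return primitive;
    }

    // Identity with a boxing round trip in the middle.
    static double[] identityWithBoxing(double[] in) {
        return unbox(box(in));
    }

    // Identity that stays on the primitive representation.
    static double[] identityDirect(double[] in) {
        return in.clone();
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0};
        System.out.println(Arrays.equals(identityWithBoxing(x), identityDirect(x)));
    }
}
```

Both paths return the same values, but the boxed path creates an object for every element, and an array of references is also not a layout that can be copied to GPU device memory as one block.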
Related Work
 SparkJNI (https://github.com/tudorv91/SparkJNI)
– Call native method from map() or reduce()
– Very similar to GPUEnabler. However, it uses no columnar storage
JavaRDD<VectorBean> vectorsRdd = getSparkContext().parallelize(generateVectors(2, 4));
JavaRDD<VectorBean> mulResults = vectorsRdd.map(new VectorMulJni(libPath, "mapVectorMul"));
VectorBean results = mulResults.reduce(new VectorAddJni(libPath, "reduceVectorAdd"));
 Spark With Accelerated Tasks [Grossman2016]
– Generate GPU code from lambda function in map() in RDD
– Very similar to enhanced Catalyst in using columnar storage to transparently exploit GPUs. However, it works only for RDD with map()
val inputRDD = cl(sc.objectFile[Int]( hdfsPath ))
val doubledRDD = inputRDD.map(i => 2 * i)
 GPU Columnar (proposed by Kiran Lonikar)
– Generate GPU code from program using select() method in DataFrame
– Very similar to enhanced Catalyst in using columnar storage to transparently exploit GPUs
Conclusion
 We generated hardware accelerator code from programs with high-level abstraction
 It is not easy to do this in a systematic way
– How can we easily generate optimized code from different types of domain specific languages?
 Programs are cleaner and simpler than twenty years ago
 How can we integrate good theoretical results into practical systems?
 Can we do similar things for deep learning?
– Current deep learning frameworks use GPUs by calling libraries (e.g. cuDNN/cuRNN)
– What are future programming models for deep learning?

Weitere ähnliche Inhalte

Was ist angesagt?

SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)Kohei KaiGai
 
PG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU devicePG-Strom - A FDW module utilizing GPU device
PG-Strom - A FDW module utilizing GPU deviceKohei KaiGai
 
CUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce ClusterCUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce Clusterairbots
 
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra...Intel® Software
 
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd MostakLeveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd MostakDatabricks
 
Advanced spark deep learning
Advanced spark deep learningAdvanced spark deep learning
Advanced spark deep learningAdam Gibson
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practicesLior Sidi
 
In-Memory Evolution in Apache Spark
In-Memory Evolution in Apache SparkIn-Memory Evolution in Apache Spark
In-Memory Evolution in Apache SparkKazuaki Ishizaki
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterLinaro
 
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~Kohei KaiGai
 
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...AMD Developer Central
 
PG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrPG-Strom - GPU Accelerated Asyncr
PG-Strom - GPU Accelerated AsyncrKohei KaiGai
 
LCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLCU13: GPGPU on ARM Experience Report
LCU13: GPGPU on ARM Experience ReportLinaro
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySparkSpark Summit
 
20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_Place20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_PlaceKohei KaiGai
 
20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStrom20150318-SFPUG-Meetup-PGStrom
20150318-SFPUG-Meetup-PGStromKohei KaiGai
 
AI Hardware Landscape 2021
AI Hardware Landscape 2021AI Hardware Landscape 2021
AI Hardware Landscape 2021Grigory Sapunov
 

Making Hardware Accelerator Easier to Use

  • 1. Making Hardware Accelerator Easier to Use. Invited talk at the 14th Asian Symposium on Programming Languages and Systems (APLAS 2016). Kazuaki Ishizaki, IBM Research – Tokyo
  • 2. Hanoi in 1996
      - My first visit to Hanoi
          - I joined our research project, "Java just-in-time compiler", in 1996
          - I worked on a Parallel Fortran compiler until 1995
  • 3. Hanoi in 1996 and 2016
      - Drastically changed over twenty years
      - [Photos: 1996, 2016]
  • 4. What Has Happened in the Computation-Intensive Area over 20 Years
      - Programs are becoming simpler
      - Hardware is becoming more complicated
      - Comparison of 1996 and 2016
          - Hardware: fast scalar processors (1996); commodity processors with hardware accelerators (2016)
          - Applications: weather, wind, fluid, and physics simulations, chemical synthesis (1996); machine learning and deep learning with big data (2016)
          - Program: complicated, hardware-dependent code (1996); simple and clean code, e.g. MapReduce (2016)
          - Users: limited to programmers well trained in HPC (1996); data scientists who are unfamiliar with hardware (2016)
          - Hardware examples: PowerPC (1996); GPU (2016)
  • 5. What Has Happened in the Computation-Intensive Area over 20 Years
      - Programs are becoming simpler; hardware is becoming more complicated
      - Bad news: the gap between hardware and software is getting bigger
      - Good news: programs can be analyzed easily
  • 6. My Recent Interest
      - How a system can generate hardware accelerator code from a program written with high-level abstractions
          - Expected (practical) result: people execute a program without knowing how the hardware accelerator is used
          - Challenge: how to optimize code for a particular hardware accelerator without accelerator-specific information
          - Ongoing research: GPU exploitation from Java programs; GPU exploitation in Apache Spark
      - Joint work with Akihiro Hayashi (Rice University), Alon Shalev Housfater (IBM Canada), Hiroshi Inoue (IBM Research – Tokyo), Madhusudanan Kandasamy (IBM India), Gita Koblents (IBM Canada), Moriyoshi Ohara (IBM Research – Tokyo), Vivek Sarkar (Rice University), and Jan Wroblewski (intern, IBM Research – Tokyo)
  • 7. GPU Exploitation from Java Program
  • 8. Why Java for GPU Programming?
      - High productivity
          - Safety and flexibility (compared to C/C++)
          - Good program portability among different machines: "write once, run anywhere"
          - One of the most popular programming languages
      - CUDA and OpenCL are hard to use for non-expert programmers
      - Many computation-intensive applications in non-HPC areas
          - Data analytics and data science (Hadoop, Spark, etc.)
          - Security analysis
          - Natural language processing
      - Note: CUDA is a programming language for GPUs offered by NVIDIA
      - Image from https://www.flickr.com/photos/dlato/5530553658
  • 9. How We Write GPU Program
    Five steps:
    1. Allocate GPU device memory
    2. Copy data from CPU main memory to GPU device memory
    3. Launch a GPU kernel to be executed in parallel on the GPU cores
    4. Copy data back from GPU device memory to CPU main memory
    5. Free GPU device memory
    Figure: a CPU (a dozen cores per socket, main memory up to 1 TB per socket) and a GPU (thousands of cores, device memory up to 16 GB), with data copied between them over PCIe or NVLink.
  • 10. How We Optimize GPU Program
    Five steps:
    1. Allocate GPU device memory
    2. Copy data from CPU main memory to GPU device memory
    3. Launch a GPU kernel to be executed in parallel on the GPU cores
    4. Copy data back from GPU device memory to CPU main memory
    5. Free GPU device memory
    Optimization opportunities:
    - Reduce data copy between CPU and GPU (over PCIe or NVLink)
    - Exploit faster memory on the GPU
      - Read-only cache
      - Shared memory (SMEM)
    Figure (from a GTC presentation by NVIDIA): a CPU (a dozen cores per socket, main memory up to 1 TB per socket) and a GPU (thousands of cores, device memory up to 16 GB).
  • 11. Less Code Makes GPU Programming Easy
    - The current programming model requires programmers to explicitly write operations for
      - managing device memory
      - copying data between CPU and GPU
      - expressing parallelism
      - exploiting faster memory
    - Java 8 enables programmers to focus only on
      - expressing parallelism
    CUDA version:
      // code for GPU
      __global__ void GPU(const float *d_A, float *d_B, int N) {
        // global index; also correct for the <<<N, 1>>> launch below
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (N <= i) return;
        d_B[i] = __ldg(&d_A[i]) * 2.0f; // __ldg() for read-only cache
      }
      // code for CPU
      void fooCUDA(float *A, float *B, int N) {
        float *d_A, *d_B;
        size_t sizeN = N * sizeof(float);
        cudaMalloc(&d_A, sizeN);
        cudaMalloc(&d_B, sizeN);
        cudaMemcpy(d_A, A, sizeN, cudaMemcpyHostToDevice);
        GPU<<<N, 1>>>(d_A, d_B, N);
        cudaMemcpy(B, d_B, sizeN, cudaMemcpyDeviceToHost);
        cudaFree(d_B);
        cudaFree(d_A);
      }
    Java 8 version:
      void fooJava(float[] A, float[] B, int N) {
        // similar to for (int i = 0; i < N; i++)
        IntStream.range(0, N).parallel().forEach(i -> { B[i] = A[i] * 2.0f; });
      }
  • 12. Goal and Accomplishments
    Goal
    - Build a Java just-in-time (JIT) compiler that generates high-performance GPU code from a parallel loop construct
    Accomplishments
    - Implemented four performance optimizations
    - Evaluated performance on POWER8 with a GPU
    - Supported Java language features (see [PACT2015])
    - Predicted performance on CPU and GPU [PPPJ2015]
    - Available in IBM Java 8 for ppc64le and x86_64
      - https://www.ibm.com/developerworks/java/jdk/java8/
  • 13. Parallel Programming in Java 8
    - Express parallelism among iterations of a lambda expression (index variable: i) by using the parallel stream API
      class Example {
        void foo(float[] a, float[] b, float[] c, int n) {
          java.util.stream.IntStream.range(0, n).parallel().forEach(i -> {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
          });
        }
      }
    Note: the current version supports one-dimensional arrays of primitive types in a lambda expression
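The parallel loop construct on this slide can be tried on an ordinary JVM. The following is a minimal runnable sketch; the class name `ParallelScale` and the `main` driver are illustrative additions, not part of the talk. On a standard JVM this runs on CPU threads only; GPU offloading depends on the JIT compiler described in the talk.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class ParallelScale {
    // Writes b[i] = 2 * a[i] and c[i] = 3 * a[i] for 0 <= i < n.
    // Each iteration touches a distinct index, so iterations are independent.
    static void foo(float[] a, float[] b, float[] c, int n) {
        IntStream.range(0, n).parallel().forEach(i -> {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
        });
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f};
        float[] b = new float[a.length];
        float[] c = new float[a.length];
        foo(a, b, c, a.length);
        System.out.println(Arrays.toString(b));
        System.out.println(Arrays.toString(c));
    }
}
```

The `f` suffix on the literals matters: `a[i] * 2.0` is a `double` expression and cannot be assigned to a `float` element without a cast.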
  • 14. Overview of Our JIT Compiler
    - The Java bytecode sequence is divided into two intermediate representation (IR) parts
      - Lambda expression: GPU code is generated using the NVIDIA tool chain (right-hand side of the figure)
      - Others: CPU code is generated using the conventional JIT compiler (left-hand side of the figure)
    Figure: Java bytecode enters the conventional Java JIT compiler. Parallel stream API detection splits code such as IntStream.range(0, n).parallel().forEach(i -> { ...c[i] = a[i]... }); into IR for CPUs and IR for GPUs. The IR for CPUs goes to the CPU native code generator, which produces the CPU binary for managing device memory, copying data, and launching the GPU binary. The IR for GPUs (... c[i] = a[i] ...) goes through the GPU optimizations (additional modules for GPU) to the GPU native code generator (by NVIDIA), which produces the NVIDIA GPU binary for the lambda expression.
  • 15. Optimizations for GPU in Our JIT Compiler
    - Optimizing alignment of Java arrays on GPUs
      - Reduces the number of memory transactions to GPU global memory
    - Using the read-only cache
      - Reduces the number of memory transactions to GPU global memory
    - Optimizing data copy between CPU and GPU
      - Reduces the amount of data copied
    - Eliminating redundant exception checks
      - Reduces the number of instructions in the GPU binary
  • 16. Reducing the Number of Memory Transactions to GPU Global Memory
    - Align the starting address of an array body in GPU global memory with a 128-byte memory transaction boundary
      IntStream.range(0,n).parallel().forEach(i->{ ...= a[i]...; // a[] : float
        ...; });
    Figure (memory addresses 0, 128, 256, 384):
    - Naive alignment strategy: the object header is placed at the boundary, so a[0]-a[31] straddles a 128-byte boundary and needs two memory transactions for a[0:31]
    - Our alignment strategy: a[0] is placed at a 128-byte boundary, so a[0]-a[31], a[32]-a[63], and a[64]-a[95] each need one memory transaction
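The effect of alignment on the transaction count is plain arithmetic, and can be sketched as follows. This is an illustration under stated assumptions: the 128-byte transaction size is from the slide, while the 16-byte header size and the class and method names are hypothetical.

```java
final class Transactions {
    static final int LINE = 128; // memory transaction size in bytes (from the slide)

    // Number of 128-byte segments touched by the byte range
    // [startByte, startByte + lengthBytes), for lengthBytes > 0.
    static long count(long startByte, long lengthBytes) {
        long first = startByte / LINE;
        long last = (startByte + lengthBytes - 1) / LINE;
        return last - first + 1;
    }

    public static void main(String[] args) {
        long header = 16;       // hypothetical object header size in bytes
        long warpBytes = 32 * 4; // a[0..31] as 4-byte floats

        // Naive: header at the boundary, so a[0] starts 16 bytes past it.
        System.out.println(count(header, warpBytes)); // prints 2
        // Aligned: a[0] itself starts on a 128-byte boundary.
        System.out.println(count(128, warpBytes));    // prints 1
    }
}
```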
  • 17. Reducing the Number of Memory Transactions to GPU Global Memory (Read-Only Cache)
    - Only a read-only array may be kept in the read-only cache
      - Lexically different variables (e.g. a[] and b[]) may point to the same array, which may be updated
    - Perform alias analysis to identify a read-only array by
      - Static analysis in the JIT compiler
        - identifies lexically read-only arrays and lexically written arrays
      - Dynamic alias analysis in generated code
        - checks whether a lexically read-only array may alias with any lexically written array
        - executes code with the read-only cache if not aliased
    Original code:
      IntStream.range(0,n).parallel().forEach(i->{ b[i] = a[i] * 2.0; c[i] = a[i] * 3.0; });
    Generated code (pseudo code):
      if (!(a[] aliases with b[]) && !(a[] aliases with c[])) {
        IntStream.range(0, n).parallel().forEach( i -> {
          b[i] = ROa[i] * 2.0; // use RO cache for a[]
          c[i] = ROa[i] * 3.0; // use RO cache for a[]
        });
      } else {
        // execute code w/o a read-only cache
      }
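A sketch of what the dynamic alias check can look like at the Java level. It assumes, as the Java language guarantees, that two array references either denote the same array object or completely disjoint storage, so a reference comparison is sufficient. The class and method names are illustrative and are not taken from the compiler described in the talk.

```java
final class AliasCheck {
    // True when the lexically read-only array a is distinct from both
    // lexically written arrays b and c. Java arrays cannot partially
    // overlap, so comparing references covers every aliasing case.
    static boolean canUseReadOnlyCache(float[] a, float[] b, float[] c) {
        return a != b && a != c;
    }

    public static void main(String[] args) {
        float[] x = new float[4];
        float[] y = new float[4];
        float[] z = new float[4];
        System.out.println(canUseReadOnlyCache(x, y, z)); // prints true
        System.out.println(canUseReadOnlyCache(x, x, z)); // prints false
    }
}
```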
  • 18. Reducing the Amount of Data Copy between CPU and GPU
    - Eliminate the data copy back from the GPU if an array (e.g. a[]) is not updated in the GPU binary [Jablin11][Pai12]
    - Copy only the read set or write set if the array index has the form 'i + constant' (the set is contiguous)
    Pseudo code (cudaMemcpy arguments are destination, source, size, direction):
      sz = (n - 0) * sizeof(float);
      cudaMemcpy(d_a, &a[0], sz, H2D); // copy only a read set
      cudaMemcpy(d_b, &b[0], sz, H2D); // b[] is not in the read set, so this copy can be eliminated
      cudaMemcpy(d_c, &c[0], sz, H2D); // c[] is not in the read set, so this copy can be eliminated
      IntStream.range(0, n).parallel().forEach( i -> { b[i] = a[i]...; c[i] = a[i]...; });
      cudaMemcpy(a, d_a, sz, D2H);     // a[] is not updated, so this copy can be eliminated
      cudaMemcpy(&b[0], d_b, sz, D2H); // copy only a write set
      cudaMemcpy(&c[0], d_c, sz, D2H); // copy only a write set
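Why an index of the form 'i + constant' gives a contiguous copy region can be shown with a small calculation. The sketch below is only an illustration of that arithmetic; the class `CopyRange` and its method are hypothetical and do not reflect the compiler's internal representation.

```java
final class CopyRange {
    // For accesses a[i + c] with lo <= i < hi, returns
    // { byte offset of the first element, number of bytes to copy }.
    // The touched elements are a[lo + c] .. a[hi - 1 + c], a contiguous run.
    static long[] range(int lo, int hi, int c, int elemSize) {
        long first = (long) lo + c;
        long count = Math.max(0L, (long) hi - lo);
        return new long[] { first * elemSize, count * elemSize };
    }

    public static void main(String[] args) {
        // a[i + 1] for 0 <= i < 1000 on a float array (4-byte elements)
        long[] r = range(0, 1000, 1, 4);
        System.out.println("offset=" + r[0] + " bytes=" + r[1]); // offset=4 bytes=4000
    }
}
```

For an index such as `a[idx[i]]` no such closed form exists, which is why the slide restricts the optimization to the 'i + constant' case.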
  • 19. Eliminating Redundant Exception Checks
    - Generate GPU code without exception checks by using
      - loop versioning [Artigas00], which guarantees a safe region by using precondition checks on the CPU
    Original code:
      IntStream.range(0,n).parallel().forEach(i->{ b[i] = a[i]...; c[i] = a[i]...; });
    Generated CPU code (pseudo code):
      if ( // check conditions for NullPointerException
          a != null && b != null && c != null &&
          // check conditions for ArrayIndexOutOfBoundsException
          n <= a.length && n <= b.length && n <= c.length) {
        ... GPUbinary<<<...>>>(...) ...
      } else {
        // execute this construct on CPU to produce an exception
        // under the original exception semantics
      }
    GPU binary (pseudo code):
      { // safe region: no exception check is required
        i = ...;
        b[i] = a[i] * 2.0;
        c[i] = a[i] * 3.0;
      }
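Loop versioning can be sketched in plain Java as a guarded fast path plus a fallback. This is an illustration only: on a standard JVM both paths still run on the CPU with the JVM's own checks, and the names `isSafeRegion`, `fastPath`, and `slowPath` are hypothetical. The point is the shape of the precondition, which must imply that every index in 0..n-1 is in bounds.

```java
import java.util.stream.IntStream;

final class LoopVersioning {
    // Precondition under which no NullPointerException or
    // ArrayIndexOutOfBoundsException can occur for 0 <= i < n.
    static boolean isSafeRegion(float[] a, float[] b, float[] c, int n) {
        return a != null && b != null && c != null
                && n <= a.length && n <= b.length && n <= c.length;
    }

    static void foo(float[] a, float[] b, float[] c, int n) {
        if (isSafeRegion(a, b, c, n)) {
            fastPath(a, b, c, n); // stands in for the check-free GPU binary
        } else {
            slowPath(a, b, c, n); // original semantics, raises the exception
        }
    }

    private static void fastPath(float[] a, float[] b, float[] c, int n) {
        IntStream.range(0, n).parallel().forEach(i -> {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
        });
    }

    private static void slowPath(float[] a, float[] b, float[] c, int n) {
        for (int i = 0; i < n; i++) {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
        }
    }
}
```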
  • 20. Automatically Optimized for CPU and GPU
    - CPU code
      - handles GPU device memory management and data copying
      - checks whether the optimized CPU and GPU code can be executed
    - GPU code is optimized by
      - using the read-only cache
      - eliminating exception checks
    CPU code (pseudo code):
      if (a != null && b != null && c != null &&
          n <= a.length && n <= b.length && n <= c.length &&
          !(a[] aliases with b[]) && !(a[] aliases with c[])) {
        cudaMalloc(&d_a, a.length*sizeof(float)+128);
        if (b!=a) cudaMalloc(&d_b, b.length*sizeof(float)+128);
        if (c!=a && c!=b) cudaMalloc(&d_c, c.length*sizeof(float)+128);
        int sz = (n - 0) * sizeof(float), szh = sz + Jhdrsz;
        cudaMemcpy(d_a + align - Jhdrsz, a, szh, H2D);
        GPU<<<...>>>(d_a, d_b, d_c, n); // launch GPU
        cudaMemcpy(b + Jhdrsz, d_b + align, sz, D2H);
        cudaMemcpy(c + Jhdrsz, d_c + align, sz, D2H);
        cudaFree(d_a);
        if (b!=a) cudaFree(d_b);
        if (c!=a && c!=b) cudaFree(d_c);
      } else {
        // execute CPU binary
      }
    GPU code (pseudo code):
      __global__ void GPU(float *a, float *b, float *c, int n) {
        // no exception checks
        i = ...
        b[i] = ROa[i] * 2.0;
        c[i] = ROa[i] * 3.0;
      }
  • 21. Benchmark Programs
    - Prepare sequential and parallel stream API versions in Java
    Name | Summary | Data size | Type
    Blackscholes | Financial application that calculates the price of put and call options | 4,194,304 virtual options | double
    MM | A standard dense matrix multiplication: C = A.B | 1,024 x 1,024 | double
    Crypt | Cryptographic application [Java Grande Benchmarks] | N = 50,000,000 | byte
    Series | The first N Fourier coefficients of the function [Java Grande Benchmarks] | N = 1,000,000 | double
    SpMM | Sparse matrix multiplication [Java Grande Benchmarks] | N = 500,000 | double
    MRIQ | 3D image benchmark for MRI [Parboil benchmarks] | 64x64x64 | float
    Gemm | Matrix multiplication: C = α.A.B + β.C [PolyBench] | 1,024 x 1,024 | int
    Gesummv | Scalar, vector, and matrix multiplication [PolyBench] | 4,096 x 4,096 | int
  • 22. Performance Improvements of GPU Version over Sequential and Parallel CPU Versions
    - Achieves 127.9x on geomean, and 2067.7x for Series, over 1 CPU thread
    - Achieves 3.3x on geomean, and 32.8x for Series, over 160 CPU threads
    - Degrades performance for SpMM and Gesummv against 160 CPU threads
    Environment: two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256 GB memory; one NVIDIA Kepler K40m GPU at 876 MHz with 12 GB global memory (ECC off); Ubuntu 14.10; CUDA 5.5; modified IBM Java 8 runtime for PowerPC
  • 23. Performance Comparison with Hand-Coded CUDA
    - Achieves 0.83x on geomean over CUDA
    - Crypt, Gemm, and Gesummv: usage of the read-only cache
    - BlackScholes: usage of more CUDA threads per block (1024 vs. 128)
    - SpMM: overhead of exception checks
    - MRIQ: missing '-use-fast-math' compile option
    - MM: lack of usage of shared memory with loop tiling
    Chart (speedup relative to CUDA, higher is better): BlackScholes 0.85, MM 0.45, Crypt 1.51, Series 0.92, SpMM 0.74, MRIQ 0.11, Gemm 1.19, Gesummv 3.47
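For readers unfamiliar with the geomean figure, the following sketch shows how a geometric mean of speedups is computed. The input values are the rounded numbers readable from the chart on this slide, so the result only approximates the reported 0.83x; the class name `GeoMean` is illustrative.

```java
final class GeoMean {
    // Geometric mean: exp of the arithmetic mean of the logarithms.
    // All inputs must be positive.
    static double geomean(double[] xs) {
        double logSum = 0.0;
        for (double x : xs) {
            logSum += Math.log(x);
        }
        return Math.exp(logSum / xs.length);
    }

    public static void main(String[] args) {
        // Rounded speedups relative to CUDA, as plotted on the slide.
        double[] speedups = {0.85, 0.45, 1.51, 0.92, 0.74, 0.11, 1.19, 3.47};
        System.out.printf("geomean = %.2f%n", geomean(speedups));
    }
}
```

The geometric mean is the usual choice for averaging ratios because a 2x speedup and a 0.5x slowdown cancel out to 1.0x, which an arithmetic mean would not give.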
  • 24. GPU Version Is Slower Than Parallel CPU Version
    - Can we choose an appropriate device (CPU or GPU) to avoid performance degradation?
      - We want to make sure to achieve equal or better performance
  • 25. Machine-Learning-Based Performance Heuristics
    - Construct a binary prediction model (CPU or GPU) offline by supervised machine learning with support vector machines (SVMs)
      - Features
        - Loop range
        - Dynamic number of instructions (memory access, arithmetic operation, ...)
        - Dynamic number of array accesses (a[i], a[i + c], a[c * i], a[idx[i]])
        - Data transfer size (CPU to GPU, GPU to CPU)
    Figure: in the Java runtime, features are extracted from the bytecode of each training run (App A with data1, App A with data2, App B with data3) and fed to LIBSVM, which builds the prediction model that selects CPU or GPU.
  • 26. Most Predictions Are Correct
    - Used 291 cases to build the model
    - Succeeded in predicting the cases of performance degradation on GPU
    - Failed to predict BlackScholes
    Chart annotations (per-benchmark prediction marks not recoverable): 1.8->1.0, 0.8->1.0, 0.4->1.0
  • 27. Related Work
    - Our research enables memory and communication optimizations with machine-learning-based device selection
    Work | Language | Exception support | JIT compiler | How to write GPU kernel | Data copy optimization | GPU memory optimization | Device selection
    JCUDA | Java | × | × | CUDA | Manual | Manual | GPU only
    JaBEE | Java | × | √ | Override run method | × | × | GPU only
    Aparapi | Java | × | √ | Override run method/Lambda | × | × | Static
    Hadoop-CL | Java | × | √ | Override map/reduce method | × | × | Static
    Rootbeer | Java | × | √ | Override run method | Not described | × | Not described
    [PPPJ09] | Java | √ | √ | Java for-loop | Not described | × | Dynamic with regression
    HJ-OpenCL | Habanero-Java | √ | √ | Forall constructs | √ | × | Static
    Our work | Java | √ | √ | Standard parallel stream API | √ | ROCache / alignment | Dynamic with machine learning
  • 28. Future Work
    - Exploiting shared memory
      - It behaves like private memory shared by 64 - 192 cores
    - Supporting additional Java operations
  • 29. GPU Exploitation in Apache Spark
  • 30. What is Apache Spark?
    - A framework for distributed computing that transforms distributed immutable in-memory structures using a set of parallel operations
      - e.g. map(), filter(), reduce(), ...
      - Distributed immutable in-memory structures: RDD (Resilient Distributed Dataset), DataFrame, Dataset
      - Scala is the primary language for programming on Spark
    - Provides domain-specific libraries: Spark Streaming (real-time), GraphX (graph), SparkSQL (SQL), MLlib (machine learning)
    Figure: the libraries sit on the Spark runtime (written in Java and Scala), which runs on the Java Virtual Machine. A driver sends tasks to executors and receives results; each executor holds data loaded from a data source (HDFS, DB, file, etc.).
    Open source: http://spark.apache.org/ (latest version is 2.0.2, released in 2016/11)
  • 31. How a Program Works on Apache Spark
    - Parallel operations can be executed among partitions
    - Within a partition, data can be processed sequentially
      case class Pt(x: Int, y: Int)
      val ds1: Dataset[Pt] = sc.parallelize(Seq(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS
      val ds2: Dataset[Pt] = ds1.map(p => Pt(p.x+1, p.y*2))
      val cnt: Int = ds2.map(p => p.x).reduce((x1, x2) => x1 + x2)
    Figure: ds1 has two partitions, {Pt(1,5), Pt(2,6)} and {Pt(3,7), Pt(4,8)}. Applying p.x+1 and p.y*2 gives ds2 with partitions {Pt(2,10), Pt(3,12)} and {Pt(4,14), Pt(5,16)}. The reduction sums x within each partition (2 + 3 = 5 and 4 + 5 = 9) and then across partitions, giving cnt = 14.
  • 32. How We Can Run a Program Faster on GPU
    - Assign many parallel computations to cores
    - Make memory accesses coalesced
      - A column-oriented layout results in better performance
      - [Che2011] reports about a 3x performance improvement in GPU kernel execution of kmeans with a column-oriented layout over a row-oriented layout
    Figure, for Pt(x: Int, y: Int) with data Pt(1,5), Pt(2,6), Pt(3,7), Pt(4,8), assuming that 4 consecutive data elements can be coalesced by the GPU hardware:
    - Column-oriented layout (x1 x2 x3 x4 y1 y2 y3 y4 = 1 2 3 4 5 6 7 8): the cores load four Pt.x in one access and four Pt.y in another, so 2 memory accesses to GPU device memory
    - Row-oriented layout (x1 y1 x2 y2 x3 y3 x4 y4 = 1 5 2 6 3 7 4 8): loads of Pt.x and Pt.y are interleaved, so 4 memory accesses to GPU device memory
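The difference between the two layouts can be made concrete with a small conversion from an array of objects (row-oriented) to one primitive array per field (column-oriented). This is a sketch for illustration; the `Pt` class mirrors the slide's case class, and `Layout.toColumns` is a hypothetical helper, not Spark or GPUEnabler code.

```java
import java.util.Arrays;

final class Layout {
    // Row-oriented element: x and y of one point are stored together.
    static final class Pt {
        final int x;
        final int y;
        Pt(int x, int y) { this.x = x; this.y = y; }
    }

    // Converts rows into a column-oriented layout: result[0] holds all x
    // values contiguously and result[1] holds all y values contiguously,
    // so neighboring threads reading x[i], x[i+1], ... touch adjacent memory.
    static int[][] toColumns(Pt[] rows) {
        int[] xs = new int[rows.length];
        int[] ys = new int[rows.length];
        for (int i = 0; i < rows.length; i++) {
            xs[i] = rows[i].x;
            ys[i] = rows[i].y;
        }
        return new int[][] { xs, ys };
    }

    public static void main(String[] args) {
        Pt[] rows = { new Pt(1, 5), new Pt(2, 6), new Pt(3, 7), new Pt(4, 8) };
        int[][] cols = toColumns(rows);
        System.out.println(Arrays.toString(cols[0])); // [1, 2, 3, 4]
        System.out.println(Arrays.toString(cols[1])); // [5, 6, 7, 8]
    }
}
```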
  • 33. Idea to Transparently Exploit GPUs on Apache Spark
    - Generate GPU code from a set of parallel operations
      - Already achieved in our other research
    - Physically store distributed immutable in-memory structures (e.g. Dataset) in a column-oriented representation
      - A Dataset is statically typed, but its physical layout is not specified in the program
  • 34. Overview of GPU Exploitation on Apache Spark
    User's Spark program:
      case class Pt(x: Int, y: Int)
      ds1 = sc.parallelize(Seq(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS
      ds2 = ds1.map(p => Pt(p.x+1, p.y*2))
      cnt = ds2.map(p => p.x).reduce((x1, x2) => x1 + x2)
    Figure: for each partition, the x and y columns of ds1 (x = 1 2 | 3 4, y = 5 6 | 7 8) are transferred from the CPU to the GPU, the GPU kernel (native code) computes x+1 and y*2, and the resulting columns of ds2 (x = 2 3 | 4 5, y = 10 12 | 14 16) are transferred back.
  • 35. Overview of GPU Exploitation on Apache Spark
    - Efficient
      - Reduce data copy overhead between CPU and GPU
      - Make memory accesses efficient on GPU
    - Transparent
      - Map parallelism in the program into GPU native code
    User's Spark program:
      case class Pt(x: Int, y: Int)
      ds1 = sc.parallelize(Seq(Pt(1, 5), Pt(2, 6), Pt(3, 7), Pt(4, 8)), 2).toDS
      ds2 = ds1.map(p => Pt(p.x+1, p.y*2))
      cnt = ds2.map(p => p.x).reduce((x1, x2) => x1 + x2)
    Figure: the program drives GPU native code. A GPU manager transfers the x and y columns of ds1, held in columnar storage (laid out by memory address), from the CPU to the GPU; the GPU kernel computes x+1 and y*2; the columns of ds2 are transferred back. The GPU can exploit parallelism both among partitions of a Dataset and within a partition of a Dataset.
  • 36. How We Write a Program and What Is Executed
    - For RDD, write an operation using a lambda expression. The corresponding Java bytecode for the expression is then executed.
    - For DataFrame, write a program using relational operations; for Dataset, use lambda expressions. Catalyst performs optimization and code generation for the program. The Java bytecode for the generated Java code is then executed.
    Data: data = Seq(Pt(1, 5), Pt(2, 6))
    Frontend API | Program | Backend computation | Data layout
    RDD (v0.5-) | rdd1 = sc.parallelize(data); rdd2 = rdd1.map(p => p.x+1); rdd2.reduce((a,b) => a+b) | Scalac-generated Java bytecode | Row-oriented on Java heap
    DataFrame (v1.3-) | df1 = data.toDF(…); df2 = df1.selectExpr("x+1"); df2.agg(sum()) | Java code generated by Catalyst, then Catalyst-generated Java bytecode | Row-oriented on Java heap
    Dataset (v1.6-) | ds1 = data.toDS(); ds2 = ds1.map(p => p.x+1); ds2.reduce((a,b) => a+b) | Java code generated by Catalyst, then Catalyst-generated Java bytecode | Row-oriented on Java heap
  • 37. Our Two Implementations for GPU Exploitation
    - GPUEnabler is designed for expert ("Ninja") programmers who write domain-specific libraries
      - Transparent exploitation by calling a method in the library
    - Enhanced Catalyst is designed for general programmers who write applications
      - Transparent exploitation by automatic code generation
    Figure: for RDD, GPUEnabler runs hand-written, pre-compiled GPU/SIMD code. For DataFrame/Dataset, enhanced Catalyst runs automatically generated GPU/SIMD code. Both sit between the Spark user program and the Spark runtime, and both use columnar storage and a GPU manager.
  • 38. How a Program Is Executed on GPU
    - For RDD, a programmer provides pre-compiled GPU code. GPUEnabler handles data transfer between GPU and CPU and launches the GPU code on the GPU.
    - For DataFrame and Dataset, enhanced Catalyst generates Java code optimized for GPU. A just-in-time compiler in the Java virtual machine can then generate GPU code.
    Data: data = Seq(Pt(1, 5), Pt(2, 6))
    Frontend API | Program | Backend computation | Data layout
    RDD (v0.5-) | rdd1 = sc.parallelize(data); rdd2 = rdd1.map(p => p.x+1, gpu); rdd2.reduce((a,b) => a+b, gpu) | GPUEnabler with pre-compiled GPU code | Column-oriented in GPU device memory
    DataFrame (v1.3-) | df1 = data.toDF(…); df2 = df1.selectExpr("x+1"); df2.agg(sum()) | Enhanced Catalyst: optimized Java code, then automatically generated GPU code | Column-oriented in GPU device memory
    Dataset (v1.6-) | ds1 = data.toDS(); ds2 = ds1.map(p => p.x+1); ds2.reduce((a,b) => a+b) | Enhanced Catalyst: optimized Java code, then automatically generated GPU code | Column-oriented in GPU device memory
  • 39. GPUEnabler (https://github.com/IBMSparkGPU/GPUEnabler)
    - Uses columnar storage for RDD
    - Supports map and reduce operations to drive GPU code
    - GPU code provided by the programmer is passed as an argument of map()/reduce()
    - Implemented as a plug-in
      # bin/spark-shell --class your.gpu.application yours.jar --packages com.ibm:gpu-enabler_2.10:1.0.0
    Spark program:
      // Load a kernel function from the GPU kernel binary
      val ptxURL = SparkGPULR.getClass.getResource("/GpuEnablerExamples.ptx")
      val mapFunction = new CUDAFunction("multiply2", Array("this"), Array("this"), ptxURL)
      val reduceFunction = new CUDAFunction("sum", Array("this"), Array("this"), ptxURL)
      val rdd = sc.parallelize(1 to n)
      val output = rdd
        .mapExtFunc((x: Int) => x * 2, mapFunction)
        .reduceExtFunc((x: Int, y: Int) => x + y, reduceFunction)
    GPU code:
      __global__ void multiply2(int *inX, int *outX, long *size) {
        long ix = threadIdx.x + blockIdx.x * blockDim.x;
        if (*size <= ix) return;
        outX[ix] = inX[ix] * 2;
      }
    Note: PTX is an instruction set defined by NVIDIA that is common among different NVIDIA GPUs
  • 40. Pseudo Java Code by Current Catalyst
    - Performs an optimization that merges multiple parallel operations (selectExpr() and agg(sum())) into one loop
    DataFrame program for Spark:
      val df1 = (-1 to 1).toDF("x")
      val df2 = df1.selectExpr("x + 1")
      df2.agg(sum())
    Pseudo Java code generated by Catalyst (corresponds to selectExpr() and local sum()):
      int sum = 0;
      while (rowIterator.hasNext()) {
        Row row = rowIterator.next(); // for df1
        int x = row.getInteger(0);
        // selectExpr(x + 1)
        int x_new = x + 1; // for df2
        sum += x_new;
      }
    Figure: the rows of df1 (x = -1, 0, 1) in row-oriented storage are read sequentially, giving x_new = 0, 1, 2 and sum = 3.
  • 41. Pseudo Java Code by Enhanced Catalyst
    - Gets column0 from column-oriented storage
    - The for-loop can be executed in a reduction manner
    Pseudo Java code generated by enhanced Catalyst:
      Column column0 = df1.getColumn(0); // df1
      int sum = 0;
      for (int i = 0; i < column0.numRows; i++) {
        int x = column0.getInteger(i);
        // selectExpr(x + 1)
        int x_new = x + 1; // for df2
        sum += x_new;
      }
    Figure: column x of df1 (-1, 0, 1) in column-oriented storage gives x_new = 0, 1, 2 and sum = 3.
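The fused loop over a column can be made runnable with a stand-in column type. In the sketch below, `IntColumn` is a hypothetical minimal replacement for the `Column` class in the pseudo code (its real interface is not shown in the talk), and `sumParallel` illustrates what "executed in a reduction manner" means, using a standard parallel stream rather than a GPU.

```java
import java.util.stream.IntStream;

final class ColumnSum {
    // Hypothetical stand-in for a column in column-oriented storage.
    static final class IntColumn {
        private final int[] values;
        final int numRows;
        IntColumn(int... values) {
            this.values = values.clone();
            this.numRows = values.length;
        }
        int getInteger(int i) { return values[i]; }
    }

    // Sequential fused loop: selectExpr("x + 1") followed by sum().
    static int sumSequential(IntColumn column0) {
        int sum = 0;
        for (int i = 0; i < column0.numRows; i++) {
            int xNew = column0.getInteger(i) + 1;
            sum += xNew;
        }
        return sum;
    }

    // Same computation as a parallel map followed by a reduction.
    static int sumParallel(IntColumn column0) {
        return IntStream.range(0, column0.numRows)
                .parallel()
                .map(i -> column0.getInteger(i) + 1)
                .sum();
    }

    public static void main(String[] args) {
        IntColumn x = new IntColumn(-1, 0, 1);
        System.out.println(sumSequential(x)); // prints 3
        System.out.println(sumParallel(x));   // prints 3
    }
}
```

Because integer addition is associative, the loop body has no order dependence beyond the accumulation, which is what allows the reduction to be split across threads.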
  • 42. Generate GPU Code Transparently from Spark Program
    - Copy column-oriented storage into the GPU
    - Execute the add and the reduction in one GPU kernel
    Spark program:
      val df1 = (-1 to 1).toDF("x")
      val df2 = df1.selectExpr("x + 1")
      df2.agg(sum())
    Generated CPU code (pseudo code; cudaMemcpy arguments are destination, source, size in bytes, direction):
      Column column0 = df1.getColumn(0);
      int nRows = column0.numRows;
      cudaMalloc(&d_c0, nRows*4);
      cudaMemcpy(d_c0, column0, nRows*4, H2D);
      int sum = 0;
      cudaMalloc(&d_sum, 4);
      cudaMemcpy(d_sum, &sum, 4, H2D);
      GPU<<<...>>>(d_c0, d_sum, nRows); // launch GPU
      cudaMemcpy(&sum, d_sum, 4, D2H);
      cudaFree(d_sum);
      cudaFree(d_c0);
    GPU code (pseudo code):
      __global__ void GPU(int *d_c0, int *d_sum, long size) {
        long ix = … // 0, 1, 2
        if (size <= ix) return;
        int x = d_c0[ix];
        int x_new = x + 1;
        reduction(d_sum, x_new);
      }
    Figure: d_c0 holds x = -1, 0, 1; the threads compute x_new = 0, 1, 2 in parallel and reduce them into d_sum = 3.
  • 43. Many Engineering Efforts Are Required
    - Make DataFrame and Dataset use column-oriented storage
    - Generate simpler, optimized code in the while-loop
  • 44. Very Complicated Java Code by Current Catalyst
    - Overhead exists in the Java code
      - Data representation
      - Data conversions
      - Complicated code
    Source program:
      val x = Array(1.0, 2.0), y = Array(3.0, 4.0)
      val ds = sparkContext.parallelize(Seq(x, y), 1).toDS
      ds.map(a => a)
    Data conversions in the generated code:
      a. sparse array to java.lang.Double[]
      b. java.lang.Double[] to double[]
      c. double[] to java.lang.Double[]
      d. java.lang.Double[] to sparse array
  • 45. Pretty Simple Java Code by Enhanced Catalyst
    - Most data conversions are eliminated
    - Data representations suitable for GPU are used
    Source program:
      val x = Array(1.0, 2.0), y = Array(3.0, 4.0)
      val ds = sparkContext.parallelize(Seq(x, y), 1).toDS
      ds.map(a => a)
    Remaining data conversions in the generated code:
      - dense array to double[]
      - double[] to dense array
  • 46. Related Work
    - SparkJNI (https://github.com/tudorv91/SparkJNI)
      - Calls a native method from map() or reduce()
      - Very similar to GPUEnabler, but uses no columnar storage
        JavaRDD<VectorBean> vectorsRdd = getSparkContext().parallelize(generateVectors(2, 4));
        JavaRDD<VectorBean> mulResults = vectorsRdd.map(new VectorMulJni(libPath, "mapVectorMul"));
        VectorBean results = mulResults.reduce(new VectorAddJni(libPath, "reduceVectorAdd"));
    - Spark With Accelerated Tasks [Grossman2016]
      - Generates GPU code from a lambda function in map() on an RDD
      - Very similar to enhanced Catalyst in using columnar storage to transparently exploit GPUs, but works only for RDD with map()
        val inputRDD = cl(sc.objectFile[Int]( hdfsPath ))
        val doubledRDD = inputRDD.map(i => 2 * i)
    - GPU Columnar (proposed by Kiran Lonikar)
      - Generates GPU code from a program using the select() method in DataFrame
      - Very similar to enhanced Catalyst in using columnar storage to transparently exploit GPUs
  • 47. Conclusion
    - We generated hardware accelerator code from programs with high-level abstractions
    - It is not easy to do this in a systematic way
      - How can we easily generate optimized code from different types of domain-specific languages?
        - Programs are cleaner and simpler than twenty years ago
      - How can we integrate good results in theory into practical systems?
    - Can we do similar things for deep learning?
      - Current deep learning frameworks use GPUs by calling libraries (e.g. cuDNN/cuRNN)
      - What are the future programming models for deep learning?