Massive parallelism with GPUs in Java 8
Adam Roberts, IBM Runtimes, Hursley, UK

● Sharing observations and our progress
● How to get started
● The good and the bad
● Interesting related projects
● Plenty of code to show you
● Tips for avoiding common problems
Important disclaimers
Copyright © 2017 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form
without written permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy
as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this
information. THIS document is distributed "AS IS" without any warranty, either express or implied. In no event shall IBM be liable for any damage arising
from the use of this information, including but not limited to, loss of data, business interruption, loss of profit or loss of opportunity. IBM products and
services are warranted according to the terms and conditions of the agreements under which they are provided.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM.
All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or
advice to any individual participant or their specific situation.
It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.
Information within this presentation is accurate to the best of the author's knowledge as of the 17th of March 2017.
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or
other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM
products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or
the ability of any such third-party products to interoperate with IBM’s products. IBM expressly disclaims all warranties, expressed or
implied, including but not limited to, the implied warranties of merchantability and fitness for a particular purpose.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents,
copyrights, trademarks or other intellectual property right.
IBM, the IBM logo, ibm.com, Bluemix, Blueworks Live, CICS, Clearcase, DOORS®, Enterprise Document Management System™,
Global Business Services ®, Global Technology Services ®, Information on Demand, ILOG, LinuxONE™, Maximo®,
MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™,
PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®,
Rhapsody®, SoDA, SPSS, StoredIQ, Tivoli®, Trusteer®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z®
Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other
product and service names might be trademarks of IBM or other companies.
Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners; other names mentioned here include AMD, Nvidia, Tensorflow, Aparapi, Jcuda, cuDNN, cuBLAS, Project Sumatra, OpenJDK, CERN, Geant, AlphaGo, Oak Ridge, Titan, Lenovo, Tesla, Netflix, Rice University, Devoxx, DeepLearning4j. A current list
of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
Databricks is a registered trademark of Databricks, Inc.
Apache Spark, Apache Cassandra, Apache Hadoop, Apache Maven, Spark, Apache, any other Apache project mentioned here
and the Apache product logos including the Spark logo are trademarks of The Apache Software Foundation.
1) No liability is accepted for any of the code I'll be sharing today and providing with this presentation at the end – use it at your own risk! My sample code isn't for production use – I've skimped on plenty of application hardening techniques (checking error codes, using final, correct visibility modifiers, etc.)
2) Experiments with lots of threads mean potentially lots of problems: -really- don't run the upcoming parallel example with 50,000+ CPU threads – and if we make it to the end, don't run the “kernelception” program either
3) Messing around with graphics drivers on your work laptop isn't a good idea unless you have a good backup in place (my laptop was a headless server for a few days); I changed BIOS settings and made a few mistakes along the way – you have been warned!
What I won't cover
✗ An in-depth look at alternatives – I'll mainly be talking about Nvidia's CUDA with IBM's SDK for Java
✗ In-depth debugging and profiling
✗ Really impressive applications – I'll be talking about how to get started to give you ideas; GPUs may be a useful fit for that simple processing task you run on large amounts of data
✗ Java basics – I'm assuming you know about Java options, building and running, and you're now interested in doing lots of operations at once as fast as possible
How popular is Java?
(slide shows logos including z13 and BigInsights)
How many threads can we run at once in Java?
A simple Java-only example, inspired by a Stack Overflow post titled “custom thread pool in Java 8 parallel stream”: the goal here isn't to finish processing elements fast – it's to see how many threads reportedly get run in parallel, and how many threads I can specify before the JVM crashes.
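The code itself isn't reproduced in this extract; here's a minimal sketch of the idea, assuming the numElements and numThreads values from the following slides (the sleep simply keeps each thread busy so the parallelism is observable):

import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class ManyThreads {
    public static void main(String[] args) throws Exception {
        int numElements = 50;
        int numThreads = 5;
        // Run the parallel stream inside our own pool to control the parallelism
        ForkJoinPool pool = new ForkJoinPool(numThreads);
        pool.submit(() ->
            IntStream.range(0, numElements).parallel().forEach(i -> {
                System.out.println("Processing " + i + " on " + Thread.currentThread().getName());
                try {
                    Thread.sleep(1000); // keep this thread busy for a second
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            })
        ).get();
        pool.shutdown();
    }
}

With 50 elements, five threads and a one second sleep per element, ten seconds is exactly what we'd expect on the next slide.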
numElements set to 50, numThreads set to 5
Takes ten seconds to finish, no problems
numElements set to 50, numThreads set to 50 also
Finishes instantly – no problems
numElements set to 50,000, numThreads reduced to 10
No problems here either...
numElements set to 50,000, numThreads set to 1,000...
Faster, constant output, still no problems, laptop getting noisy now...
numElements 50,000, numThreads 50,000?
✔ Laptop preparing to take off from my desk
✔ No native memory to create new threads
✔ Unable to terminate the process in my shell – ^C's do nothing!
✔ Mouse stuttering around... can't... click... the x... now curious what happens
✔ JVM repeatedly trying to create coredumps and javacores, trying to eat up my disk space – no memory to create those anyway
✔ LibreOffice crashes, lost unsaved work (past experiments needing to be redone)
✔ Still can't ctrl-c to stop everything
✔ Can't launch any new processes (no chance of launching the system monitor)
✔ Wanted to get a printscreen – no memory available for that either – reboot
Reaching out to GPU(s) for more processing power from Java
We'll struggle trying to run thousands of threads at once in one JVM (using a single machine and a single CPU with many cores), but GPUs can sometimes be of use.
Use cases for GPUs typically share these common themes – we want to:
● Achieve results fast
● Execute many threads to quickly process data for our “easily parallelisable” operations
● Handle large amounts of data
Great for machine learning: quickly compute and store models to use later
Who's using GPUs already? Only public knowledge provided here – certainly many more than this!
AlphaGo beating a Go champion: 1,202 CPUs, 176 GPUs
Titan: 18,688 GPUs, 18,688 CPUs
CERN: reported to be using GPUs
Oak Ridge and IBM, “the world's fastest supercomputers by 2017”: two machines, $325m
Databricks: a recent blog post mentions deep learning with GPUs and Spark
Recent AI vs poker win: the top500 “bridges-supercomputer” article mentions using 64 Nvidia P100 GPUs!
Recent Amazon cloud offering: GPUs as a service
An Nvidia email, part of the accelerated computing newsletter, mentions…
● Deep learning to combat asteroids
● Detecting road lanes with deep learning
● An algorithm to identify skin cancer
● Lip reading AI more accurate than humans
● A life-changing wearable for the blind
Lots more success stories – what makes a GPU useful, worth the hype?
GPUs excel at executing many of the same operations at once (Single Instruction Multiple Data programming).
We'll program using CUDA or OpenCL – like C and C++ but not quite the same (nuances like <<< and >>> for kernels in CUDA) – and we can write JNI code to access data in our Java world using the GPU.
We'll run code on computers that are shipped with graphics cards; there are free CUDA drivers for x86-64 Windows, Linux, and IBM's Power LE, and OpenCL drivers, SDKs and source are also widely available.
(slide shows a CPU vs GPU diagram)
What types of GPU can I get? Does it matter?
“Graphics adapters” you can plug a monitor into:
~2 to 4 GB GDDR5 memory
Fewer than a thousand processing cores
Clock speed ~1250 MHz
Typical in laptops and desktop gaming computers
* For this presentation, experiments (unless otherwise stated) were performed on my Lenovo P50 laptop (discrete graphics mode set in the BIOS, CUDA 7.5, RHEL 7.3, 32 GB RAM, Quadro M1000M GPU, 8 core Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz, ext4 filesystem)
HPC cards like the Tesla series:
GDDR5 memory – typically 8 GB to 24 GB
1–5 thousand processing cores
Offering teraflops of performance
~500 GB/sec max memory bandwidth*
(* remember you're going to be limited by the PCIe bus if it sits between the CPU and GPU; for CUDA devices, use the deviceQuery and bandwidthTest applications)
~300 W thermal design power rating
Running deviceQuery on my development laptop
Application provided with CUDA samples from Nvidia
Device 0: "Quadro M1000M"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2047 MBytes (2146762752 bytes)
( 4) Multiprocessors, (128) CUDA Cores/MP: 512 CUDA Cores
GPU Max Clock rate: 1072 MHz (1.07 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Quadro M1000M
Result = PASS
Running bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Quadro M1000M
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12152.3
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12225.6
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 66464.2
Device to device is quick but the host and device interchange is far slower – compare this to direct memory access…
CPU, RAM, OS, kernel info
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
Stepping: 3
CPU MHz: 899.964
BogoMIPS: 5424.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
cat /proc/meminfo
MemTotal: 32391628 kB
uname -a
Linux devoxx 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat
Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64
GNU/Linux
df -h
Filesystem Size Used
Avail Use% Mounted on
/dev/mapper/vg_oc2660338613-lv_root 438G 81G
336G 20% /
(it's an SSD)
cat /etc/*-release
Linux Client for e-business (RHEL) 7.3
Open Client RHEL 7 4.30
NAME="Red Hat Enterprise Linux Workstation"
VERSION="7.3 (Maipo)"
Workload characteristics a GPU can excel at
Data
● We can process lots of primitive types at once
● ints, longs, doubles, shorts, floats – perhaps used in...
● Matrix multiplication (dot product for ML?)
● Simple transforms (change our masses of longs by a known offset amount?)
● Finding a pattern in the data: count occurrences of a certain string from Wikipedia dumps
Operations
● Keep it simple – without branching and complexity
● Great for arithmetic ops (very fast floating point ops...)
Workload characteristics a GPU won't be good for
Data
● The data we need isn't “self contained” – we can't send down one whole block of data and get meaningful results as we depend on data elsewhere... lots of copying back and forth
Operations
● Non-arithmetic based – code that touches files, uses the network, manipulates objects... stick to the maths
● Involves new object creation or throwing exceptions – more on this later…
● Uses threads for different instructions simultaneously – try to keep it simple without lots of if/elses
How do we use a GPU – basic principles to know
Assume we have an integer array called myData declared in a .cu file (either initialized with malloc or on the stack – a regular C style array):
1) Declare a new variable of the same type e.g. int* myDataOnGPU
2) Allocate space on the GPU (device side) using cudaMalloc, passing in the address of myDataOnGPU and how many bytes to reserve as parameters (e.g. cudaMalloc(&myDataOnGPU, 400))
3) Copy myData from the host to your allocated space (myDataOnGPU) using cudaMemcpyHostToDevice
4) Process your data on the GPU in a kernel (we use <<< and >>>)
5) Copy the result back (what's at myDataOnGPU replaces myData on the host) using cudaMemcpyDeviceToHost
__global__ void addingKernel(int* array1, int* array2){
    array1[threadIdx.x] += array2[threadIdx.x];
}
__global__: a function we can call from the host (CPU) that runs on the device – it's available to be called from everywhere. __device__ and __host__ also exist.
How is the data arranged and how can I access it?
Sequentially: a kernel runs on a grid (numBlocks x numThreads) and this is how we can run many threads that work on different parts of the data.
int* is a regular pointer to integers we've copied to the GPU.
threadIdx.x: we can use this built-in variable inside kernels as an index into our array; remember lots of threads run on the GPU and this can be our way to access each unique item – if we run a kernel with <<<1, 256>>>, that means one block of 256 threads will run each time you call the kernel.
Multiprocessors (also known as streaming multiprocessors or stream processors): these execute one or more thread blocks.
CUDA cores: these execute the threads themselves.
Threads on a GPU: many more are available than with CPUs, and these are organised into the blocks.
Kernel: a function we'll run on the GPU.
How many threads can I really run at once?
Multiprocessor count x their limit, e.g. 4 x 2048 with 512 CUDA cores for me.
A Tesla K80m has 26 multiprocessors and 4992 CUDA cores (2496 per GPU), also with 2048 threads per multiprocessor. Other threads wait to be executed.
How much do I need to know?
All kernels must be launched with grid dimensions specified.
Grid: a logical 3D representation of how threads can be run on a given GPU – a kernel runs on a grid. This grid has potentially many blocks, with threads organised “inside” each block (these actually get run on the multiprocessors).
Our GPU functions (kernels) run on one of these grids, and the dimensions include how many blocks and threads a kernel should run.
The nvidia-smi command tells you about your GPU's limits – know these to prevent launch configuration problems.
A good starting point is to pick 512 for the number of threads; the number of blocks varies depending on your problem size – then launch multiple kernels in a tight loop, modifying the offset to operate on different portions of the data (see the sketch after the table below).
An example to do exactly that using Java:
int log2BlockDim = 9;
int numBlocks = (numElements + 511) >> log2BlockDim;
int numThreads = 1 << log2BlockDim;

Size       Blocks  Threads
500        1       512
1,024      2       512
32,000     63      512
64,000     125     512
100,000    196     512
512,000    1000    512
1,024,000  2000    512
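The tight-loop pattern itself isn't shown on the slides; here's a minimal CUDA sketch of the idea under my own naming (the addOffset kernel and the CHUNK size are illustrative assumptions):

#include <cuda.h>

// Hypothetical kernel: each thread handles one element of the current chunk
__global__ void addOffset(int* data, int chunkStart, int numElements) {
    int i = chunkStart + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numElements) { // bounds check so the final partial chunk is safe
        data[i] += 10;
    }
}

// Host side: launch kernels in a tight loop, moving the offset each time
void processInChunks(int* deviceData, int numElements) {
    const int CHUNK = 1 << 20; // elements per launch – an arbitrary choice
    for (int start = 0; start < numElements; start += CHUNK) {
        int thisChunk = (numElements - start < CHUNK) ? (numElements - start) : CHUNK;
        int numBlocks = (thisChunk + 511) >> 9; // 512 threads per block
        addOffset<<<numBlocks, 512>>>(deviceData, start, numElements);
    }
    cudaDeviceSynchronize(); // wait for all of the launches to finish
}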
Show me the code – simplest example, only CUDA
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>
const int NUM_ELEMENTS = 5;
__global__ void addToMe(int* someInts, int amountToAdd) {
someInts[threadIdx.x] += amountToAdd;
}
// This is in foo.cu → nvcc foo.cu → ./a.out
int main() {
int* myHostInts = (int*) malloc(sizeof(int) * NUM_ELEMENTS);
for (int i = 0; i < NUM_ELEMENTS; i++) {
myHostInts[i] = i;
}
int* myDeviceInts;
const int numBytes = NUM_ELEMENTS * sizeof(int);
cudaMalloc(&myDeviceInts, numBytes);
cudaMemcpy(myDeviceInts, myHostInts, numBytes, cudaMemcpyHostToDevice);
int numBlocks = (NUM_ELEMENTS / 256) + 1;
addToMe<<<numBlocks, 256>>>(myDeviceInts, 10);
cudaMemcpy(myHostInts, myDeviceInts, numBytes,
cudaMemcpyDeviceToHost);
// Tidy up after ourselves as good practice
cudaFree(myDeviceInts);
free(myHostInts);
return EXIT_SUCCESS;
}
Notes on this example:
● No bounds checking! It's not required, but that can lead to problems later
● Printing threadIdx.x here will print 0 to 255
● Blocks = a group of threads; I'll use a 2D grid (just lots of blocks/threads) in this presentation
● Look at our kernel dimensions: numBlocks will be 1 and 256 is the number of threads – this is how we control how many threads to run
How might we use a GPU with Java or Scala?
● [Java] We have an integer array on the Java heap: myData – we want to process it somehow using a GPU
● [Java] Create a native method (Java/Scala): no body required
● [JNI] Write .cpp or .c code with a matching signature for your native method (use javah on your built Java class as a starting point); in this native code, use JNI methods to get a pointer to your data – with this pointer, we can figure out how much memory we need. Call your method that's in a .cu file that you're about to create...
● [CUDA] Allocate space on the GPU (device side) using cudaMalloc
● [CUDA] Copy myData to your allocated space (myDataOnTheGPU) using cudaMemcpyHostToDevice
● [CUDA] Process your data on the GPU in a kernel (look for <<< and >>>)
● [CUDA] Copy the result back (what's now at myDataOnTheGPU replaces myData on the host) using cudaMemcpyDeviceToHost
● [JNI] Release the elements (updating your JNI pointer so the data in our JVM heap is now the result)
● [Java] Interact with your data normally as you're back in the Java world
Working example: file overview and script
Working example: Java code
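The slide is a screenshot that isn't reproduced in this extract; a minimal sketch of what the Java side plausibly looks like, reusing the SimpleJava class, addXToMyInts native method and lib/devoxx.so names that appear in the debugging output later (the rest is my assumption):

public class SimpleJava {
    // Native method – no body required; implemented in our shared library
    private static native void addXToMyInts(int[] myData, int x);

    static {
        // Load the JNI/CUDA shared library built by the script
        System.load(System.getProperty("user.dir") + "/lib/devoxx.so");
    }

    public static void main(String[] args) {
        int[] myData = new int[100];
        for (int i = 0; i < myData.length; i++) {
            myData[i] = i;
        }
        addXToMyInts(myData, 10); // off to JNI, then CUDA, then the GPU...
        System.out.println("myData[5] is now " + myData[5]); // back in the Java world
    }
}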
Working example: header file
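Also a screenshot; running javah over the built class would generate something along these lines (a sketch – the signature matches the demangled symbol shown in the pitfalls section):

/* DO NOT EDIT THIS FILE - it is machine generated */
#include <jni.h>

#ifdef __cplusplus
extern "C" {
#endif
/*
 * Class:     SimpleJava
 * Method:    addXToMyInts
 * Signature: ([II)V
 */
JNIEXPORT void JNICALL Java_SimpleJava_addXToMyInts(JNIEnv*, jclass, jintArray, jint);
#ifdef __cplusplus
}
#endif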
Working example: c++ code (matching the header)
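A sketch of the matching .cpp (my reconstruction, not the original code): get a pointer to the Java array, hand it to the .cu code, then release the elements so the JVM heap sees the result. The addXWithCuda helper name is an assumption:

#include <jni.h>

extern "C" void addXWithCuda(int* hostInts, int numElements, int x); // in our .cu file

extern "C" JNIEXPORT void JNICALL
Java_SimpleJava_addXToMyInts(JNIEnv* env, jclass clazz, jintArray myData, jint x) {
    jsize numElements = env->GetArrayLength(myData);
    // Get a pointer to the data on the Java heap (this may be a copy)
    jint* elements = env->GetIntArrayElements(myData, NULL);
    addXWithCuda((int*) elements, (int) numElements, (int) x);
    // Mode 0: copy the result back and free the native buffer
    env->ReleaseIntArrayElements(myData, elements, 0);
}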
Working example: .cu code (and a simple kernel)
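And a sketch of the .cu side (again my reconstruction), following the cudaMalloc/cudaMemcpy/kernel/copy-back recipe from earlier – this time with bounds checking:

#include <cuda.h>

__global__ void addX(int* someInts, int numElements, int x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numElements) { // bounds checking this time
        someInts[i] += x;
    }
}

extern "C" void addXWithCuda(int* hostInts, int numElements, int x) {
    int* deviceInts;
    const int numBytes = numElements * sizeof(int);
    cudaMalloc(&deviceInts, numBytes);
    cudaMemcpy(deviceInts, hostInts, numBytes, cudaMemcpyHostToDevice);
    int numBlocks = (numElements + 511) / 512;
    addX<<<numBlocks, 512>>>(deviceInts, numElements, x);
    cudaMemcpy(hostInts, deviceInts, numBytes, cudaMemcpyDeviceToHost);
    cudaFree(deviceInts);
}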
Working example: checking the results
Pitfalls to look out for
objdump mysharedlibrary.so -t | grep yourmethodname is very useful for
unsatisfied link errors...
[aroberts@devoxx withjava]$ objdump lib/devoxx.so -t | grep "addX"
00000000000053c4 g F .text 000000000000005f
_Z28Java_SimpleJava_addXToMyIntsP7JNIEnv_P7_jclassP10_jintArrayi
Name mangling can occur (use extern "C" { } blocks in your .cpp and .cu code)
[aroberts@devoxx withjava]$ ./BuildAndRun.sh
Unhandled exception
Type=Segmentation error vmState=0x00000000
Unsafe world now – check your memory accesses - ensure all of your pointers are
still valid, printfs and gdb for debugging, Nsight/cuda-gdb/cuda-memcheck for CUDA
specific help, more on this later
printf statements added...
[aroberts@devoxx withjava]$ ./BuildAndRun.sh
getting elements
got em!
launching kernel...
addToMe+0x20 (0x00007F48441B630F [devoxx.so+0x530f])
Java_SimpleJava_addXToMyInts+0x5c (0x00007F48441B6440 [devoxx.so+0x5440])
(0x00007F4854264F9B [libj9vm28.so+0x8ff9b])
Unhandled exception
^^ Check your memory accesses!
Remember to call your kernel with the <<< and >>> syntax (in a .cu file)
Remember to use your device pointer variable as the parameter in your kernel (not the host one) - or you
won't be able to modify your data (it'll act on nothing – the kernel will still launch but your data will remain
unchanged)
You can add printf statements inside of your kernels (printing threadIdx.x which you're likely using as an index
into an array is a good idea)
Yes, you should add bounds checking inside of your kernels
Yes, you should check return codes and use cudaError_t
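A minimal shape for that return code checking (standard CUDA runtime API; the CHECK wrapper name is my own):

#include <stdio.h>

// Check every CUDA runtime call – a common pattern
#define CHECK(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(err)); \
    } \
} while (0)

// Usage:
// CHECK(cudaMalloc(&myDeviceInts, numBytes));
// CHECK(cudaMemcpy(myDeviceInts, myHostInts, numBytes, cudaMemcpyHostToDevice));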
Is there a simpler way?
Sticking to Java as much as possible gives us:
● Lots of Java projects we want to use
● Error checking
● Type safety
● Debugging tools (core dumps, javacores, system dumps, GCMV, MAT)...
● Profiling tools (Healthcenter, jprof)...
● A JIT compiler and a garbage collector
● Portability – until you “go native”, mix byte-ordering across machines while using sun.misc.unsafe, use other internal APIs relying on field names, or find there's no JRE available...
The approaches we've taken:
● Java class library changes
● Just-In-Time compiler changes
● CUDA4J API
● Apache Spark changes (runs in JVMs)
Making it easier: Java class library modification
-Dcom.ibm.gpu.enable/enforce/disable/verbose
(chart: sorting throughput for ints – ints sorted per second, from 40m to 400m per second, against array lengths from 30,000 to 300,000,000)
Details online here
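The class library change accelerates sorting of primitive arrays; a minimal sketch of exercising it (this assumes IBM's SDK for Java – the exact option value, e.g. enable=sort, is my recollection of the documentation, so verify against your SDK):

import java.util.Arrays;
import java.util.Random;

public class GpuSort {
    public static void main(String[] args) {
        int[] toSort = new int[30_000_000];
        Random random = new Random();
        for (int i = 0; i < toSort.length; i++) {
            toSort[i] = random.nextInt();
        }
        long start = System.nanoTime();
        Arrays.sort(toSort); // large primitive sorts become eligible for GPU offload
        System.out.println("Sorted in " + (System.nanoTime() - start) / 1e6 + " msec");
    }
}
// Run with: java -Dcom.ibm.gpu.enable=sort -Dcom.ibm.gpu.verbose GpuSort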
Making it easier: Java JIT compiler modification
-Xjit:enableGPU={default|verbose} – can be forced with {enforce}
Using three arrays of randomly generated doubles – output, firstArray, secondArray, all of size ROWS x ROWS (ROWS is 2048 here) – use an IntStream and specify our JIT option.
Primitive types can be used (byte, char, short, int, float, double, long).
Run this inside a loop for an easily reproducible example – the JIT must be hot to make an impact.
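The slide's code isn't reproduced in this extract; a minimal sketch of the kind of lambda the JIT can offload (matrix multiplication over flattened primitive arrays – the MatMulti/runGPULambda names match the verbose output on the next slide, the rest is my assumption and deliberately sticks to the eligibility criteria listed later):

import java.util.stream.IntStream;

public class MatMulti {
    private final int rows = 2048;
    private final double[] output = new double[rows * rows];
    private final double[] firstArray = new double[rows * rows];  // fill with random doubles
    private final double[] secondArray = new double[rows * rows]; // fill with random doubles

    // Eligible for GPU offload: primitive 1D arrays, instance variables, no object creation
    void runGPULambda() {
        IntStream.range(0, rows * rows).parallel().forEach(idx -> {
            int i = idx / rows;
            int j = idx % rows;
            double sum = 0;
            for (int k = 0; k < rows; k++) {
                sum += firstArray[i * rows + k] * secondArray[k * rows + j];
            }
            output[idx] = sum;
        });
    }

    public static void main(String[] args) {
        MatMulti m = new MatMulti();
        for (int run = 0; run < 5; run++) { // the JIT must be hot to make an impact
            long start = System.nanoTime();
            m.runGPULambda();
            System.out.println("End time: " + (System.nanoTime() - start) / 1e6 + " msec");
        }
    }
}
// Run with: java -Xjit:enableGPU="{verbose}" MatMulti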
Results on my laptop
[IBM GPU JIT]: Device Number 0: name=Quadro M1000M, ComputeCapability=5.0
Setting up our arrays, size is 2048x2048
Done setting up!
Starting the GPU enabled lambda, running GPU enabled lambda, parallelism: 1
End time: 42575.864909 msec
Starting the GPU enabled lambda, running GPU enabled lambda, parallelism: 1
End time: 41080.132863 msec
Starting the GPU enabled lambda, running GPU enabled lambda, parallelism: 1
[IBM GPU JIT]: [time.ms=1489774852380]: Launching parallel forEach in
com/ibm/MatMultiExample/MatMulti.runGPULambda()V at line 139 on GPU
[IBM GPU JIT]: [time.ms=1489774853402]: Finished parallel forEach in
com/ibm/MatMultiExample/MatMulti.runGPULambda()V at line 139 on GPU
End time: 1042.937686 msec
With no JIT options provided, over 100 iterations (instead of just five) I still achieve a best time of 42 seconds. With more threads (setting the fork-join common pool parallelism property to 8 or 32 instead of 1) my best time is 32 seconds – still much slower.
Performance of GPU enabled lambdas
Measured the performance improvement with a GPU using four programs, against:
● 1-CPU-thread sequential execution
● 160-CPU-thread parallel execution
Experimental environment used: IBM Java 8 Service Release 2 for PowerPC Little Endian; two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256GB memory (160 hardware threads in total) with one NVIDIA Kepler K40m GPU (2880 CUDA cores in total) at 876 MHz with 12GB global memory (ECC off)
(chart) This shows the GPU execution time speedup compared to what's in blue (1 CPU thread) and yellow (160 CPU threads) – the higher the bar, the bigger the speedup!
Name      Summary                                    Data size                 Data type
MM        Dense matrix multiplication: C = A x B     1024 x 1024 (1m) items    double
SpMM      As above but sparse                        500k x 500k (250m) items  double
Jacobi2d  Solve an equation using the Jacobi method  8192 x 8192 (67m) items   double
LifeGame  Conway's Game of Life with 10k iterations  512 x 512 (262k) items    byte
(diagram: bytecodes → intermediate representation → optimizer → CPU and GPU code generators → CPU native code and PTX ISA)
As the JIT compiles a stream expression we can identify candidates for GPU off-loading:
● Data copied to and from the device implicitly
● Java operations mapped to GPU kernel operations
● JIT takes care of GPU data alignment, cache management
● Optimizes data transfer
● Manages multiple devices
Advantages:
● Reuses standard Java idioms, so no new API is required
● Preserves standard Java semantics
● No knowledge of the GPU programming model is required by the application developer
● Takes care of low level details: GPU device capabilities, etc.
● Chooses the optimal execution mode: CPU, GPU, or SIMD
● Future performance improvements in the JIT do not require code changes
(diagram: the JVM handles class loading, method resolution, object creation and GC, and exception handling; a Java array is copied over PCIe to a GPU copy of that array, where optimized lambda code is executed by multiple threads in a data parallel manner with exception detection; redirection back to the CPU can happen at compile time or at runtime)
The JIT compiler will check that the lambda expression satisfies the following criteria:
• Only accesses primitive types, and one-dimensional arrays of primitive types
• No access to static scalar variables: only locals, parameters, or instance variables
• No unresolved or native methods
• No creating new heap objects (new ...), exceptions (throw …), or instanceof
• Intermediate stream operations like map or filter are not supported
Limitations:
• GPU memory isn't an extension of the Java heap
JIT based optimizations: performance heuristics
• The JIT applies various performance heuristics to determine the execution mode of the lambda expression (sequential, fork-join, GPU, or SIMD)
• Heuristics depend on numerous factors and may change in the future to become more accurate, to deal with new architecture characteristics, etc.
• Currently, they are relatively conservative
• We will work on new heuristics based on customer feedback
• To observe if forEach was sent to the GPU, use -Xjit:enableGPU={verbose}
• To override the performance heuristics, use -Xjit:enableGPU={enforce}
• For combining options, -Xjit:enableGPU=”enforce|verbose” will work: the quotes are important lest your bash shell interpret | as a pipe!
• Give it a go for yourself, keeping in mind the criteria for code to be eligible for GPU offloading
• We are using NVVM IR
Making it easier: CUDA4J API
Production ready and supported by IBM – used to manipulate GPU devices.
● Compared to Jcuda: no arbitrary and unrestricted use of Pointer(long); feels more like Java instead of C
Write your CUDA kernel (yes, the hard part!) and compile it into a fat binary:
nvcc --fatbin AdamKernel.cu
Add your Java code:
import com.ibm.cuda.*;
import com.ibm.cuda.CudaKernel.*;
Load your fat binary (module loading code is at the end of this presentation):
module = new Loader().loadModule("AdamDoubler.fatbin", device);
Build and run as you would any other Java application.
There are times when you want this low level GPU control from Java:
● We developed an API that reflects the concepts familiar in CUDA programming
● Makes use of Java exceptions, automatic resource management, etc.
● Handles copying data to/from the GPU, flow of control from Java to GPU and back
● Ability to invoke existing GPU module code from Java applications e.g. Thrust
CUDA4J class mapping:
CudaDevice     a CUDA capable GPU device
CudaStream     a sequence of operations on the GPU
CudaBuffer     a region of memory on the GPU
CudaModule     user library of kernels to load into the GPU
CudaKernel     launching a device function
CudaFunction   a kernel's entry point
CudaEvent      timing and synchronization
CudaException  when something goes wrong
Only doubling integers here – it could be any use case where we're doing the same operation to lots of elements at once.
Full code listing at the end; for Javadocs, search “IBM Java 8 API com.ibm.cuda”.
* Tip: the offsets are byte offsets, so you'll want your index in Java * the size of the object!
Our kernel compiles into AdamDoubler.fatbin:
module = new Loader().loadModule("AdamDoubler.fatbin", device);
kernel = new CudaKernel(module, "Cuda_cuda4j_AdamDoubler_Strider");
stream = new CudaStream(device);
numElements = 100;
myData = new int[numElements];
Util.fillWithInts(myData);
CudaGrid grid = Util.makeGrid(numElements, stream);
buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
buffer1.copyFrom(myData);
Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
kernel.launch(grid, kernelParams);
buffer1.copyTo(myData);
If our dynamically created grid dimensions are too big, we need to break down the problem and use the slice* API: see doChunkingProblem()
Where would this be useful?
● Integrating CUDA GPU offloading support into your existing Java applications without needing to worry about JNI, makefiles, managing GPU memory and writing C++ code (you still need to write your kernel)
● Identify your most commonly used functions as candidates (simple manual profiling or using tools such as Healthcenter for method profiling)
● Tinker with heuristics and benchmark new capabilities
● Be wary of the GPU limitations (e.g. device memory amount, max grid size – you may need to chunk up your problem)
Improving Apache Spark
● Open source project (the most active for big data) offering distributed...
  ● Machine learning
  ● Graph processing
  ● Core operations (map, reduce, joins)
  ● SQL syntax with DataFrames/Datasets
● Many input formats supported e.g. Parquet, JSON, files stored on HDFS you can parse trivially, CSV with a Databricks package
● Interoperability with Kafka, Hive and many more (see Apache Bahir also)
● Compression codecs and automatic usage, fast serialization with Kryo
● Offers scalability and resiliency
● Lots of Scala – so it runs in JVMs (and exploits sun.misc.unsafe heavily)
● Python, R, Scala and Java APIs
● Eligible for our Java based optimisations
Ask after the talk for more details – my current role involves contributing to, fixing and evangelising Spark as well as producing demos and working on customer problems; I'm especially interested in data visualization tools.
What machine learning algorithms?
Popular algorithms that'll run in a (potentially) distributed manner include:
● Alternating least squares
● K-means
● Naive Bayes
● Logistic regression
● Random forests
● Decision trees
● Principal component analysis
Known good candidates
● Recommendation algorithms such as Alternating Least Squares
  ● Movie recommendations on Netflix?
  ● Recommended purchases on Amazon?
  ● Similar songs with Spotify?
  ● Recommended videos on YouTube?
● Clustering algorithms such as K-means (unsupervised learning – no labels, cheap)
  ● Produce clusters from data to determine which cluster a new item can be categorised as
  ● Identify anomalies: transaction fraud or erroneous data
● Classification algorithms such as logistic regression
  ● Create a model that we can use to predict where to plot the next item in a sequence (above or below our line of best fit)
  ● Healthcare: predict adverse drug reactions based on known interactions between similar drugs
  ● Spam filter (binomial classification)
How Alternating Least Squares works
An example: we have the following .csv file for bands
<username, band name, band genre (a feature), rating>
Adam,ACoolBand1,AGenre,5
Adam,ACoolBand2,AGenre,5
Adam,ACoolBand3,AGenre,5
George,ACoolBand1,AGenre,5
George,ACoolBand2,AGenre,5
George,ACoolBand3,AGenre,5
George,ACoolBand4,AGenre,5
So if we were to guess whether Adam likes ACoolBand4 as well, the score would be very close to 5 – we can infer it based on already known observations.
Very much simplified, ALS works by factorizing the rating matrix and minimising the loss on observed ratings (our ratings matrix will be sparse and we want to complete it – see “CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs”, an Nvidia GTC 2016 talk by Wei Tan, for an excellent summary).
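To make that concrete, a minimal sketch of ALS with Spark's MLlib (my illustration, not from the slides; MLlib's Rating wants numeric IDs, so the usernames and band names are mapped to ints here, and the rank/iterations/lambda values are arbitrary for a tiny example):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class BandRecommender {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "BandRecommender");
        // user 0 = Adam, 1 = George; products 0..3 = ACoolBand1..ACoolBand4
        JavaRDD<Rating> ratings = sc.parallelize(Arrays.asList(
                new Rating(0, 0, 5), new Rating(0, 1, 5), new Rating(0, 2, 5),
                new Rating(1, 0, 5), new Rating(1, 1, 5), new Rating(1, 2, 5),
                new Rating(1, 3, 5)));
        MatrixFactorizationModel model = ALS.train(ratings.rdd(), 10, 10, 0.01);
        System.out.println("Adam's predicted rating for ACoolBand4: " + model.predict(0, 3));
        sc.stop();
    }
}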
But what if there's a band way down the list in a place we can't fit into memory?
● Zack_Zwick: ObscureBand27, AGenre, 5
How can we infer that Adam will also probably like ObscureBand27?
● We still want to be using GPUs (and we want the results fast – Adam's using a premium online service and doesn't want to wait hours to get a good match)
● Covered in the paper I cite next (Tan, Wei (IBM); Cao, Liaoliang (Yahoo!; IBM at the time of the work); Fong, Liana (IBM)); to summarise:
“cuMF first generates a partition scheme, planning which partition to send to which GPU in what order. With this knowledge in advance, cuMF uses separate CPU threads to preload data from disk to host memory, and separate CUDA streams to preload from host memory to GPU memory. By this proactive and asynchronous data loading, we manage to handle out-of-core problems with close-to-zero data loading time except for the first load.”
● https://github.com/cuMF/cumf_als explains how to run this in batches
Our approach for Apache Spark
● An under the covers optimisation: set the spark.mllib.ALS.useGPU property
● Full paper: http://arxiv.org/abs/1603.03820
● Full implementation for raising issues and giving it a try for yourself: https://github.com/IBMSparkGPU – with 1.5 GB of a Netflix dataset:

Experiment  12 threads, CPU  64 threads, CPU  2x GPUs
Time        676 seconds      N/A              140 seconds

Our implementation is open source and cited above; we used 2x Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz, 16 cores in the machine (SMT-2), 256 GB RAM vs 2x Nvidia Tesla K80Ms. Also available for IBM Power LE.
Implemented the vanilla C++/Java/CUDA way so this works with any JDK (a tiny amount of C++ code and lots of CUDA for Spark – we only override one function):
● modified the existing ALS (.scala) implementation's computeFactors method
● added code to check if spark.mllib.ALS.useGPU is set
● if set, we'll then call our native method written to use JNI (.cpp)
● our JNI method calls the native CUDA (.cu) method
● CUDA is used to send our data to the GPU, call our kernel, and return the results over JNI back to the Java heap
Bundled with our Spark distribution, and the shared library is included: libGPUALS.so
● Remember this will require the CUDA runtime (libcudart) and a capable GPU
ALS.scala (computeFactors) → CuMFJNIInterface.cpp → ALS.cu → libGPUALS.so
More pervasive GPU opportunities for Spark
We can send generated code to GPUs and alter the code that's generated to conform to certain characteristics...
Input: a user application using the Spark DataFrame or Dataset API (SQL-like syntax, performing queries on data stored in tables)
✔ Spark with Tungsten: uses UnsafeRow and sun.misc.unsafe; the idea is to bring Spark closer to the hardware than previously, exploit CPU caches, improve memory and CPU efficiency, reduce GC times, avoid Java object overheads – good deep dive here
✔ Spark with Catalyst: an optimiser for the Spark SQL APIs (good deep dive here); transforms a query plan (an abstraction of a user's program) into an optimised version, generating optimised code with the Janino compiler
✔ Spark with our changes: Java and core Spark class optimisations, optimised JIT
Output: generated code able to leverage auto-SIMD and GPUs
We want generated code that:
✔ has a counted loop, e.g. one controlled by an automatic induction variable that increases from a lower to an upper bound
✔ accesses data in a linear fashion
✔ has as few branches as possible (simple for the GPU's kernel)
✔ does not have external method calls, or contains only calls that can be easily inlined
These help a JIT to either use auto-SIMD capabilities or GPUs.
Problems
1) Data representation of columnar storage (CachedBatch with Array[Byte]) isn't commonly used
2) Compression schemes are specific to CachedBatch, limited to just several data types
3) Building in-memory cache involves a long code path -> virtual method calls, conditional branches
4) Generated whole-stage code -> unnecessary conversion from CachedBatch or ColumnarBatch to
UnsafeRow
Solutions
1) Use ColumnarBatch format instead of CachedBatch for the in-memory cache generated by the
cache() method. ColumnarBatch and ColumnVector are commonly used data representations for
columnar storage
2) Use a common compression scheme (e.g. lz4) for all of the data types in a ColumnVector
3) Generate code at runtime that is simple and specialized for building a concrete instance of the in-memory cache
4) Generate whole-stage code that directly reads data from columnar storage
(1) and (2) increase code reuse, (3) improves runtime performance of executing the cache() method
and (4) improves performance of user defined DataFrame and Dataset operations
We propose a new columnar format, CachedColumnarBatch, that has a pointer to ColumnarBatch (used by the Parquet reader) which keeps each column as OnHeapUnsafeColumnVector instead of OnHeapColumnVector.
Not yet using GPUs!
● [SPARK-13805], merged into 2.0, performance improvement: 1.2x
Get data from a ColumnVector directly, avoiding a copy from the ColumnVector to UnsafeRow when a program reads data in Parquet format
● [SPARK-14098], targeted for Spark 2.2, performance improvement: 3.4x
Generate optimized code to build CachedColumnarBatch, get data from a ColumnVector directly (avoiding a copy from the ColumnVector to UnsafeRow), and use lz4 to compress the ColumnVector when df.cache() or ds.cache() is executed
● [SPARK-15962], merged into 2.1, performance improvement: 1.7x
Remove the indirection at the offsets field when accessing each element in UnsafeArrayData, reducing the memory footprint of UnsafeArrayData
● [SPARK-15985], merged into 2.1, performance improvement: 1.3x
Eliminate boxing operations that put a primitive array into GenericArrayData when a Dataset program with a primitive array is run
● [SPARK-16213], merged into 2.2, performance improvement: 16.6x
Eliminate boxing operations that put a primitive array into GenericArrayData when a DataFrame program with a primitive array is run
● [SPARK-17490], merged into 2.1, performance improvement: 2.0x
Eliminate boxing operations that put a primitive array into GenericArrayData when a DataFrame program with a primitive array is used
What's in it for me if I care about Spark?
● Another way to make exploiting GPUs easier
● We know how to build GPU based applications
● We can figure out if a GPU is available
● We can figure out what code to generate
● We can figure out which GPU to send that code to
● All while retaining Java safety features such as exceptions, bounds checking, serviceability, tracing and profiling hooks...
Assuming you have the hardware: add an option and watch performance improve. This is ongoing work that can likely be applied to other projects.
Challenges for GPU programming
We want developers to be aware of these so we can work together:
● Restricted by PCIe speed (less so with IBM hardware – NVLink on Power)
● Writing a decent kernel is hard: optimum use of the different memory types (global, shared, texture, registers), debugging (lots of seg faults – you're in the CUDA world now!), the limited functions you can use in a kernel, and maintaining contiguous access where possible
● Not many GPU developers out there relative to other language pros: we want developers who know machine learning, know Java, know CUDA, and know how to debug and profile
● Watch lots of videos and experiment – breaking things as you go and learning; you need to achieve max parallelism and avoid seg faults – good fun
● Big changes to the CUDA SDK itself: this presentation is for CUDA 7.5 and I learned with CUDA 5.5, so there are likely lots of new features I'm not exploiting – keep up to date, as I've seen important bug fixes going into driver/SDK releases
● Profiling – many variables to tweak (and therefore many opportunities for benchmarking fun)
● More pitfalls than Java, unless you're using sun.misc.unsafe or JNI
● CUDA was initially a problem to set up on my laptop (wanting to keep my desktop, use the Nvidia driver, use the CUDA toolkit AND a projector…)
● Debugging in a massively multithreaded environment... be careful of race conditions
● Ideally developers can focus on the kernel logic and design principles instead of how to write GPU code and how to manage things like scheduling and partitioning strategies (if you were to accelerate Apache Spark, for example – lots of challenges here)
What else could possibly go wrong?
Bad JNI code
● Getting and releasing elements: invalid pointers, incorrect usage!
● Exceeding bounds
● Mixing C and C++ ((*env)-> is C, env-> is C++)
Bad Java code
● sun.misc.Unsafe usage leading to seg faults
Bad CUDA code – lots can go wrong here…
● Bad kernel – exceeding bounds, sending junk data, inefficient use of memory types
● Bad cudaMalloc, cudaMemcpy calls
● Not checking return codes
Bad C/C++ code
● A presentation in its own right
Does it need to be in Java at all?
Bad design – your application just isn't a good fit for GPUs:
● Not enough data for it to be worthwhile
● Too complex a problem – making new objects, lots of branching code, NOT doing lots of floating point/arithmetic operations…
● Way too much data to fit on GPUs, and it'll be very difficult/time consuming to chunk it up (not all problems are going to be model parallelisable or data parallelisable)
“My CPUs are already good and cheap – and I'm not the one managing them anyway! I'll just get a few more instead of that brand spanking new GPU I may need to learn CUDA for...”
● Use Apache Spark instead?
● Purchase more effective hardware – perhaps IO/network devices?
● Profiling – find out you're not actually compute bound?
● Write more efficient Java/Scala code (good use of private/final, benchmarking everything – again a talk in its own right; search “London Spark Meetup IBM Spark in production”)
Interesting projects related to GPUs/Java
I'll only briefly cover each project as these are all excellent projects in their own right and worthy of their own detailed talks.
● OpenCL
A low level framework that simplifies development for devices such as GPUs and FPGAs; works on AMD cards too, maintained by the Khronos Group
● TensorFlow
Java, C++ and Python APIs, built for machine learning researchers and data scientists; open source (mainly developed by Google) – features offloading to CUDA devices (requiring the toolkit/driver/cuDNN)
● SystemML
IBM open-sourced project (now an Apache incubator project), recently committed GPU support (yet to try) – write code in DML and easily scale out once ready
● Jcuda
An alternative to CUDA4J (no IBM Java requirement) – more like C than Java; plenty of bindings for Nvidia libraries available, and open source
● DeepLearning4j
Want to find out more on this in particular: the most popular open source deep-learning framework for the JVM – mentions built-in GPU support, so I'm desperate to try this out; anything making GPU exploitation easier is good
● Nvidia libraries such as
  ● cuDNN: a deep neural network library for CUDA devices
  ● cuBLAS: basic linear algebra subroutines on the GPU
  ● Thrust: precanned algorithms for HPC on CUDA devices (e.g. sorting)
● Aparapi
  ● Excellent video by the AMD Runtimes team on it (very few views on YouTube, highly recommended)
  ● A Java API for the GPU – also Java styled, like CUDA4J
  ● Converts Java code to OpenCL
  ● Requires overloading the run routine in a kernel class (as you already would for java.lang.Thread)
  ● Ships as a jar file and shared libraries with JNI (so no JVM changes)
● Project Sumatra
  ● OpenJDK initiative for GPU support as part of the Java SDK
  ● Excellent video: “Sumatra OpenJDK Project Update” with < 120 views on YouTube!
  ● Details their approach to optimising forEach and reduce using GPUs
  ● Not tried this – mentions building Graal and Sumatra to get a “HSA enabled Graal-based JDK”; it would be interesting to collaborate/compare findings, and to look into Wholly Graal (another good video on YouTube about this – again with very few views)
Simple debugging example
BuildAndRun.sh uses nvcc then executes the .out file; the code involves:
int numElements = 5000000;
I initialize the arrays with the following code (the cause of the problem):
double first[numElements];
double second[numElements];
Then do some CUDA stuff... so let's get a core to inspect.
[aroberts@devoxx fun]$ gdb core.20916
[New LWP 20916]
Core was generated by `./a.out'.
Program terminated with signal 11,
Segmentation fault.
#0 0x000000000040263a in ?? ()
"/home/aroberts/fun/core.20916" is a core
file.
Please specify an executable to debug.
(gdb) bt
Python Exception <class 'gdb.MemoryError'>
Cannot access memory at address
0x7ffd88544d10:
(gdb) info registers
rax 0x0 0
rbx 0x0 0
rcx 0xa0 160
rdx 0x7ffd88544d10 140726890679568
rsi 0x7ffe47106e58 140730090679896
rdi 0x1 1
rbp 0x7ffe47106d70 0x7ffe47106d70
rsp 0x7ffd88544d10 0x7ffd88544d10
✔ rsp is the stack pointer
✔ We get a memory error
➢ How big is numElements?
➢ You want an array on the stack that's how big?!
1) Rule out a CUDA programming mistake first (GPU side) – tweaking the
kernel, judicious use of printf statements
2) Rule out a C++ side mistake (host side)
Tools can help here if there's a standalone version of the application first –
tuned and tweaked before being called out to from Java
cuda-memcheck ./a.out
========= CUDA-MEMCHECK
========= Error: process didn't terminate successfully
========= The application may have hit an error when dereferencing Unified Memory
from the host. Please rerun the application under cuda-gdb or Nsight Eclipse Edition to catch
host side errors.
========= Internal error (20)
========= No CUDA-MEMCHECK results found
Add “-g” to the nvcc compiler options for debug symbols, then run cuda-gdb again with
your application (e.g. cuda-gdb a.out)
(cuda-gdb) run
Starting program: /home/aroberts/devoxx/./a.out
Program received signal SIGSEGV, Segmentation fault.
0x0000000000402657 in main () at MatMulti.cu:14
14 printf("Starting\n");
But that's the first line in my program!
(cuda-gdb) c
Continuing.
Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
I change my allocations to go on the heap using malloc.
“I didn't need to worry about that in Java!”
And what happens if...
● We ask for too much memory on the GPU?
cudaError_t ret = cudaMalloc(&whopper, 1337999999);
out of memory is reported (using cudaGetErrorString(ret))
● We specify too big a kernel, using huge dimensions? (so we must be careful generating these programmatically, i.e. using CUDA4J)
__global__ void callMe(int anything) {}
callMe<<<100000,1000>>>(100);
CUDA error: invalid argument is reported (I used cudaDeviceSynchronize and cudaGetLastError combined with cudaGetErrorString)
● We go out of bounds in a kernel?
Junk data: send 50 ints to the GPU but launch the kernel with a <<<1, 256>>> configuration and printf("%d ", array[threadIdx.x]) in the kernel:
-11039938 -1 -1 -1 -14325499 -722701 -1 -328454 -14588416
● A global C++ variable is referenced in my kernel?
Won't compile: undefined variable
● A kernel takes forever AND is a useless kernel (while true loop)?
Message from syslogd@oc3752625144 at Mar 18 20:59:44
...kernel:NMI watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [Xorg:2167]
Can't do anything during this period – only tested on my laptop
● You try to use malloc inside a kernel?
No errors; you can do int* hello = (int*) malloc(256); then hello[threadIdx.x] = threadIdx.x; – printing hello[threadIdx.x] kind of works, but I don't get all of the output on each application run (using a kernel config of <<<1,256>>>)
● cudaMalloc inside a kernel?
Not allowed, as it's a __host__ function (and malloc isn't?!)
● You try to call an arbitrary host function in your kernel?
Same as on the previous slide: not allowed (prevented by the compiler)
● Your kernel A calls kernel B?
Works, but call the kernel as you would from the host (<<<,>>>) and compile with nvcc -arch=sm_35 -rdc=true
● Your kernel A calls kernel B which then calls kernel A…
Closing remarks
● An easy way to get the latest JDKs (optionally with Apache Spark) is below
● Great if you know CUDA already – but it's not required
● GPUs don't need to be expensive – but the server ones will be
● Useful for certain operations – not the “be all and end all” that's guaranteed to give you a boost, but why not make use of it if you have it
● Discoveries to be had performance testing with simple timers, tuning kernels, and writing optimised CUDA code as well as your Java code (or using existing libraries)
● Lots of projects out there combining Java and GPUs! We're especially interested in delivering runtime improvements with minimal to no code changes required – partially by improving the IBM J9 VM itself (look out for OpenJ9)
http://ibm.biz/spark-kit
Feedback and suggestions welcome: aroberts@uk.ibm.com
CUDA4J example code beyond this point.
CUDA4J sample, part 1 of 3
import com.ibm.cuda.*;
import com.ibm.cuda.CudaKernel.*;
public class Sample {
private static final boolean PRINT_DATA = false;
private static int numElements;
private static int[] myData;
private static CudaBuffer buffer1;
private static CudaDevice device = new CudaDevice(0);
private static CudaModule module;
private static CudaKernel kernel;
private static CudaStream stream;
public static void main(String[] args) {
try {
module = new Loader().loadModule("AdamDoubler.fatbin", device);
kernel = new CudaKernel(module, "Cuda_cuda4j_AdamDoubler_Strider");
stream = new CudaStream(device);
doSmallProblem();
doMediumProblem();
doChunkingProblem();
} catch (CudaException e) {
e.printStackTrace();
} catch (Exception e) {
e.printStackTrace();
}
}
private final static void doSmallProblem() throws Exception {
System.out.println("Doing the small sized problem");
numElements = 100;
myData = new int[numElements];
Util.fillWithInts(myData);
CudaGrid grid = Util.makeGrid(numElements, stream);
System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");
buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
buffer1.copyFrom(myData);
Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
kernel.launch(grid, kernelParams);
int[] originalArrayCopy = new int[myData.length];
System.arraycopy(myData, 0, originalArrayCopy, 0, myData.length);
buffer1.copyTo(myData);
Util.checkArrayResultsDoubler(myData, originalArrayCopy);
}
private final static void doMediumProblem() throws Exception {
System.out.println("Doing the medium sized problem");
numElements = 5_000_000;
myData = new int[numElements];
Util.fillWithInts(myData);
// This is only when handling more than max blocks * max threads per kernel
// Grid dim is the number of blocks in the grid
// Block dim is the number of threads in a block
// buffer1 is how we'll use our data on the GPU
buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
// myData is on CPU, transfer it
buffer1.copyFrom(myData);
// Our stream executes the kernel, can launch many streams at once
CudaGrid grid = Util.makeGrid(numElements, stream);
System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX +
">>>");
Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
kernel.launch(grid, kernelParams);
int[] originalArrayCopy = new int[myData.length];
System.arraycopy(myData, 0, originalArrayCopy, 0, myData.length);
buffer1.copyTo(myData);
Util.checkArrayResultsDoubler(myData, originalArrayCopy);
}
CUDA4J sample, part 2 of 3
private final static void doChunkingProblem() throws Exception {
// I know 5m doesn't require chunking on the GPU but this does
System.out.println("Doing the too big to handle in one kernel problem");
numElements = 70_000_000;
myData = new int[numElements];
Util.fillWithInts(myData);
buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
buffer1.copyFrom(myData);
CudaGrid grid = Util.makeGrid(numElements, stream);
System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");
// Check we can actually launch a kernel with this grid size
try {
Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
kernel.launch(grid, kernelParams);
int[] originalArrayCopy = new int[numElements];
System.arraycopy(myData, 0, originalArrayCopy, 0, numElements);
buffer1.copyTo(myData);
Util.checkArrayResultsDoubler(myData, originalArrayCopy);
} catch (CudaException ce) {
if (ce.getMessage().equals("invalid argument")) {
System.out.println("it was invalid argument, too big!");
int maxThreadsPerBlockX = device.getAttribute(CudaDevice.ATTRIBUTE_MAX_BLOCK_DIM_X);
int maxBlocksPerGridX = device.getAttribute(CudaDevice.ATTRIBUTE_MAX_GRID_DIM_Y);
long maxThreadsPerGrid = maxThreadsPerBlockX * maxBlocksPerGridX;
// 67,107,840 on my Windows box
System.out.println("Max threads per grid: " + maxThreadsPerGrid);
long numElementsAtOnce = maxThreadsPerGrid;
long elementsDone = 0;
grid = new CudaGrid(maxBlocksPerGridX, maxThreadsPerBlockX, stream);
System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");
while (elementsDone < numElements) {
if ( (elementsDone + numElementsAtOnce) > numElements) {
numElementsAtOnce = numElements - elementsDone; // Just do the remainder
}
long toOffset = numElementsAtOnce + elementsDone;
// It's the byte offset not the element index offset
CudaBuffer slicedSection = buffer1.slice(elementsDone * Integer.BYTES, toOffset * Integer.BYTES);
Parameters kernelParams = new Parameters(2).set(0, slicedSection).set(1, numElementsAtOnce);
kernel.launch(grid, kernelParams);
elementsDone += numElementsAtOnce;
}
int[] originalArrayCopy = new int[myData.length];
System.arraycopy(myData, 0, originalArrayCopy, 0, myData.length);
buffer1.copyTo(myData);
Util.checkArrayResultsDoubler(myData, originalArrayCopy);
} else {
System.out.println(ce.getMessage());
}
}
}
CUDA4J sample, part 3 of 3
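To make the chunking arithmetic above concrete, here is a minimal CPU-only sketch of the loop's offset handling. It isn't part of the original sample: the class name is illustrative and the 67,107,840 figure simply assumes the threads-per-grid limit printed by the code above.

public class ChunkMaths {
  public static void main(String[] args) {
    long numElements = 70_000_000L;  // Too big for one kernel launch here
    long maxPerLaunch = 67_107_840L; // 65,535 blocks * 1,024 threads
    long elementsDone = 0;
    while (elementsDone < numElements) {
      long thisChunk = Math.min(maxPerLaunch, numElements - elementsDone);
      // Slices into the CudaBuffer are expressed in bytes, not element indexes
      long fromByte = elementsDone * Integer.BYTES;
      long toByte = (elementsDone + thisChunk) * Integer.BYTES;
      System.out.println("Launch covering elements " + elementsDone + " to " +
        (elementsDone + thisChunk - 1) + " (bytes " + fromByte + " to " + toByte + ")");
      elementsDone += thisChunk;
    }
  }
}

Run as-is this prints two launches: elements 0 to 67,107,839, then the remainder, 67,107,840 to 69,999,999 – the same split the sliced kernel launches above perform on the GPU.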
CUDA4J kernel
#include <stdint.h>
#include <stdio.h>
/**
 * 2D grid so we can have 1024 threads and many blocks
 * Remember 1 grid -> has blocks/threads and one kernel runs on one grid
 * In CUDA 6.5 we have cudaOccupancyMaxPotentialBlockSize which helps
 *
 * Let's say we have 512 ints to double, keeping it simple,
 * and the kernel is launched with one block shaped (x=1, y=512)
 * For this size our kernel will be set up as follows:
 * 1 grid, 1 block, 512 threads
 * blockDim.x is going to be 1
 * threadIdx.x will remain at 0
 * threadIdx.y will range from 0 to 511
 * So index runs from 0 to 511 and we limit access to how many elements we have
 */
extern "C" __global__ void Cuda_cuda4j_AdamDoubler(int* toDouble, int
numElements){
int index = blockDim.x * threadIdx.x + threadIdx.y;
if (index < numElements) { // Don't go out of bounds
toDouble[index] *= 2; // Just double it
}
}
extern "C" __global__ void Cuda_cuda4j_AdamDoubler_Strider(int* toDouble,
int numElements){
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < numElements) { // don't go overboard
toDouble[i] *= 2;
}
}
Utility methods, part 1 of 2
package com.ibm.CUDA4JExample;
import com.ibm.cuda.*;
public class Util {
protected final static void fillWithInts(int[] toFill) {
for (int i = 0; i < toFill.length; i++) {
toFill[i] = i;
}
}
protected final static void fillWithDoubles(double[] toFill) {
for (int i = 0; i < toFill.length; i++) {
toFill[i] = i;
}
}
protected final static void printArray(int[] toPrint) {
System.out.println();
for (int i = 0; i < toPrint.length; i++) {
if (i == toPrint.length - 1) {
System.out.print(toPrint[i] + ".");
} else {
System.out.print(toPrint[i] + ", ");
}
}
System.out.println();
}
protected final static CudaGrid makeGrid(int numElements, CudaStream stream) {
int numThreads = 512;
int numBlocks = (numElements + (numThreads - 1)) / numThreads;
return new CudaGrid(numBlocks, numThreads, stream);
}
/*
* Array will have been doubled at this point
*/
protected final static void checkArrayResultsDoubler(int[] toCheck, int[] originalArray) {
long errorCount = 0;
// Check result, data has been copied back here
if (toCheck.length != originalArray.length) {
System.err.println("Something's gone horribly wrong, different array length");
}
for (int i = 0; i < originalArray.length; i++) {
if (toCheck[i] != (originalArray[i] * 2) ) {
errorCount++;
/*
System.err.println("Got an error, " + originalArray[i] +
" is incorrect: wasn't doubled correctly!" +
" Got " + toCheck[i] + " but should be " + originalArray[i] * 2);
*/
} else {
//System.out.println("Correct, doubled " + originalArray[i] + " and it became " +
toCheck[i]);
}
}
System.err.println("Incorrect results: " + errorCount);
}
}
Utility methods, part 2 of 2
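For reference, a quick sketch you could run to sanity-check makeGrid's rounding-up (ceiling) division – the element counts are just illustrative and the stream argument plays no part in the block count, so this repeats only the arithmetic:

public class GridSizes {
  public static void main(String[] args) {
    int numThreads = 512;
    int[] sizes = { 500, 5_000_000, 70_000_000 };
    for (int numElements : sizes) {
      // Same ceiling division makeGrid uses: always enough blocks to cover every element
      int numBlocks = (numElements + (numThreads - 1)) / numThreads;
      System.out.println(numElements + " elements -> " + numBlocks +
        " blocks of " + numThreads + " threads");
    }
  }
}

This prints 1, 9,766 and 136,719 blocks respectively – and that last figure is exactly the kind of oversized grid that sends the chunking sample above down its CudaException path on devices with a lower grid limit.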
CUDA4J module loader
package com.ibm.CUDA4JExample;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import com.ibm.cuda.CudaDevice;
import com.ibm.cuda.CudaException;
import com.ibm.cuda.CudaModule;
public final class Loader {
private final CudaModule.Cache moduleCache = new CudaModule.Cache();
final CudaModule loadModule(String moduleName, CudaDevice device) throws
CudaException, IOException {
CudaModule module = moduleCache.get(device, moduleName);
if (module == null) {
try (InputStream stream = getClass().getResourceAsStream(moduleName)) {
if (stream == null) {
throw new FileNotFoundException(moduleName);
}
module = new CudaModule(device, stream);
moduleCache.put(device, moduleName, module);
}
}
return module;
}
}
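To show where the Loader fits, here is a minimal sketch of the setup the samples above rely on. Treat it as a guess at the wiring rather than the presentation's actual code: it assumes the CUDA4J shapes used in IBM's published samples (a CudaKernel built from a module plus the kernel's name, and Parameters being CudaKernel.Parameters), and it assumes the kernel above was compiled into a classpath resource named "AdamDoubler.fatbin" (for example with nvcc -fatbin) – adjust names to your build.

package com.ibm.CUDA4JExample;
import com.ibm.cuda.*;
import com.ibm.cuda.CudaKernel.Parameters;
public class Setup {
  public static void main(String[] args) throws Exception {
    CudaDevice device = new CudaDevice(0); // First GPU in the system
    CudaModule module = new Loader().loadModule("AdamDoubler.fatbin", device);
    CudaKernel kernel = new CudaKernel(module, "Cuda_cuda4j_AdamDoubler_Strider");
    CudaStream stream = new CudaStream(device); // Kernels launch onto a stream
    int numElements = 100;
    int[] myData = new int[numElements];
    Util.fillWithInts(myData);
    CudaBuffer buffer = new CudaBuffer(device, numElements * Integer.BYTES);
    buffer.copyFrom(myData); // Host to device
    CudaGrid grid = Util.makeGrid(numElements, stream);
    kernel.launch(grid, new Parameters(2).set(0, buffer).set(1, numElements));
    buffer.copyTo(myData); // Device to host: myData now holds the doubled values
    Util.checkArrayResultsDoubler(myData, new int[numElements]);
  }
}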

Using GPUs to Achieve Massive Parallelism in Java 8

• 4. 1) No liability is accepted for any of the code I'll be sharing today and providing with this presentation at the end – use it at your own risk! My sample code isn't for production use; I've skipped plenty of application hardening techniques (checking error codes, using final, correct visibility modifiers and so on)
2) Experiments with lots of threads mean potentially lots of problems: really don't run the upcoming parallel example trying to use 50,000+ CPU threads – and if we make it to the end, don't run the "kernelception" program either
3) Messing around with graphics drivers on your work laptop isn't a good idea unless you have a good backup in place (my laptop was a headless server for a few days); I changed BIOS settings and made a few mistakes along the way – you have been warned!
• 5. What I won't cover
✗ An in-depth look at alternatives – I'll mainly be talking about Nvidia's CUDA with IBM's SDK for Java
✗ In-depth debugging and profiling
✗ Really impressive applications – I'll be talking about how to get started to give you ideas; GPUs may be a useful fit for that simple processing task you run on large amounts of data
✗ Java basics – assuming you know about Java options, building and running, and you're now interested in doing lots of operations at once as fast as possible
• 6. How popular is Java? (graphic slide; mentions z13 and BigInsights)
• 7. How many threads can we run at once in Java?
A simple Java-only example inspired by a Stack Overflow post titled "custom thread pool in Java 8 parallel stream": the goal here isn't to finish processing elements fast – it's to see how many threads reportedly get run in parallel, and how many threads I can specify before the JVM crashes
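The slide's own code isn't recoverable from this transcript, but a minimal sketch of the technique it describes might look like this. It leans on the well-known (if unofficial) behaviour that a parallel stream started from inside a ForkJoinPool task runs on that pool's workers; the class and variable names are illustrative, not from the original.

import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;
public class ManyThreads {
  public static void main(String[] args) throws Exception {
    int numElements = 50;
    int numThreads = 5; // Raise this to see when thread creation starts to fail
    ForkJoinPool pool = new ForkJoinPool(numThreads);
    // A parallel stream run from inside a pool task executes on that pool
    pool.submit(() -> IntStream.range(0, numElements).parallel().forEach(i -> {
      System.out.println(Thread.currentThread().getName() + " handling element " + i);
      try {
        Thread.sleep(1000); // Simulate slow work so the threads visibly overlap
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    })).get(); // Wait for all elements to be processed
    pool.shutdown();
  }
}

With 50 elements and 5 threads this takes around ten seconds, consistent with the timings on the next few slides.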
• 8. numElements set to 50, numThreads set to 5: takes ten seconds to finish, no problems
• 9. numElements set to 50, numThreads set to 50 also: finishes instantly – no problems
• 10. numElements set to 50,000, numThreads reduced to 10: no problems here either...
• 11. numElements set to 50,000, numThreads set to 1,000... faster, constant output, still no problems, laptop getting noisy now...
• 12. numElements 50,000, numThreads 50,000?
✔ Laptop preparing to take off from my desk
✔ No native memory to create new threads
✔ Unable to terminate the process in my shell – ^C's do nothing!
✔ Mouse stuttering around...can't...click...the x...now curious what happens
✔ JVM trying to create coredumps and javacores repeatedly, trying to eat up my disk space – no memory to create those anyway
✔ LibreOffice crashes, lost unsaved work (past experiments needing to be redone)
✔ Still can't ctrl-c to stop everything
✔ Can't launch any new processes (no chance of launching system monitor)
✔ Wanted to get a printscreen – no memory available for that either – reboot
• 13. Reaching out to GPU(s) for more processing power from Java
We'll struggle trying to run thousands of threads at once in one JVM (using a single machine and a single CPU with many cores), but using GPUs can sometimes be of use. Use cases for GPUs typically share these common themes; we want to:
● Achieve results fast
● Execute many threads to quickly process data for "easily parallelisable" operations
● Handle large amounts of data
Great for machine learning: quickly compute and store models to use later
• 14. Who's using GPUs already? Only public knowledge provided here, certainly many more than this!
AlphaGo beating a Go champion: 1,202 CPUs, 176 GPUs
Titan: 18,688 GPUs, 18,688 CPUs
CERN: reported to be using GPUs
Oak Ridge, IBM, "the world's fastest supercomputers by 2017": two, $325m
Databricks: recent blog post mentions deep learning with GPUs and Spark
• 15. Recent AI vs poker win (the top500 "bridges-supercomputer" article mentions using 64 Nvidia P100 GPUs!)
Recent Amazon cloud offering: GPUs as a service
An Nvidia email as part of the accelerated computing newsletter mentions...
● Deep learning to combat asteroids
● Detecting road lanes with deep learning
● An algorithm to identify skin cancer
● Lip reading AI more accurate than humans
● A life-changing wearable for the blind
Lots more success stories – what makes a GPU useful, worth the hype?
• 16. GPUs excel at executing many of the same operations at once (Single Instruction Multiple Data programming)
We'll program using CUDA or OpenCL – like C and C++ but not quite the same (nuances like <<< and >>> for kernels in CUDA) – and we can write JNI code to access data in our Java world using the GPU
We'll run code on computers that are shipped with graphics cards; there are free CUDA drivers for x86-64 Windows, Linux, and IBM's Power LE, and OpenCL drivers, SDK and source are also widely available (diagram: CPU and GPU)
• 17. What types of GPU can I get? Does it matter?
"Graphics adapters" you can plug a monitor into: 2 to 4 GB~ GDDR5 memory, fewer than a thousand processing cores, clock speed ~1250 MHz – typical in laptops and desktop gaming computers
* For this presentation, experiments (unless otherwise stated) were performed on my Lenovo p50 laptop (discrete graphics mode set in the BIOS, CUDA 7.5, RHEL 7.3, 32 GB RAM, M1000M Quadro GPU, 8 core Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz processor, ext4 filesystem)
• 18. HPC cards like the Tesla series: GDDR5 memory – typically 8 GB to 24 GB, 1-5 thousand processing cores, offering teraflops of performance, ~500 GB/sec max memory bandwidth*, 300W~ thermal design power rating
* Remember you're going to be limited by the PCIe bus for transfers between CPU and GPU (for CUDA devices, use the deviceQuery and bandwidthTest applications)
• 19. Running deviceQuery on my development laptop (application provided with the CUDA samples from Nvidia)
Device 0: "Quadro M1000M"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2047 MBytes (2146762752 bytes)
(4) Multiprocessors, (128) CUDA Cores/MP: 512 CUDA Cores
GPU Max Clock rate: 1072 MHz (1.07 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z): 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers: 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers: 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Quadro M1000M
Result = PASS
• 20. Running bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on... Device 0: Quadro M1000M, Quick Mode
Host to Device Bandwidth, 1 Device(s), PINNED Memory Transfers: Transfer Size (Bytes) 33554432, Bandwidth (MB/s) 12152.3
Device to Host Bandwidth, 1 Device(s), PINNED Memory Transfers: Transfer Size (Bytes) 33554432, Bandwidth (MB/s) 12225.6
Device to Device Bandwidth, 1 Device(s), PINNED Memory Transfers: Transfer Size (Bytes) 33554432, Bandwidth (MB/s) 66464.2
Device to device is quick, but the host and device interchange is far slower – compare this to direct memory access...
• 21. CPU, RAM, OS, kernel info
lscpu: Architecture x86_64; CPU op-mode(s) 32-bit, 64-bit; Byte Order Little Endian; CPU(s) 8; On-line CPU(s) list 0-7; Thread(s) per core 2; Core(s) per socket 4; Socket(s) 1; NUMA node(s) 1; Vendor ID GenuineIntel; CPU family 6; Model 94; Model name Intel(R) Core(TM) i7-6820HQ CPU, 2.70GHz; Stepping 3; CPU MHz 899.964; BogoMIPS 5424.00; Virtualization VT-x; L1d cache 32K; L1i cache 32K; L2 cache 256K; L3 cache 8192K; NUMA node0 CPU(s) 0-7
cat /proc/meminfo: MemTotal 32391628 kB
uname -a: Linux devoxx 3.10.0-514.6.1.el7.x86_64 #1 SMP Sat Dec 10 11:15:38 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
df -h: /dev/mapper/vg_oc2660338613-lv_root 438G 81G 336G 20% / (it's an SSD)
cat /etc/*-release: Linux Client for e-business (RHEL) 7.3, Open Client RHEL 7 4.30, NAME="Red Hat Enterprise Linux Workstation", VERSION="7.3 (Maipo)"
• 22. Workload characteristics a GPU can excel at
Data
● We can process lots of primitive types at once
● ints, longs, doubles, shorts, floats – perhaps used in...
● Matrix multiplication (dot product for ML?)
● Simple transforms (change our masses of longs by a known offset amount?)
● Finding a pattern in the data: count occurrences of a certain string from Wikipedia dumps
Operations
● Keep it simple – without branching and complexity
● Great for arithmetic ops (very fast floating point ops...)
• 23. Workload characteristics a GPU won't be good for
Data
● The data we need isn't "self contained" – we can't send down one whole block of data and get meaningful results as we depend on data elsewhere... lots of copying back and forth
Operations
● Non-arithmetic based – code that touches files, uses the network, manipulates objects... stick to the maths
● Involves new object creation or throwing exceptions – more on this later...
● Using threads for different instructions simultaneously – try to keep it simple without lots of if/elses
• 24. How do we use a GPU – basic principles to know
Assume we have an integer array called myData declared in a .cu file (either initialized with malloc or on the stack – a regular C style array)
1) Declare a new variable of the same type, e.g. int* myDataOnGPU
2) Allocate space on the GPU (device side) using cudaMalloc, passing in the address of myDataOnGPU and how many bytes to reserve as parameters (e.g. cudaMalloc(&myDataOnGPU, 400))
3) Copy myData from the host to your allocated space (myDataOnGPU) using cudaMemcpyHostToDevice
4) Process your data on the GPU in a kernel (we use <<< and >>>)
5) Copy the result back (what's at myDataOnGPU replaces myData on the host) using cudaMemcpyDeviceToHost
• 25. __global__ void addingKernel(int* array1, int* array2){ array1[threadIdx.x] += array2[threadIdx.x]; }
__global__: it's a function we can call on the host (CPU); it's available to be called from everywhere. __device__ and __host__ also exist
How is the data arranged and how can I access it? Sequentially; a kernel runs on a grid (numBlocks x numThreads) and this is how we can run many threads that work on different parts of the data
int* is a regular pointer to integers we've copied to the GPU
threadIdx.x: we can use this built-in variable inside of kernels as an index into our array; remember lots of threads run on the GPU and this can be our way to access each unique item – if we run a kernel <<<1, 256>>>, that means one block and 256 threads will run each time you call the kernel
  • 26. © 2017 IBM Corporation 26 Multiprocessors (also known as streaming multiprocessors or stream processors): these execute one or more thread blocks CUDA core: they execute the threads themselves Threads on a GPU: many more are available than with CPUs and these are organised into the blocks Kernel: a function we'll run on the GPU How many threads can I really run at once? Multiprocessor count X their limit e.g. 4 * 2048 with 512 CUDA cores for me A Tesla K80m has 26 multiprocessors and 4992 CUDA cores (2496 per GPU), 2048 threads per multiprocessor also. Other threads wait to be executed
• 27. © 2017 IBM Corporation 27 How much do I need to know?
All kernels must be launched with grid dimensions specified
Grid: logical 3D representation of how threads are run on a given GPU – a kernel runs on a grid. The grid has potentially many blocks, with threads organised “inside” each block (the blocks are what actually get run on the multiprocessors)
Our GPU functions (kernels) run on one of these grids, and the dimensions say how many blocks and how many threads per block a kernel should run with
The nvidia-smi command tells you about your GPU's limits – know these to prevent launch configuration problems
A good starting point is to pick 512 for the number of threads; the number of blocks then varies with your problem size – for very large problems, launch multiple kernels in a tight loop, modifying an offset so each launch operates on a different portion of the data (see the sketch after the next slide)
• 28. © 2017 IBM Corporation 28 An example to do exactly that using Java:
int log2BlockDim = 9;
int numBlocks = (numElements + 511) >> log2BlockDim;
int numThreads = 1 << log2BlockDim;

Size       Blocks  Threads
500        1       512
1,024      2       512
32,000     63      512
64,000     125     512
100,000    196     512
512,000    1000    512
1,024,000  2000    512
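To make the “tight loop with an offset” idea concrete, here is a minimal host-side sketch in CUDA C. The kernel name, the per-launch block budget and the offset parameter are illustrative assumptions, not code from this deck:

__global__ void addToEach(int* data, int amountToAdd, int offset, int numElements) {
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numElements) { // bounds check so the final, partial chunk is safe
        data[i] += amountToAdd;
    }
}

void launchInChunks(int* deviceData, int numElements) {
    const int threads = 512;
    const int maxBlocksPerLaunch = 1024;            // assumed per-launch budget
    const int chunk = maxBlocksPerLaunch * threads; // elements covered per launch
    for (int offset = 0; offset < numElements; offset += chunk) {
        int remaining = numElements - offset;
        int thisChunk = remaining < chunk ? remaining : chunk;
        int blocks = (thisChunk + threads - 1) / threads;
        addToEach<<<blocks, threads>>>(deviceData, 10, offset, numElements);
    }
}

Each launch covers at most maxBlocksPerLaunch * 512 elements, so the grid never exceeds the device limits reported by nvidia-smi.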
• 29. © 2017 IBM Corporation 29 Show me the code – simplest example, only CUDA
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h> // for malloc and EXIT_SUCCESS

const int NUM_ELEMENTS = 5;

__global__ void addToMe(int* someInts, int amountToAdd) {
    someInts[threadIdx.x] += amountToAdd;
}

// This is in foo.cu → nvcc foo.cu → ./a.out
int main() {
    int* myHostInts = (int*) malloc(sizeof(int) * NUM_ELEMENTS);
    for (int i = 0; i < NUM_ELEMENTS; i++) {
        myHostInts[i] = i;
    }
    int* myDeviceInts;
    const int numBytes = NUM_ELEMENTS * sizeof(int);
    cudaMalloc(&myDeviceInts, numBytes);
    cudaMemcpy(myDeviceInts, myHostInts, numBytes, cudaMemcpyHostToDevice);
    int numBlocks = (NUM_ELEMENTS / 256) + 1;
    addToMe<<<numBlocks, 256>>>(myDeviceInts, 10);
    cudaMemcpy(myHostInts, myDeviceInts, numBytes, cudaMemcpyDeviceToHost);
    // Tidy up after ourselves as good practice
    cudaFree(myDeviceInts);
    return EXIT_SUCCESS;
}

Slide notes: no bounds checking in the kernel – not required to compile, but it can lead to problems later (here 256 threads run against only 5 elements). Printing threadIdx.x here will print 0 to 255. Blocks = a group of threads; I'll use a 2D grid (just lots of blocks/threads) in this presentation. Look at our kernel dimensions: numBlocks will be 1 and 256 is the number of threads – this is how we control how many threads to run.
• 30. © 2017 IBM Corporation 30 How might we use a GPU with Java or Scala?
● [Java] We have an integer array on the Java heap: myData – we want to process it somehow using a GPU
● [Java] Create a native method (Java/Scala): no body required
● [JNI] Write .cpp or .c code with a matching signature for your native method (use javah on your built Java class as a starting point). In this native code, use JNI methods to get a pointer to your data; with this pointer we can figure out how much memory we need. Call your method that's in a .cu file that you're about to create...
● [CUDA] Allocate space on the GPU (device side) using cudaMalloc
● [CUDA] Copy myData to your allocated space (myDataOnTheGPU) using cudaMemcpyHostToDevice
● [CUDA] Process your data on the GPU in a kernel (look for <<< and >>>)
● [CUDA] Copy the result back (what's now at myDataOnTheGPU replaces myData on the host) using cudaMemcpyDeviceToHost
● [JNI] Release the elements (updating your JNI pointer so the data in our JVM heap is now the result)
● [Java] Interact with your data normally as you're back in the Java world
  • 31. © 2017 IBM Corporation 31 Working example: file overview and script
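A sketch of what a BuildAndRun.sh for this example might look like, assuming the file names seen elsewhere in this deck (SimpleJava, lib/devoxx.so) and that JAVA_HOME points at an IBM JDK 8 with the CUDA toolkit installed; the AddX.cu/SimpleJava.cpp file names and exact flags are my assumptions:

#!/bin/sh
set -e
javac SimpleJava.java
javah SimpleJava                                  # emits SimpleJava.h for the native method
nvcc -c -Xcompiler -fPIC AddX.cu -o AddX.o        # compile the kernel and its launcher
g++ -shared -fPIC -I"$JAVA_HOME/include" -I"$JAVA_HOME/include/linux" \
    SimpleJava.cpp AddX.o -o lib/devoxx.so \
    -L/usr/local/cuda/lib64 -lcudart              # link the JNI glue and the CUDA runtime
java SimpleJava                                   # SimpleJava loads lib/devoxx.so itself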
  • 32. © 2017 IBM Corporation 32 Working example: Java code
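A minimal sketch of the Java side, reusing the class and native method names that appear on the pitfalls slides later (SimpleJava, addXToMyInts, lib/devoxx.so); everything else is assumed:

public class SimpleJava {
    // Implemented natively: adds x to every element of myInts in place
    private static native void addXToMyInts(int[] myInts, int x);

    public static void main(String[] args) {
        // Load the shared library built by BuildAndRun.sh
        System.load(new java.io.File("lib/devoxx.so").getAbsolutePath());
        int[] myInts = new int[100];
        for (int i = 0; i < myInts.length; i++) {
            myInts[i] = i;
        }
        addXToMyInts(myInts, 10);
        System.out.println("First: " + myInts[0] + ", last: " + myInts[99]);
    }
}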
  • 33. © 2017 IBM Corporation 33 Working example: header file
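Running javah over the built class produces a header along these lines (the symbol matches the mangled name shown on the pitfalls slide; treat this as a reconstruction):

/* DO NOT EDIT THIS FILE - it is machine generated */
#include <jni.h>
#ifndef _Included_SimpleJava
#define _Included_SimpleJava
#ifdef __cplusplus
extern "C" {
#endif
/*
 * Class:     SimpleJava
 * Method:    addXToMyInts
 * Signature: ([II)V
 */
JNIEXPORT void JNICALL Java_SimpleJava_addXToMyInts
  (JNIEnv *, jclass, jintArray, jint);
#ifdef __cplusplus
}
#endif
#endif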
  • 34. © 2017 IBM Corporation 34 Working example: c++ code (matching the header)
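A sketch of the matching C++ implementation: pin the Java array, hand the raw pointer to a function defined in the .cu file, then release the elements so the Java heap sees the result. The addXOnGPU name is my invention:

#include "SimpleJava.h"

// Defined in the .cu file (next slide); extern "C" avoids name mangling mismatches
extern "C" void addXOnGPU(int* data, int numElements, int x);

JNIEXPORT void JNICALL Java_SimpleJava_addXToMyInts
  (JNIEnv* env, jclass clazz, jintArray myInts, jint x) {
    jint* elements = env->GetIntArrayElements(myInts, NULL); // C++ style: env->
    if (elements == NULL) {
        return; // allocation failed; an OutOfMemoryError is already pending
    }
    jsize numElements = env->GetArrayLength(myInts);
    addXOnGPU((int*) elements, (int) numElements, (int) x);
    // Mode 0: copy changes back to the Java heap and free the native buffer
    env->ReleaseIntArrayElements(myInts, elements, 0);
}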
  • 35. © 2017 IBM Corporation 35 Working example: .cu code (and a simple kernel)
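And a sketch of the .cu side: an extern "C" launcher the C++ glue can call, plus a simple kernel. The 512-thread block size follows the earlier slides; the rest is assumed:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void addXKernel(int* data, int numElements, int x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numElements) { // bounds check: the last block may be partial
        data[i] += x;
    }
}

extern "C" void addXOnGPU(int* data, int numElements, int x) {
    int* deviceData;
    size_t numBytes = numElements * sizeof(int);
    cudaMalloc(&deviceData, numBytes);
    cudaMemcpy(deviceData, data, numBytes, cudaMemcpyHostToDevice);
    int threads = 512;
    int blocks = (numElements + threads - 1) / threads;
    addXKernel<<<blocks, threads>>>(deviceData, numElements, x);
    cudaError_t err = cudaDeviceSynchronize(); // surfaces launch/runtime errors
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    }
    cudaMemcpy(data, deviceData, numBytes, cudaMemcpyDeviceToHost);
    cudaFree(deviceData);
}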
  • 36. © 2017 IBM Corporation 36 Working example: checking the results
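Verifying the round trip back in Java is then a plain loop over the array, for example:

// After addXToMyInts(myInts, 10) returns:
int errors = 0;
for (int i = 0; i < myInts.length; i++) {
    if (myInts[i] != i + 10) { // the array was filled with 0..n-1 before the call
        errors++;
    }
}
System.out.println(errors == 0 ? "All correct!" : errors + " wrong values");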
• 37. © 2017 IBM Corporation 37 Pitfalls to look out for
objdump mysharedlibrary.so -t | grep yourmethodname is very useful for unsatisfied link errors...
[aroberts@devoxx withjava]$ objdump lib/devoxx.so -t | grep "addX"
00000000000053c4 g F .text 000000000000005f _Z28Java_SimpleJava_addXToMyIntsP7JNIEnv_P7_jclassP10_jintArrayi
Name mangling can occur (wrap declarations in extern "C" { } blocks in your .cpp and .cu code)
[aroberts@devoxx withjava]$ ./BuildAndRun.sh
Unhandled exception
Type=Segmentation error vmState=0x00000000
Unsafe world now – check your memory accesses: ensure all of your pointers are still valid. Use printfs and gdb for debugging, and Nsight/cuda-gdb/cuda-memcheck for CUDA-specific help – more on this later
• 38. © 2017 IBM Corporation 38 printf statements added...
[aroberts@devoxx withjava]$ ./BuildAndRun.sh
getting elements
got em!
launching kernel...
addToMe+0x20 (0x00007F48441B630F [devoxx.so+0x530f])
Java_SimpleJava_addXToMyInts+0x5c (0x00007F48441B6440 [devoxx.so+0x5440])
(0x00007F4854264F9B [libj9vm28.so+0x8ff9b])
Unhandled exception ^^ Check your memory accesses!
Remember to call your kernel with the <<< and >>> syntax (in a .cu file)
Remember to use your device pointer variable as the parameter in your kernel (not the host one) – or you won't be able to modify your data (the kernel will still launch but it acts on nothing, so your data will remain unchanged)
You can add printf statements inside of your kernels (printing threadIdx.x, which you're likely using as an index into an array, is a good idea)
Yes, you should add bounds checking inside of your kernels
Yes, you should check return codes and use cudaError_t
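One common pattern for that return-code checking – a small sketch; the CUDA_CHECK macro name is my own, not an SDK or deck convention:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
                    __FILE__, __LINE__, cudaGetErrorString(err));   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc(&myDeviceInts, numBytes));
// CUDA_CHECK(cudaMemcpy(myDeviceInts, myHostInts, numBytes, cudaMemcpyHostToDevice));
// myKernel<<<blocks, threads>>>(myDeviceInts);
// CUDA_CHECK(cudaGetLastError());      // catches bad launch configurations
// CUDA_CHECK(cudaDeviceSynchronize()); // catches errors raised during execution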
• 39. © 2017 IBM Corporation 39 Is there a simpler way? Sticking to Java as much as possible
● Lots of Java projects we want to use
● Error checking
● Type safety
● Debugging tools (core dumps, javacores, system dumps, GCMV, MAT)...
● Profiling tools (Health Center, jprof)...
● JIT compiler and a garbage collector
● Portability – at least until you “go native”: mix byte-ordering across machines while using sun.misc.Unsafe, use other internal APIs relying on field names, or find there's no JRE available...
The approaches we've taken
● Java Class Library changes
● Just-In-Time Compiler changes
● CUDA4J API
● Apache Spark changes (runs in JVMs)
• 40. © 2017 IBM Corporation 40 Making it easier: Java class library modification
-Dcom.ibm.gpu.enable/enforce/disable/verbose
[Chart: sorting throughput for ints – ints sorted per second (from roughly 40m to 400m per sec) against array length (30,000 up to 300,000,000 elements)]
Details online here
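On the Java side nothing changes – the class library decides whether to offload. A minimal sketch: the property names follow the slide, while the exact value accepted (e.g. sort) and the size-threshold behaviour are my assumptions:

import java.util.Arrays;
import java.util.Random;

public class GpuSortDemo {
    public static void main(String[] args) {
        int[] data = new Random().ints(40_000_000).toArray();
        // Run with: java -Dcom.ibm.gpu.enable=sort -Dcom.ibm.gpu.verbose GpuSortDemo
        // On an IBM JDK with a CUDA-capable GPU, a large sort like this can be
        // offloaded automatically; small arrays stay on the CPU, where the
        // PCIe transfer cost wouldn't pay off.
        Arrays.sort(data);
        System.out.println("First: " + data[0] + ", last: " + data[data.length - 1]);
    }
}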
• 41. © 2017 IBM Corporation 41 Making it easier: Java JIT compiler modification
-Xjit:enableGPU (optionally ={verbose}); offloading can be forced with ={enforce}
Using three arrays of randomly generated doubles: output, firstArray, secondArray – all of size ROWS x ROWS (ROWS = 2048 here)
Use an IntStream and specify our JIT option
Primitive types can be used (byte, char, short, int, float, double, long)
Run this inside a loop for an easily reproducible example – the code must be JIT-hot for this to make an impact
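A sketch of what such a lambda can look like – a matrix multiply written as a parallel forEach over 1D primitive arrays. The class name matches the verbose output on the next slide; the loop structure and array layout are my assumptions:

import java.util.Random;
import java.util.stream.IntStream;

public class MatMulti {
    static final int ROWS = 2048; // compile-time constant
    // 1D arrays of primitives held in instance fields: eligible for offload
    double[] output = new double[ROWS * ROWS];
    double[] firstArray = new double[ROWS * ROWS];
    double[] secondArray = new double[ROWS * ROWS];

    void runGPULambda() {
        IntStream.range(0, ROWS * ROWS).parallel().forEach(idx -> {
            int i = idx / ROWS;
            int j = idx % ROWS;
            double sum = 0;
            for (int k = 0; k < ROWS; k++) { // counted loop, linear accesses
                sum += firstArray[i * ROWS + k] * secondArray[k * ROWS + j];
            }
            output[idx] = sum; // primitives only: no new objects, no exceptions
        });
    }

    public static void main(String[] args) {
        MatMulti m = new MatMulti();
        Random r = new Random();
        for (int i = 0; i < ROWS * ROWS; i++) {
            m.firstArray[i] = r.nextDouble();
            m.secondArray[i] = r.nextDouble();
        }
        for (int iter = 0; iter < 5; iter++) { // repeat so the JIT gets hot
            long start = System.nanoTime();
            m.runGPULambda();
            System.out.println((System.nanoTime() - start) / 1e6 + " msec");
        }
    }
}

Run with -Xjit:enableGPU={verbose} and, once the method is hot, the JIT reports whether the forEach was sent to the GPU.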
• 42. © 2017 IBM Corporation 42 Results on my laptop
[IBM GPU JIT]: Device Number 0: name=Quadro M1000M, ComputeCapability=5.0
Setting up our arrays, size is 2048x2048
Done setting up!
Starting the GPU enabled lambda, running GPU enabled lambda, parallelism: 1
End time: 42575.864909 msec
Starting the GPU enabled lambda, running GPU enabled lambda, parallelism: 1
End time: 41080.132863 msec
Starting the GPU enabled lambda, running GPU enabled lambda, parallelism: 1
[IBM GPU JIT]: [time.ms=1489774852380]: Launching parallel forEach in com/ibm/MatMultiExample/MatMulti.runGPULambda()V at line 139 on GPU
[IBM GPU JIT]: [time.ms=1489774853402]: Finished parallel forEach in com/ibm/MatMultiExample/MatMulti.runGPULambda()V at line 139 on GPU
End time: 1042.937686 msec
With no JIT options provided, over 100 iterations (instead of just five) I still achieve a best time of 42 seconds. With more threads (setting the fork-join common pool parallelism property to 8 or 32 instead of 1) my best time is 32 seconds – still much slower than the GPU run
  • 43. © 2017 IBM Corporation 43 Measured performance improvement with a GPU using four programs 1-CPU-thread sequential execution 160-CPU-thread parallel execution Experimental environment used IBM Java 8 Service Release 2 for PowerPC Little Endian Two 10-core 8-SMT IBM POWER8 CPUs at 3.69 GHz with 256GB memory (160 hardware threads in total) with one NVIDIA Kepler K40m GPU (2880 CUDA cores in total) at 876 MHz with 12GB global memory (ECC off) Performance of GPU enabled lambdas
  • 44. © 2017 IBM Corporation 44 This shows GPU execution time speedup amounts compared to what's in blue (1 CPU thread) and yellow (160 CPU threads) The higher the bar, the bigger the speedup!
• 45. © 2017 IBM Corporation 45
Name      Summary                                     Data size                 Data type
MM        Dense matrix multiplication: C = A x B      1024 x 1024 (1m) items    double
SpMM      As above but sparse                         500k x 500k (250m) items  double
Jacobi2d  Solve an equation using the Jacobi method   8192 x 8192 (67m) items   double
LifeGame  Conway's Game of Life with 10k iterations   512 x 512 (262k) items    byte
• 46. © 2017 IBM Corporation 46
[Diagram: bytecodes → intermediate representation → optimizer → CPU code generator → CPU native code, and optimizer → GPU code generator → PTX ISA]
As the JIT compiles a stream expression we can identify candidates for GPU off-loading
● Data copied to and from the device implicitly
● Java operations mapped to GPU kernel operations
● JIT takes care of GPU data alignment, cache management
● Optimizes data transfer
● Manages multiple devices
Advantages
● Reuses standard Java idioms, so no new API is required
● Preserves standard Java semantics
● No knowledge of GPU programming model required by the application developer
● Takes care of low level details: GPU devices capabilities, etc.
● Chooses optimal execution mode: CPU, GPU, or SIMD
● Future performance improvements in the JIT do not require code changes
• 47. © 2017 IBM Corporation
[Diagram: the JVM (class loading, method resolution, object creation and GC, exception handling) holds a Java array on the CPU; a copy of the array travels over PCIe to the GPU, where optimized lambda code is executed by multiple threads in a data-parallel manner with exception detection; redirection back to the CPU can happen at compile time or at run time]
Limitations – the JIT compiler will check that the lambda expression satisfies the following criteria:
● Only accesses primitive types, and one-dimensional arrays of primitive types
● No access to static scalar variables: only locals, parameters, or instance variables
● No unresolved or native methods
● No creating new heap objects (new ...), exceptions (throw ...), or instanceof
● Intermediate stream operations like map or filter are not supported
GPU memory isn't an extension of the Java heap
• 48. © 2017 IBM Corporation 48 JIT based optimizations: performance heuristics
• JIT applies various performance heuristics to determine execution mode of the lambda expression (sequential, fork-join, GPU, or SIMD)
• Heuristics depend on numerous factors and may change in the future to become more accurate, to deal with new architecture characteristics, etc
• Currently, they are relatively conservative
• We will work on new heuristics based on customer feedback
• To observe if forEach was sent to GPU use -Xjit:enableGPU={verbose}
• To override performance heuristics use -Xjit:enableGPU={enforce}
• For combining options: -Xjit:enableGPU="enforce|verbose" will work; the quotes are important lest your bash shell interpret | as a pipe!
• Give it a go for yourself, keeping the criteria for code to be eligible for GPU offloading in mind
• We are using NVVM IR
• 49. © 2017 IBM Corporation 49 Making it easier: CUDA4J API
Production ready and supported by IBM – used to manipulate GPU devices
● Compared to Jcuda: no arbitrary and unrestricted use of Pointer(long); feels more like Java instead of C
Write your CUDA kernel (yes, the hard part!) and compile it into a fat binary:
nvcc --fatbin AdamKernel.cu
Add your Java code:
import com.ibm.cuda.*;
import com.ibm.cuda.CudaKernel.*;
Load your fat binary (module loading code at the end of this presentation):
module = new Loader().loadModule("AdamDoubler.fatbin", device);
Build and run as you would any other Java application
• 50. © 2017 IBM Corporation 50 CUDA4J class mapping
There are times when you want this low level GPU control from Java
● We developed an API that reflects the concepts familiar in CUDA programming
● Makes use of Java exceptions, automatic resource management, etc.
● Handles copying data to/from the GPU, flow of control from Java to GPU and back
● Ability to invoke existing GPU module code from Java applications e.g. Thrust
CudaDevice – a CUDA capable GPU device
CudaStream – a sequence of operations on the GPU
CudaBuffer – a region of memory on the GPU
CudaModule – user library of kernels to load into GPU
CudaKernel – launching a device function
CudaFunction – a kernel's entry point
CudaEvent – timing and synchronization
CudaException – when something goes wrong
• 51. © 2017 IBM Corporation 51
Only doubling integers; could be any use case where we're doing the same operation to lots of elements at once
Full code listing at the end; Javadocs: search IBM Java 8 API com.ibm.cuda
* Tip: the offsets are byte offsets, so you'll want your index in Java * the size of the object!

module = new Loader().loadModule("AdamDoubler.fatbin", device);
kernel = new CudaKernel(module, "Cuda_cuda4j_AdamDoubler_Strider");
stream = new CudaStream(device);
numElements = 100;
myData = new int[numElements];
Util.fillWithInts(myData);
CudaGrid grid = Util.makeGrid(numElements, stream);
buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
buffer1.copyFrom(myData);
Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
kernel.launch(grid, kernelParams);
buffer1.copyTo(myData);

If our dynamically created grid dimensions are too big we need to break down the problem and use the slice* API: see doChunkingProblem() in the full listing
Our kernel compiles into AdamDoubler.fatbin
• 52. © 2017 IBM Corporation 52 Where would this be useful?
● Integrating CUDA GPU offloading support into your existing Java applications without needing to worry about JNI, makefiles, managing GPU memory and writing C++ code (you still need to write your kernel)
● Identify your most commonly used functions as candidates (simple manual profiling or using tools such as Health Center for method profiling)
● Tinker with heuristics and benchmark new capabilities
● Be wary of the GPU limitations (e.g. device memory amount, max grid size – you may need to chunk up your problem)
• 53. © 2017 IBM Corporation 53 Improving Apache Spark
● Open source project (the most active for big data) offering distributed...
● Machine learning
● Graph processing
● Core operations (map, reduce, joins)
● SQL syntax with DataFrames/Datasets
● Many input formats supported e.g. Parquet, JSON, files stored on HDFS you can parse trivially, CSV with a Databricks package
● Interoperability with Kafka, Hive, many more (see Apache Bahir also)
● Compression codecs and automatic usage, fast serialization with Kryo
● Offers scalability and resiliency
● Lots of Scala – so it runs in JVMs (exploits sun.misc.Unsafe heavily)
● Python, R, Scala and Java APIs
● Eligible for our Java based optimisations
Ask after the talk for more details – my current role involves contributing to, fixing and evangelising Spark as well as producing demos and working on customer problems; I'm especially interested in data visualization tools
  • 54. © 2017 IBM Corporation 54 What machine learning algorithms? Popular algorithms that'll run in a (potentially) distributed manner include ● Alternating least squares ● K-means ● Naive Bayes ● Logistic regression ● Random forests ● Decision trees ● Principal component analysis
  • 55. © 2017 IBM Corporation 55 ● Recommendation algorithms such as ● Alternating Least Squares ● Movie recommendations on Netflix? ● Recommended purchases on Amazon? ● Similar songs with Spotify? ● Recommended videos on YouTube? ● Clustering algorithms such as ● K-means (unsupervised learning (no labels, cheap)) ● Produce clusters from data to determine which cluster a new item can be categorised as ● Identify anomalies: transaction fraud or erroneous data ● Classification algorithms such as ● Logistic regression ● Create a model that we can use to predict where to plot the next item in a sequence (above or below our line of best fit) ● Healthcare: predict adverse drug reactions based on known interactions between similar drugs ● Spam filter (binomial classification) Known good candidates
• 56. © 2017 IBM Corporation 56 How Alternating Least Squares works
An example: we have the following .csv file for bands...
<username, band name, band genre (a feature), rating>
Adam,ACoolBand1,AGenre,5
Adam,ACoolBand2,AGenre,5
Adam,ACoolBand3,AGenre,5
George,ACoolBand1,AGenre,5
George,ACoolBand2,AGenre,5
George,ACoolBand3,AGenre,5
George,ACoolBand4,AGenre,5
So if we were to guess whether Adam likes ACoolBand4 as well, the score would be very close to 5 – we can infer it from the already known observations.
Very much simplified, ALS works by factorizing the rating matrix and minimising the loss on observed ratings (our ratings matrix will be sparse and we want to complete it – see the Nvidia GTC 2016 talk “CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs” by Wei Tan for an excellent summary)
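For context, training such a recommender with Spark MLlib looks roughly like this – a minimal sketch in Java, assuming the usernames and band names have already been mapped to integer IDs (the rank, iteration count and lambda values are arbitrary):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class BandRecommender {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("ALSExample").setMaster("local[*]"));
        // user 0 = Adam, user 1 = George; products 0-3 = ACoolBand1-4
        List<Rating> data = Arrays.asList(
                new Rating(0, 0, 5), new Rating(0, 1, 5), new Rating(0, 2, 5),
                new Rating(1, 0, 5), new Rating(1, 1, 5), new Rating(1, 2, 5),
                new Rating(1, 3, 5));
        JavaRDD<Rating> ratings = sc.parallelize(data);
        // rank 10, 10 iterations, regularization 0.01
        MatrixFactorizationModel model = ALS.train(JavaRDD.toRDD(ratings), 10, 10, 0.01);
        System.out.println("Adam x ACoolBand4: " + model.predict(0, 3)); // expect ~5
        sc.stop();
    }
}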
• 57. © 2017 IBM Corporation 57 But what if there's a band way down the list in a place we can't fit into memory?
● Zack_Zwick: ObscureBand27, AGenre, 5
How can we infer that Adam will also probably like ObscureBand27?
● We still want to be using GPUs (and we want the results fast – Adam's using a premium online service and doesn't want to wait hours to get a good match)
● Covered in the paper I cite next (by Wei Tan (IBM), Liaoliang Cao (Yahoo!; IBM at the time of the work) and Liana Fong (IBM)); to summarise:
“cuMF first generates a partition scheme, planning which partition to send to which GPU in what order. With this knowledge in advance, cuMF uses separate CPU threads to preload data from disk to host memory, and separate CUDA streams to preload from host memory to GPU memory. By this proactive and asynchronous data loading, we manage to handle out-of-core problems with close-to-zero data loading time except for the first load.”
● https://github.com/cuMF/cumf_als explains how to run this in batches
• 58. © 2017 IBM Corporation 58 Our approach for Apache Spark
● Under the covers optimisation: set the spark.mllib.ALS.useGPU property
● Full paper: http://arxiv.org/abs/1603.03820
● Full implementation for raising issues and giving it a try for yourself: https://github.com/IBMSparkGPU, with 1.5 GB of a Netflix dataset:

Experiment   12 threads, CPU   64 threads, CPU   2x GPUs
Time         676 seconds       N/A               140 seconds

Our implementation is open source and cited above; we used: 2x Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz, 16 cores in the machine (SMT-2), 256 GB RAM vs 2x Nvidia Tesla K80Ms. Also available for IBM Power LE.
• 59. © 2017 IBM Corporation 59
Implemented the vanilla C++/Java/CUDA way so this would work with any JDK (a tiny amount of C++ code and lots of CUDA for Spark – we only override one function)
● modified the existing ALS (.scala) implementation's computeFactors method
● added code to check if spark.mllib.ALS.useGPU is set
● if set, we'll then call our native method written to use JNI (.cpp)
● our JNI method calls a native CUDA (.cu) method
● CUDA is used to send our data to the GPU, call our kernel, and return the results over JNI back to the Java heap
Bundled with our Spark distribution, and the shared library is included: libGPUALS.so
● Remember this will require the CUDA runtime (libcudart) and a capable GPU
Call chain: ALS.scala (computeFactors) → CuMFJNIInterface.cpp → ALS.cu → libGPUALS.so
• 60. © 2017 IBM Corporation 60 More pervasive GPU opportunities for Spark
We can send generated code to GPUs and alter the code that's generated to conform to certain characteristics...
Input: user application using the Spark DataFrame or Dataset API (SQL-like syntax, perform queries on data stored in tables)
✔ Spark with Tungsten. Uses UnsafeRow and sun.misc.Unsafe; the idea is to bring Spark closer to the hardware than previously, exploit CPU caches, improve memory and CPU efficiency, reduce GC times, avoid Java object overheads – good deep dive here
✔ Spark with Catalyst. Optimiser for the Spark SQL APIs, good deep dive here; transforms a query plan (an abstraction of a user's program) into an optimised version, and generates optimised code with the Janino compiler
✔ Spark with our changes: Java and core Spark class optimisations, optimised JIT
  • 61. © 2017 IBM Corporation 61 Output: generated code able to leverage auto-SIMD and GPUs We want generated code that: ✔ has a counted loop, e.g. one controlled by an automatic induction variable that increases from a lower to an upper bound ✔ accesses data in a linear fashion ✔ has as few branches as possible (simple for the GPU's kernel) ✔ does not have external method calls or contains only calls that can be easily inlined These help a JIT to either use auto-SIMD capabilities or GPUs
• 62. © 2017 IBM Corporation 62
Problems
1) Data representation of columnar storage (CachedBatch with Array[Byte]) isn't commonly used
2) Compression schemes are specific to CachedBatch, limited to just several data types
3) Building the in-memory cache involves a long code path -> virtual method calls, conditional branches
4) Generated whole-stage code -> unnecessary conversion from CachedBatch or ColumnarBatch to UnsafeRow
Solutions
1) Use ColumnarBatch format instead of CachedBatch for the in-memory cache generated by the cache() method. ColumnarBatch and ColumnVector are commonly used data representations for columnar storage
2) Use a common compression scheme (e.g. lz4) for all of the data types in a ColumnVector
3) Generate code at runtime that is simple and specialized for building a concrete instance of the in-memory cache
4) Generate whole-stage code that directly reads data from columnar storage
(1) and (2) increase code reuse, (3) improves runtime performance of executing the cache() method and (4) improves performance of user defined DataFrame and Dataset operations
• 63. © 2017 IBM Corporation 63
We propose a new columnar format, CachedColumnarBatch, that has a pointer to ColumnarBatch (used by the Parquet reader) and keeps each column as OnHeapUnsafeColumnVector instead of OnHeapColumnVector. Not yet using GPUs!
● [SPARK-13805], merged into 2.0, performance improvement: 1.2x
Get data from ColumnVector directly by avoiding a copy from ColumnVector to UnsafeRow when a program reads data in Parquet format
● [SPARK-14098], targeted for Spark 2.2, performance improvement: 3.4x
Generate optimized code to build CachedColumnarBatch, get data from a ColumnVector directly by avoiding a copy from the ColumnVector to UnsafeRow, and use lz4 to compress the ColumnVector when df.cache() or ds.cache() is executed
● [SPARK-15962], merged into 2.1, performance improvement: 1.7x
Remove the indirection at the offsets field when accessing each element in UnsafeArrayData, reducing the memory footprint of UnsafeArrayData
• 64. © 2017 IBM Corporation 64
● [SPARK-15985], merged into 2.1, performance improvement: 1.3x
Eliminate boxing operations to put a primitive array into GenericArrayData when a Dataset program with a primitive array is run
● [SPARK-16213], merged into 2.2, performance improvement: 16.6x
Eliminate boxing operations to put a primitive array into GenericArrayData when a DataFrame program with a primitive array is run
● [SPARK-17490], merged into 2.1, performance improvement: 2.0x
Eliminate boxing operations to put a primitive array into GenericArrayData when a DataFrame program with a primitive array is used
  • 65. © 2017 IBM Corporation 65 ● Another way to make exploiting GPUs easier ● We know how to build GPU based applications ● We can figure out if a GPU is available ● We can figure out what code to generate ● We can figure out which GPU to send that code to ● All while retaining Java safety features such as exceptions, bounds checking, serviceability, tracing and profiling hooks... Assuming you have the hardware, add an option and watch performance improve: this is ongoing work that can likely be applied to other projects What's in it for me if I care about Spark?
• 66. © 2017 IBM Corporation 66 Challenges for GPU programming
We want developers to be aware of these so we can work together
● Restricted by PCIe speed (less so with IBM hardware – NVLink on POWER)
● Writing a decent kernel is hard: optimum use of the different memory types (global, shared, texture, registers), debugging (lots of seg faults – you're in the CUDA world now!), the limited set of functions you can use in a kernel, maintaining contiguous access where possible
● Not many GPU developers out there relative to other language pros: we want developers who know machine learning, know Java, know CUDA, and know how to debug and profile
● Watch lots of videos and experiment – breaking things as you go and learning; you need to achieve maximum parallelism and avoid seg faults – good fun
● Big changes to the CUDA SDK itself: this is for CUDA 7.5 and I learned with CUDA 5.5, so there are likely lots of new features I'm not exploiting – keep up to date, I've seen important bug fixes going into driver/SDK releases
  • 67. © 2017 IBM Corporation 67 ● Profiling – many variables to tweak (and therefore many opportunities for benchmarking fun) ● More pitfalls than Java unless you're using sun.misc.unsafe or JNI ● CUDA was initially a problem to set up on my laptop (wanting to keep my desktop, use the Nvidia driver, use the CUDA toolkit AND a projector…) ● Debugging in a massively multithreaded environment...be careful of race conditions ● Ideally developers can focus on the kernel logic and design principles instead of how to write GPU code and how to manage things like scheduling and partitioning strategies (if you were to accelerate Apache Spark for example – lots of challenges here)
• 68. © 2017 IBM Corporation 68 What else could possibly go wrong?
Bad JNI code
● Getting and releasing elements: invalid pointers, incorrect usage!
● Exceeding bounds
● Mixing C and C++: (*env)-> is the C style, env-> is the C++ style
Bad Java code
● sun.misc.Unsafe usage leading to seg faults
Bad CUDA code
Lots can go wrong here...
● Bad kernel – exceeding bounds, sending junk data, inefficient use of memory types
● Bad cudaMalloc, cudaMemcpy calls
● Not checking return codes
Bad C/C++ code
● A presentation in its own right
• 69. © 2017 IBM Corporation 69 Does it need to be in Java at all?
Bad design – your application just isn't a good fit for GPUs
● Not enough data for it to be worthwhile
● Too complex a problem – making new objects, lots of branching code, NOT doing lots of floating point/arithmetic operations...
● Way too much data to fit on GPUs, and it'll be very difficult/time consuming to chunk it up (not all problems are going to be model parallelisable or data parallelisable)
“My CPUs are already good and cheap – and I'm not the one managing them anyway! I'll just get a few more instead of that brand spanking new GPU I may need to learn CUDA for...”
● Use Apache Spark instead?
● Purchase more effective hardware – perhaps IO/network devices?
● Profiling – find out you're not actually compute bound?
● Write more efficient Java/Scala code (good use of private/final, benchmarking everything – again a talk in its own right; search “London Spark Meetup IBM Spark in production”)
  • 70. © 2017 IBM Corporation 70 I'll only briefly cover each project as these are all excellent projects in their own right and worthy of their own detailed talks ● OpenCL ● Low level framework simplifies development for devices such as GPUs, FPGAs, works on AMD cards too, maintained by Khronos Group ● TensorFlow ● Java, C++, Python APIs, built for machine learning, researchers, data scientists, open source (mainly developed by Google) – features offloading to CUDA devices (requiring the toolkit/driver/cuDNN) ● SystemML ● IBM open-sourced project (now an Apache incubator), recently committed GPU support (yet to try) – write code in DML, easily scale out once ready Interesting projects related to GPUs/Java
  • 71. © 2017 IBM Corporation 71 ● Jcuda ● Alternative to CUDA4J (no IBM Java requirement) – more like C than Java, plenty of bindings for Nvidia libraries available and open source ● DeepLearning4j ● Want to find out more on this in particular: the most popular open source deep-learning framework for the JVM – mentions built-in GPU support so desperate to try this out; anything making GPU exploitation easier is good ● Nvidia libraries such as ● cuDNN ● Deep neural network library for CUDA devices ● cuBLAS ● Basic linear algebra subroutines on the GPU ● Thrust ● Precanned algorithms for HPC on CUDA devices (e.g. sorting)
  • 72. © 2017 IBM Corporation 72 ● Aparapi ● Excellent video by AMD Runtimes team on it (very few views on YouTube, highly recommended) ● Java API for the GPU – also Java styled like CUDA4J ● Converts Java code to OpenCL ● Requires overloading the run routine in a kernel class already (like you would for java.lang.Thread) ● jar file and shared libraries with JNI (so no JVM changes) ● Project Sumatra ● OpenJDK initiative for GPU support as part of the Java SDK ● Excellent video: “Sumatra OpenJDK Project Update” with < 120 views on YouTube! ● Details their approach to optimising forEach and reduce using GPUs ● Not tried this – mentions building Graal and Sumatra to get a “HSA enabled Graal-based JDK”, would be interesting to either collaborate/compare findings, and to look into Wholly Graal (another good video on YouTube about this – again with very few views)
• 73. © 2017 IBM Corporation 73 Simple debugging example
BuildAndRun.sh uses nvcc and then executes the resulting a.out; the code involves:
int numElements = 5000000;
I initialize the arrays with the following code (the cause of the problem):
double first[numElements];
double second[numElements];
Then do some CUDA stuff...so let's get a core to inspect
• 74. © 2017 IBM Corporation 74
[aroberts@devoxx fun]$ gdb core.20916
[New LWP 20916]
Core was generated by `./a.out'.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000040263a in ?? ()
"/home/aroberts/fun/core.20916" is a core file.
Please specify an executable to debug.
(gdb) bt
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7ffd88544d10:
(gdb) info registers
rax 0x0 0
rbx 0x0 0
rcx 0xa0 160
rdx 0x7ffd88544d10 140726890679568
rsi 0x7ffe47106e58 140730090679896
rdi 0x1 1
rbp 0x7ffe47106d70 0x7ffe47106d70
rsp 0x7ffd88544d10 0x7ffd88544d10 ✔ stack pointer
✔ We get a memory error
➢ How big is numElements?
➢ You want an array on the stack that's how big?!
• 75. © 2017 IBM Corporation 75
1) Rule out a CUDA programming mistake first (GPU side) – tweaking the kernel, judicious use of printf statements
2) Rule out a C++ side mistake (host side)
Tools can help here if there's a standalone version of the application first – tuned and tweaked before being called out to from Java
cuda-memcheck ./a.out
========= CUDA-MEMCHECK
========= Error: process didn't terminate successfully
========= The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under cuda-gdb or Nsight Eclipse Edition to catch host side errors.
========= Internal error (20)
========= No CUDA-MEMCHECK results found
• 76. © 2017 IBM Corporation 76
Add “-g” to the nvcc compiler options for debug symbols, then run cuda-gdb again with your application (e.g. cuda-gdb a.out)
(cuda-gdb) run
Starting program: /home/aroberts/devoxx/./a.out
Program received signal SIGSEGV, Segmentation fault.
0x0000000000402657 in main () at MatMulti.cu:14
14 printf("Starting\n");
But that's the first line in my program!
(cuda-gdb) c
Continuing.
Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
I change my allocations to go on the heap using malloc. “I didn't need to worry about that in Java!”
• 77. © 2017 IBM Corporation 77 And what happens if...
● We ask for too much memory on the GPU
cudaError_t ret = cudaMalloc(&whopper, 1337999999);
“out of memory” reported (using cudaGetErrorString(ret))
● We specify too big a kernel, using huge dimensions (so we must be careful generating these programmatically, i.e. when using CUDA4J)
__global__ void callMe(int anything) {}
callMe<<<100000,1000>>>(100): CUDA error “invalid argument” reported (I used cudaDeviceSynchronize and cudaGetLastError combined with cudaGetErrorString)
● Out of bounds in a kernel
Junk data: send 50 ints to the GPU but launch the kernel with a <<<1, 256>>> configuration and printf("%d ", array[threadIdx.x]) in the kernel:
-11039938 -1 -1 -1 -14325499 -722701 -1 -328454 -14588416
• 78. © 2017 IBM Corporation 78
● Global C++ variable referenced in my kernel
Won't compile: undefined variable
● Kernel taking forever AND a useless kernel (while true loop)
Message from syslogd@oc3752625144 at Mar 18 20:59:44 ... kernel:NMI watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [Xorg:2167]
Can't do anything during this period – only tested on my laptop
● You try to use malloc inside a kernel
No errors; you can do int* hello = (int*) malloc(256); then hello[threadIdx.x] = threadIdx.x; and printing hello[threadIdx.x] kind of works, but I don't get all of the output on each application run (using a kernel config <<<1,256>>>)
● cudaMalloc inside a kernel? Not allowed, as it's a __host__ function (and malloc isn't?!)
  • 79. © 2017 IBM Corporation 79 ● You try to call an arbitrary host function in your kernel Same as on the previous slide, not allowed (prevented by the compiler) ● Your kernel A calls kernel B Works but call the kernel as you would from the host (<<<,>>>), need to compile with nvcc -arch=sm_35 -rdc=true ● Your kernel A calls kernel B which then calls kernel A…
• 80. © 2017 IBM Corporation 80 Closing remarks
● Easy way to get the latest JDKs (optionally with Apache Spark) below
● Great if you know CUDA already – but it's not required
● GPUs don't need to be expensive – but the server ones will be
● Useful for certain operations – not the “be all and end all” that's guaranteed to give you a boost, but why not make use of it if you have it
● Discoveries to be had performance testing with simple timers, tuning kernels, and writing optimised CUDA code as well as your Java code (or using existing libraries)
● Lots of projects out there combining Java and GPUs! We're especially interested in delivering runtime improvements with minimal to no code changes required – partially by improving the IBM J9 VM itself (look out for OpenJ9)
http://ibm.biz/spark-kit
Feedback and suggestions welcome: aroberts@uk.ibm.com
  • 81. © 2017 IBM Corporation CUDA4J example code beyond this point.
• 82. CUDA4J sample, part 1 of 3
import com.ibm.cuda.*;
import com.ibm.cuda.CudaKernel.*;

public class Sample {
    private static final boolean PRINT_DATA = false;

    private static int numElements;
    private static int[] myData;
    private static CudaBuffer buffer1;
    private static CudaDevice device = new CudaDevice(0);
    private static CudaModule module;
    private static CudaKernel kernel;
    private static CudaStream stream;

    public static void main(String[] args) {
        try {
            module = new Loader().loadModule("AdamDoubler.fatbin", device);
            kernel = new CudaKernel(module, "Cuda_cuda4j_AdamDoubler_Strider");
            stream = new CudaStream(device);
            doSmallProblem();
            doMediumProblem();
            doChunkingProblem();
        } catch (CudaException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private final static void doSmallProblem() throws Exception {
        System.out.println("Doing the small sized problem");
        numElements = 100;
        myData = new int[numElements];
        Util.fillWithInts(myData);
        CudaGrid grid = Util.makeGrid(numElements, stream);
        System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");
        buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
        buffer1.copyFrom(myData);
        Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
        kernel.launch(grid, kernelParams);
        int[] originalArrayCopy = new int[myData.length];
        System.arraycopy(myData, 0, originalArrayCopy, 0, myData.length);
        buffer1.copyTo(myData);
        Util.checkArrayResultsDoubler(myData, originalArrayCopy);
    }
• 83. CUDA4J sample, part 2 of 3
    private final static void doMediumProblem() throws Exception {
        System.out.println("Doing the medium sized problem");
        numElements = 5_000_000;
        myData = new int[numElements];
        Util.fillWithInts(myData);
        // This is only when handling more than max blocks * max threads per kernel
        // Grid dim is the number of blocks in the grid
        // Block dim is the number of threads in a block
        // buffer1 is how we'll use our data on the GPU
        buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
        // myData is on the CPU, transfer it
        buffer1.copyFrom(myData);
        // Our stream executes the kernel, can launch many streams at once
        CudaGrid grid = Util.makeGrid(numElements, stream);
        System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");
        Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
        kernel.launch(grid, kernelParams);
        int[] originalArrayCopy = new int[myData.length];
        System.arraycopy(myData, 0, originalArrayCopy, 0, myData.length);
        buffer1.copyTo(myData);
        Util.checkArrayResultsDoubler(myData, originalArrayCopy);
    }
• 84. CUDA4J sample, part 3 of 3
    private final static void doChunkingProblem() throws Exception {
        // I know 5m doesn't require chunking on the GPU but this does
        System.out.println("Doing the too big to handle in one kernel problem");
        numElements = 70_000_000;
        myData = new int[numElements];
        Util.fillWithInts(myData);
        buffer1 = new CudaBuffer(device, numElements * Integer.BYTES);
        buffer1.copyFrom(myData);
        CudaGrid grid = Util.makeGrid(numElements, stream);
        System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");
        // Check we can actually launch a kernel with this grid size
        try {
            Parameters kernelParams = new Parameters(2).set(0, buffer1).set(1, numElements);
            kernel.launch(grid, kernelParams);
            int[] originalArrayCopy = new int[numElements];
            System.arraycopy(myData, 0, originalArrayCopy, 0, numElements);
            buffer1.copyTo(myData);
            Util.checkArrayResultsDoubler(myData, originalArrayCopy);
        } catch (CudaException ce) {
            if (ce.getMessage().equals("invalid argument")) {
                System.out.println("it was invalid argument, too big!");
                int maxThreadsPerBlockX = device.getAttribute(CudaDevice.ATTRIBUTE_MAX_BLOCK_DIM_X);
                int maxBlocksPerGridX = device.getAttribute(CudaDevice.ATTRIBUTE_MAX_GRID_DIM_Y);
                long maxThreadsPerGrid = maxThreadsPerBlockX * maxBlocksPerGridX; // 67,107,840 on my Windows box
                System.out.println("Max threads per grid: " + maxThreadsPerGrid);
                long numElementsAtOnce = maxThreadsPerGrid;
                long elementsDone = 0;
                grid = new CudaGrid(maxBlocksPerGridX, maxThreadsPerBlockX, stream);
                System.out.println("Kernel grid: <<<" + grid.gridDimX + ", " + grid.blockDimX + ">>>");
                while (elementsDone < numElements) {
                    if ((elementsDone + numElementsAtOnce) > numElements) {
                        numElementsAtOnce = numElements - elementsDone; // Just do the remainder
                    }
                    long toOffset = numElementsAtOnce + elementsDone;
                    // It's the byte offset, not the element index offset
                    CudaBuffer slicedSection = buffer1.slice(elementsDone * Integer.BYTES, toOffset * Integer.BYTES);
                    Parameters kernelParams = new Parameters(2).set(0, slicedSection).set(1, numElementsAtOnce);
                    kernel.launch(grid, kernelParams);
                    elementsDone += numElementsAtOnce;
                }
                int[] originalArrayCopy = new int[myData.length];
                System.arraycopy(myData, 0, originalArrayCopy, 0, myData.length);
                buffer1.copyTo(myData);
                Util.checkArrayResultsDoubler(myData, originalArrayCopy);
            } else {
                System.out.println(ce.getMessage());
            }
        }
    }
}
• 85. CUDA4J kernel
#include <stdint.h>
#include <stdio.h>

/**
 * 2D grid so we can have 1024 threads and many blocks
 * Remember 1 grid -> has blocks/threads and one kernel runs on one grid
 * In CUDA 6.5 we have cudaOccupancyMaxPotentialBlockSize which helps
 *
 * Let's say we have 100 ints to double, keeping it simple
 * Assume we want to run with 512 threads at once
 * For this size our kernel will be set up as follows
 * 1 grid, 1 block, 512 threads
 * blockDim.x is going to be 1
 * threadIdx.x will remain at 0
 * threadIdx.y will range from 0 to 511
 * So we'll cover all 512 thread slots and we'll limit access to how many elements we have
 */
extern "C" __global__ void Cuda_cuda4j_AdamDoubler(int* toDouble, int numElements) {
    int index = blockDim.x * threadIdx.x + threadIdx.y;
    if (index < numElements) { // Don't go out of bounds
        toDouble[index] *= 2; // Just double it
    }
}

extern "C" __global__ void Cuda_cuda4j_AdamDoubler_Strider(int* toDouble, int numElements) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numElements) { // don't go overboard
        toDouble[i] *= 2;
    }
}
• 86. Utility methods, part 1 of 2
package com.ibm.CUDA4JExample;

import com.ibm.cuda.*;

public class Util {
    protected final static void fillWithInts(int[] toFill) {
        for (int i = 0; i < toFill.length; i++) {
            toFill[i] = i;
        }
    }

    protected final static void fillWithDoubles(double[] toFill) {
        for (int i = 0; i < toFill.length; i++) {
            toFill[i] = i;
        }
    }

    protected final static void printArray(int[] toPrint) {
        System.out.println();
        for (int i = 0; i < toPrint.length; i++) {
            if (i == toPrint.length - 1) {
                System.out.print(toPrint[i] + ".");
            } else {
                System.out.print(toPrint[i] + ", ");
            }
        }
        System.out.println();
    }

    protected final static CudaGrid makeGrid(int numElements, CudaStream stream) {
        int numThreads = 512;
        int numBlocks = (numElements + (numThreads - 1)) / numThreads;
        return new CudaGrid(numBlocks, numThreads, stream);
    }
• 87. Utility methods, part 2 of 2
    /*
     * Array will have been doubled at this point
     */
    protected final static void checkArrayResultsDoubler(int[] toCheck, int[] originalArray) {
        long errorCount = 0;
        // Check result, data has been copied back here
        if (toCheck.length != originalArray.length) {
            System.err.println("Something's gone horribly wrong, different array length");
        }
        for (int i = 0; i < originalArray.length; i++) {
            if (toCheck[i] != (originalArray[i] * 2)) {
                errorCount++;
                /*
                System.err.println("Got an error, " + originalArray[i] + " is incorrect: wasn't doubled correctly!"
                        + " Got " + toCheck[i] + " but should be " + originalArray[i] * 2);
                */
            } else {
                //System.out.println("Correct, doubled " + originalArray[i] + " and it became " + toCheck[i]);
            }
        }
        System.err.println("Incorrect results: " + errorCount);
    }
}
• 88. CUDA4J module loader
package com.ibm.CUDA4JExample;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

import com.ibm.cuda.CudaDevice;
import com.ibm.cuda.CudaException;
import com.ibm.cuda.CudaModule;

public final class Loader {
    private final CudaModule.Cache moduleCache = new CudaModule.Cache();

    final CudaModule loadModule(String moduleName, CudaDevice device) throws CudaException, IOException {
        CudaModule module = moduleCache.get(device, moduleName);
        if (module == null) {
            try (InputStream stream = getClass().getResourceAsStream(moduleName)) {
                if (stream == null) {
                    throw new FileNotFoundException(moduleName);
                }
                module = new CudaModule(device, stream);
                moduleCache.put(device, moduleName, module);
            }
        }
        return module;
    }
}