AMD’S PROTOTYPE HSAIL-ENABLED JDK8 FOR THE OPENJDK SUMATRA PROJECT
APU’13
ERIC CASPOLE, AMD SERVER RUNTIMES TEAM

AGENDA
•  Java and Sumatra OpenJDK project
•  GPU workload fundamentals
•  AMD APU and Heterogeneous System Architecture (HSA)
•  AMD HSAIL-enabled offload demo JDK
•  Summary

PRESENTATION TITLE | NOVEMBER 19, 2013 | CONFIDENTIAL

WHY JAVA?
•  Java by the numbers
‒ 9 million developers
‒ 1 billion Java downloads per year
‒ 97% of enterprise desktops run Java
‒ 100% of Blu-ray players ship with Java
http://oracle.com.edgesuite.net/timeline/java/
•  Java 7 language & libraries already include concurrency features
‒ primitives (threads, locks, monitors, atomic ops)
‒ libraries (fork/join, thread pools, executors, futures)
•  Upcoming Java 8 includes stream processing enhancements
‒ support for ‘lambda’ expressions
‒ Lambda-centric concurrent stream processing libs/APIs (java.util.stream.*)

SUMATRA OPENJDK PROJECT
•  Intends to enable Java applications to take advantage of GPU/APU
‒  More or less transparently to the application
‒  No application native code required

•  Project started by Oracle and AMD shortly before JavaOne 2012
•  GPU/APUs offer a lot of processing power
‒  2000 ASCI RED, Sandia National Laboratories
  ‒  World’s #1 supercomputer
  ‒  http://www.top500.org/system/ranking/4428
  ‒  ~3,200 GFLOPS

‒  2013 AMD Radeon™ HD 7990
  ‒  Released April 2013, about $700 on amazon.com
  ‒  ~8,200 GFLOPS

•  HSA/OpenCL/CUDA standardize how to express both the GPU compute and host programming requirements
‒  But not easy to use from Java without a lot of native code and expertise
‒  Existing APIs include Aparapi, JOCL, OpenCL4Java, and others

IDEALLY, WE CAN TARGET COMPUTE AT THE MOST SUITABLE DEVICE
CPU excels at sequential, branchy code, I/O interaction, system programming.
Most Java applications have these characteristics and excel on the CPU.

GPU excels at data-parallel tasks, image processing, and data analysis.
Java is used in these areas/domains, but does not exploit the capabilities of the GPU as a compute device.

[Figure: workload types labeled Serial/Task-parallel Workloads, Graphics Workloads, and Other Highly Parallel Workloads]

IDEAL DATA PARALLEL ALGORITHMS/WORKLOADS
•  GPU SIMDs are optimized for data-parallel operations
‒  Performing the same sequence of operations on different data at the same time
‒  Each GPU core gets a unique work item id, often used as an array index

•  The bodies of loops are a good place to look for data-parallel opportunities
// Each loop iteration is independent
for (int i = 0; i < 100; i++)
    out[i] = in[i] * in[i];

•  As a JDK 8 Stream operation:
‒  This is a thread-safe calculation and could be a parallel stream
IntStream.range(0, in.length).forEach(p -> {
    out[p] = in[p] * in[p];
});

•  Particularly if we can loop in any order and get the same result
// Counting down visits the same indices in reverse order
for (int i = 99; i >= 0; i--)
    out[i] = in[i] * in[i];
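
The "any order" property can be checked directly on the CPU. The following is a minimal sketch, assuming plain int arrays; the class and method names are illustrative and do not come from the slides or the prototype JDK.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Illustrative sketch: SquaresDemo and its methods are assumed names.
final class SquaresDemo {
    // Sequential baseline: each iteration touches only index i.
    static int[] squareSequential(int[] in) {
        int[] out = new int[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[i] * in[i];
        }
        return out;
    }

    // Parallel stream version: safe because no iteration reads or writes
    // an element that another iteration writes.
    static int[] squareParallel(int[] in) {
        int[] out = new int[in.length];
        IntStream.range(0, in.length).parallel().forEach(p -> out[p] = in[p] * in[p]);
        return out;
    }

    public static void main(String[] args) {
        int[] in = IntStream.range(0, 100).toArray();
        // Both orders of evaluation produce the same array.
        System.out.println(Arrays.equals(squareSequential(in), squareParallel(in)));
    }
}
```

Because the two methods agree for any input, the loop is a candidate for the kind of offload discussed in the rest of the deck.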

WATCH OUT FOR DEPENDENCIES AND BOTTLENECKS
•  Data dependencies can violate the “in any order” guideline
// for loop style
for (int i = 1; i < 100; i++) {
    out[i] = out[i-1] + in[i];
}

// stream style (same dependency: each element needs the previous result)
IntStream.range(1, in.length).forEach(p -> {
    out[p] = out[p-1] + in[p];
});

•  Mutating shared data can force use of atomic constructs
‒  Note lambdas cannot reassign captured local variables (they must be effectively final)
for (int i = 0; i < 100; i++)
    sum += in[i];
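
Both patterns above have order-independent formulations in the standard JDK 8 library. The sketch below is an assumption-laden illustration: it seeds out[0] with in[0] (the slide leaves out[0] unspecified), it runs on the CPU only, and nothing here implies the prototype offloads these calls. The class name is invented.

```java
import java.util.Arrays;

// Illustrative sketch: DependencyDemo is an assumed name.
final class DependencyDemo {
    // Running total out[i] = out[i-1] + in[i], expressed as a prefix
    // operation instead of a loop that reads its own earlier output.
    // Assumes out[0] = in[0].
    static int[] runningTotal(int[] in) {
        int[] out = Arrays.copyOf(in, in.length);
        Arrays.parallelPrefix(out, Integer::sum);
        return out;
    }

    // Sum expressed as a reduction, so no lambda mutates a shared variable.
    static int total(int[] in) {
        return Arrays.stream(in).parallel().sum();
    }

    public static void main(String[] args) {
        int[] in = {1, 2, 3, 4};
        System.out.println(Arrays.toString(runningTotal(in))); // [1, 3, 6, 10]
        System.out.println(total(in)); // 10
    }
}
```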

MEET HSA AND HSAIL
•  Heterogeneous System Architecture standardizes CPU/GPU functionality
‒ Be ISA-agnostic for both CPUs and accelerators
‒ Support high-level programming languages
‒ Provide the ability to access pageable system memory from the GPU
‒ Maintain cache coherency for system memory between CPU and GPU

•  Specifications and simulator from HSA Foundation
‒ HSAIL portable ISA is “finalized” to particular hardware ISA at runtime
‒ Runtime specification for job launch and control
‒ HSAIL simulator for development and testing before hardware availability

AMD ACCELERATED PROCESSING UNIT
•  AMD APU
‒  CPU/GPU on one integrated chip
‒  Various APU models shipping since June 2011
‒  The upcoming “Kaveri” APU will be the first to support the HSA software stack

•  HSA makes a great platform for Java offload
‒  Direct access to Java heap objects in main memory from GPU cores
‒  No extra copying over bus to discrete card
‒  Pointer is a pointer from CPU or GPU application code

AMD hUMA ARCHITECTURE
•  Upcoming AMD APUs feature heterogeneous Uniform Memory Access
‒  Designed to work with HSA
‒  Pointer is a pointer from CPU or GPU application code; no copying over a bus

AMD SUMATRA PROTOTYPE: APU MEETS STREAM API
•  Enables HSA APU offload of some JDK 8 parallel stream lambdas
‒  Use of parallel() means developer thinks it’s thread-safe
‒  No special API or coding requirements for application developer

•  We are adding HSAIL support to Graal
‒  Basic HSAIL functionality already committed into Graal project

•  We hook into java.util.stream.ForEachOp to redirect to our HSA offload code
‒  ForEach “side effect” operation fits well with GPU data-parallel model
‒  Do math, set field values, but no allocation or synchronization yet
‒  Direct access to Java objects in the heap from GPU cores

•  Seamless fallback to regular JDK code if code gen or offload fails
•  This code is available in Graal and a JDK webrev, to be built together
•  Can be easily run in open-source HSA simulator on Linux systems without a GPU
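
The "seamless fallback" idea can be pictured with a small CPU-only sketch. Everything named here (OffloadBackend, tryRun, forEachWithFallback) is hypothetical and invented for illustration; the prototype's actual hook lives inside java.util.stream and is not shown in the slides. The sketch also assumes a failed offload has not touched any data, so re-running on the CPU is safe.

```java
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

// Hypothetical sketch: OffloadBackend and forEachWithFallback are invented
// names for illustration, not classes from the prototype JDK.
final class FallbackDemo {
    interface OffloadBackend {
        // Returns true only if the device executed the whole range.
        // Assumed to fail before touching any data, so a retry is safe.
        boolean tryRun(int start, int end, IntConsumer body);
    }

    static void forEachWithFallback(OffloadBackend backend, int start, int end, IntConsumer body) {
        boolean offloaded;
        try {
            offloaded = backend.tryRun(start, end, body);
        } catch (RuntimeException e) {
            offloaded = false; // treat code generation or launch failure as "not offloaded"
        }
        if (!offloaded) {
            // Regular JDK path: the unmodified parallel stream on the CPU.
            IntStream.range(start, end).parallel().forEach(body);
        }
    }

    public static void main(String[] args) {
        int[] out = new int[8];
        OffloadBackend noDevice = (start, end, body) -> false; // always declines
        forEachWithFallback(noDevice, 0, out.length, i -> out[i] = i * i);
        System.out.println(out[7]); // 49
    }
}
```

The caller's lambda is the same object on both paths, which is the property that lets application code stay unaware of whether offload happened.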
  

AMD SUMATRA PROTOTYPE: DIAGRAM
•  Our JDK uses open-source HSAIL tools
•  OKRA is a layer allowing easy use of the HSAIL tools from Java, included in the GitHub simulator repository
•  HSAIL tools assemble and finalize the HSAIL source emitted by Graal
•  OKRA passes arguments to HSA Runtime and runs kernel

[Diagram: path of a lambda from Java code to the device]
Java Application: IntStream.range(0, 1024).forEach(p -> {/* lambda */});
JDK 8 Stream API: Modified ForEach
Graal: emits HSAIL
OKRA: JNI
OKRA: Finalizes kernel using HSA Tools
HSA Runtime: runs kernel on APU or simulator
HSA Kernel: {/* lambda */}

HOW IT WORKS
•  APUs have hundreds of GPU cores
‒  HSA workitem id is used as array index for each GPU core
‒  Each core does one workitem per wavefront
‒  Think of it as hundreds of threads, each running one function per invocation

•  This JDK allows IntStream or Object Array/Vector/ArrayList stream offload
‒  We added an extra class into java.util.stream to handle our extra stream processing
‒  Stream source object array passed as hidden parameter to HSA
‒  Object Stream kernel receives array ref and uses work item id as array index
‒  Regular CPU lambda code receives Object as its parameter
‒  IntStream range comes from HSA workitem id itself

•  Collect the lambda target method at ForEachOp diversion point
‒  Send lambda method to Graal HSAIL compiler
‒  Graal emits HSAIL text, which is then sent to the HSA Finalizer for kernel creation
‒  Kernel is cached for subsequent executions
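
The work-item and caching steps above can be emulated on the CPU. This is a hedged sketch only: WorkItemDemo, kernelFor, runWorkItem, and offloadForEach are invented names, the "kernel" is a placeholder string rather than finalized HSAIL, and keying the cache on the lambda's class is an assumption standing in for the lambda target method.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical CPU-side emulation; none of these names come from the prototype.
final class WorkItemDemo {
    static final AtomicInteger finalizeCount = new AtomicInteger();
    private static final Map<Class<?>, String> KERNEL_CACHE = new ConcurrentHashMap<>();

    // Stand-in for "compile with Graal, finalize, cache": runs once per lambda class.
    static String kernelFor(Consumer<?> lambda) {
        return KERNEL_CACHE.computeIfAbsent(lambda.getClass(), cls -> {
            finalizeCount.incrementAndGet();
            return "kernel:" + cls.getName();
        });
    }

    // One work item: the hidden source array plus the work item id select
    // the element that the CPU lambda would have received directly.
    static <T> void runWorkItem(T[] source, int workItemId, Consumer<T> lambda) {
        lambda.accept(source[workItemId]);
    }

    // Emulated launch: one work item per element. Sequential here; a GPU
    // would run the work items concurrently.
    static <T> void offloadForEach(T[] source, Consumer<T> lambda) {
        kernelFor(lambda);
        for (int id = 0; id < source.length; id++) {
            runWorkItem(source, id, lambda);
        }
    }

    public static void main(String[] args) {
        StringBuilder[] items = { new StringBuilder("a"), new StringBuilder("b") };
        Consumer<StringBuilder> mark = sb -> sb.append('!');
        offloadForEach(items, mark);
        offloadForEach(items, mark); // second launch reuses the cached kernel
        System.out.println(items[0] + " " + items[1] + " finalized=" + finalizeCount.get());
    }
}
```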
  

RUNNING GRAAL HSAIL USING SIMULATOR IN NETBEANS IDE
•  This code is available now on GitHub and OpenJDK, and you can have it running in the IDE in 15 minutes

MORE DETAILS
•  Lambda arguments collected from consumer object created by stream API
‒  Captured args passed as parameters to HSA kernel same as CPU code

•  Referenced fields are accessed through memory ops like CPU-compiled methods
‒  Offsets into objects computed by Graal same as CPU codegen

•  Static fields accessed through JNI indirect reference
‒  No finalized code patching at this time, so no GC-changeable embedded constants

•  OKRA is a temporary interface to interact with HSA Runtime
‒  Java thread calls our OKRA JNI code and blocks while kernel runs
‒  OKRA is designed to work well with the HSA simulator

IntStream EXAMPLE
•  Offload baseball statistics using IntStream
‒  Player objects have accessors for various stat categories
‒  Calculate the batting average for each player
‒  The IntStream.forEach lambda body (highlighted in red on the original slide) is converted to an HSA kernel
Player[] players; // Player array initialization omitted
IntStream.range(0, players.length).parallel().forEach(n -> {
    Player p = players[n];
    if (p.getAb() > 0) {
        p.setBa((float)p.getHits() / (float)p.getAb());
    } else {
        p.setBa((float) 0.0);
    }
});

HSAIL FOR IntStream LAMBDA FROM GRAAL
version 0:95: $full : $large;
// static method HotSpotMethod<Main.lambda$7(Player[], int)>
kernel &run (
    kernarg_u64 %_arg0
) {
    ld_kernarg_u64 $d6, [%_arg0];       // Captured array ref
    workitemabsid_u32 $s2, 0;           // work item id is a gpu idiom
@L4:
    ld_global_s32 $s0, [$d6 + 16];      // load array length
    cmp_ge_b1_u32 $c0, $s2, $s0;        // compare length to workitemid
    cbr $c0, @L5;                       // return if greater
@L6:
    cvt_s64_s32 $d0, $s2;               // convert work item into array index
    mul_s64 $d0, $d0, 8;
    add_u64 $d3, $d6, $d0;
    ld_global_u64 $d0, [$d3 + 24];      // load player object
    mov_b64 $d3, $d0;
    ld_global_s32 $s3, [$d0 + 20];      // this is inlined getAb()
    cmp_lt_b1_s32 $c0, 0, $s3;          // if (p.getAb() > 0)
    cbr $c0, @L7;
@L8:
    mov_b32 $s16, 0.0f;                 // p.setBa((float) 0.0);
    st_global_f32 $s16, [$d0 + 76];
@L9:
    ret;
@L7:
    ld_global_s32 $s1, [$d0 + 28];      // inlined getHits()
    cvt_f32_s32 $s16, $s1;              // cast (float)p.getHits()
    cvt_f32_s32 $s17, $s3;              // cast (float)p.getAb()
    div_f32 $s16, $s16, $s17;           // hits / ab
    st_global_f32 $s16, [$d0 + 76];     // inlined setBa()
    brn @L9;
@L5:
    ret;
};

SMALL OBJECT STREAM EXAMPLE
•  Same example as Object Stream
‒  The Stream.forEach lambda is converted to an HSA kernel
‒  Stream source array is passed as a hidden parameter to kernel

Stream<Player> s = Arrays.stream(allHitters).parallel();
s.forEach(p -> {
    if (p.getAb() > 0) {
        p.setBa((float)p.getHits() / (float)p.getAb());
    } else {
        p.setBa((float)0.0);
    }
});
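
For readers who want to run the two baseball examples on a stock JDK 8 or later, here is a self-contained sketch. The Player class is a minimal stand-in: only the accessor names (getAb, getHits, setBa) come from the slides, while the fields, constructor, getBa, and the BattingDemo wrapper are assumptions. It runs on the CPU and does not exercise any offload.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Minimal stand-in for the slides' Player type; layout and constructor are assumed.
final class BattingDemo {
    static final class Player {
        private final int hits;
        private final int ab;   // at-bats
        private float ba;       // batting average

        Player(int hits, int ab) { this.hits = hits; this.ab = ab; }
        int getAb() { return ab; }
        int getHits() { return hits; }
        float getBa() { return ba; }
        void setBa(float ba) { this.ba = ba; }
    }

    // The lambda body shared by both stream forms.
    static void update(Player p) {
        if (p.getAb() > 0) {
            p.setBa((float) p.getHits() / (float) p.getAb());
        } else {
            p.setBa(0.0f);
        }
    }

    // IntStream form: the index selects the element explicitly.
    static void viaIntStream(Player[] players) {
        IntStream.range(0, players.length).parallel().forEach(n -> update(players[n]));
    }

    // Object stream form: the stream hands each element to the lambda.
    static void viaObjectStream(Player[] players) {
        Arrays.stream(players).parallel().forEach(BattingDemo::update);
    }

    public static void main(String[] args) {
        Player[] players = { new Player(50, 200), new Player(0, 0) };
        viaIntStream(players);
        System.out.println(players[0].getBa() + " " + players[1].getBa());
    }
}
```

Both forms write only to the element they were handed, which is why either can be treated as one work item per player.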

HSAIL FOR OBJECT STREAM LAMBDA
version 0:95: $full : $large;
// static method HotSpotMethod<Main.lambda$3(Player)>
kernel &run (
    kernarg_u64 %_arg0
) {
    ld_kernarg_u64 $d6, [%_arg0];       // Hidden stream source array ref
    workitemabsid_u32 $s2, 0;
    cvt_u64_s32 $d2, $s2;               // Convert work item id to long
    mul_u64 $d2, $d2, 8;                // Adjust index for sizeof ref
    add_u64 $d2, $d2, 24;               // Adjust for actual elements data start
    add_u64 $d2, $d2, $d6;              // Add to array ref ptr
    ld_global_u64 $d6, [$d2];           // Load from array element into parameter reg
@L0:
    ld_global_s32 $s0, [$d6 + 20];      // inlined getAb()
    cmp_lt_b1_s32 $c0, 0, $s0;          // if (p.getAb() > 0)
    cbr $c0, @L1;
@L2:
    mov_b32 $s16, 0.0f;                 // p.setBa((float)0.0);
    st_global_f32 $s16, [$d6 + 76];
@L3:
    ret;
@L1:
    ld_global_s32 $s3, [$d6 + 28];      // load p.getHits()
    cvt_f32_s32 $s16, $s3;              // (float) p.getHits()
    cvt_f32_s32 $s17, $s0;              // (float) p.getAb()
    div_f32 $s16, $s16, $s17;
    st_global_f32 $s16, [$d6 + 76];     // inlined setBa()
    brn @L3;
};
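
The address arithmetic at the top of the object-stream kernel can be restated in Java. This is a sketch under stated assumptions: the constants 24 and 8 are read off the HSAIL text, and interpreting them as "start of element data" and "size of a reference" follows the listing's own comments. ElementAddressDemo and elementAddress are invented names, and real JVM object layouts vary with configuration.

```java
// Sketch of the element-address computation from the HSAIL listing.
// The constants are taken from the listing; the names are assumptions.
final class ElementAddressDemo {
    static final long ELEMENTS_START = 24; // offset of element 0 from the array ref
    static final long REF_SIZE = 8;        // bytes per object reference

    // Mirrors cvt / mul / add / add: widen the id, scale by the reference
    // size, skip the array header, then add the array base address.
    static long elementAddress(long arrayRef, int workItemId) {
        return arrayRef + ELEMENTS_START + (long) workItemId * REF_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(elementAddress(1000L, 0)); // 1024
        System.out.println(elementAddress(1000L, 3)); // 1048
    }
}
```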
CURRENT LIMITATIONS OF HSAIL OFFLOAD DEMO JDK
•  Currently not allowed in an offloaded kernel
‒  No heap allocation
‒  No exception handling or try/catch inside a kernel
‒  No calling methods that would be a JNI or runtime call
‒  No synchronization in kernels
‒  No method handles in target lambda methods

•  Kernels are called by JNI code using JNI Critical
‒  So no GC during kernel execution
‒  Finalized kernels cannot support a GC at this time
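
To make the restrictions concrete, the sketch below contrasts a lambda that stays within them with one that does not. It is illustrative only: KernelShapeDemo and Sample are invented names, both methods run correctly on a CPU, and whether the prototype would actually accept or reject a given lambda is not something this code can demonstrate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: all names are assumptions, and both methods run on the CPU.
final class KernelShapeDemo {
    static final class Sample {
        final int raw;
        float scaled;
        Sample(int raw) { this.raw = raw; }
    }

    // Stays within the listed restrictions: arithmetic and field stores only.
    static void scaleInPlace(Sample[] samples, float factor) {
        Arrays.stream(samples).parallel().forEach(s -> s.scaled = s.raw * factor);
    }

    // Falls outside the listed restrictions: the lambda allocates on the
    // heap and synchronizes on a shared list.
    static List<String> describe(Sample[] samples) {
        List<String> out = new ArrayList<>();
        Arrays.stream(samples).parallel().forEach(s -> {
            String text = "raw=" + s.raw;          // heap allocation
            synchronized (out) { out.add(text); }  // synchronization
        });
        return out;
    }

    public static void main(String[] args) {
        Sample[] samples = { new Sample(2), new Sample(4) };
        scaleInPlace(samples, 0.5f);
        System.out.println(samples[0].scaled + " " + samples[1].scaled);
        System.out.println(describe(samples).size());
    }
}
```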

FEATURES WE HOPE TO STANDARDIZE IN SUMATRA
•  What is the heuristic or coding model for offloading?
‒  We chose parallel streams based on our experience with Aparapi and GPUs
‒  This model does not require developers to learn a new API, etc.

•  GC interaction?
‒  Possible or worthwhile to have safepoints during kernel execution?

•  What runtime calls or allocation from a kernel can be supported?
‒  Runtime calls imply pausing the GPU kernel and resuming on the CPU

•  Exception handling?
‒  Throw inside kernel with its own try-catch block handling it
‒  Throw causing kernel abort and handled in runtime on CPU

•  What synchronization can be supported in kernels?
‒  Between GPU cores
‒  Between CPU and GPU

FEATURES WE HOPE TO STANDARDIZE IN SUMATRA
•  Details of HSA versus discrete card offload?
‒  Copying/replacing buffers to card vs. direct heap access in HSA
‒  Any difference in interaction with JVM runtime?

•  How to detect and configure various offload runtime systems from Java?
‒  HSAIL/BRIG, PTX, etc.
‒  Select offload GPU(s) if more than one available

SUMMARY
•  We can offload simple JDK 8 Stream API forEach lambdas to HSA systems
‒  Seamlessly offload normal JDK 8 code
‒  No special coding or API required

•  Basic HSAIL code generation now in Graal repository
•  HSAIL simulator is available and our HSAIL demo JDK uses it
‒  Detailed check-out and build instructions on the Sumatra wiki: https://wiki.openjdk.java.net/display/Sumatra/Main

•  GPU offload for Java is here
‒  GPUs offer unprecedented performance for the appropriate workload
‒  Don’t assume everything can/should execute on the GPU
‒  Look for “islands of parallel in a sea of sequential”

•  Lots of work remains!

LINKS AND REFERENCES
•  Sumatra OpenJDK GPU/APU offload project
‒  Project home page: http://openjdk.java.net/projects/sumatra/
‒  Wiki: https://wiki.openjdk.java.net/display/Sumatra/Main

•  Graal JIT compiler and runtime project
‒  Project home page: http://openjdk.java.net/projects/graal/

•  HSA Foundation
‒  Home page: http://hsafoundation.com/
‒  Specifications at http://hsafoundation.com/standards/

•  “Kaveri” APU Overview
‒  http://www.theregister.co.uk/2013/05/01/amd_huma/

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

Slide deck: HSA-4024, OpenJDK Sumatra Project: Bringing the GPU to Java, by Eric Caspole
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterTim Ellison
 
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.J On The Beach
 
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...AMD Developer Central
 
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...AMD Developer Central
 
Keynote (Dr. Lisa Su) - Developers: The Heart of AMD Innovation - by Dr. Lisa...
Keynote (Dr. Lisa Su) - Developers: The Heart of AMD Innovation - by Dr. Lisa...Keynote (Dr. Lisa Su) - Developers: The Heart of AMD Innovation - by Dr. Lisa...
Keynote (Dr. Lisa Su) - Developers: The Heart of AMD Innovation - by Dr. Lisa...AMD Developer Central
 
Final lisa opening_keynote_draft_-_v12.1tb
Final lisa opening_keynote_draft_-_v12.1tbFinal lisa opening_keynote_draft_-_v12.1tb
Final lisa opening_keynote_draft_-_v12.1tbr Skip
 
A Primer on FPGAs - Field Programmable Gate Arrays
A Primer on FPGAs - Field Programmable Gate ArraysA Primer on FPGAs - Field Programmable Gate Arrays
A Primer on FPGAs - Field Programmable Gate ArraysTaylor Riggan
 
HSA HSAIL Introduction Hot Chips 2013
HSA HSAIL Introduction  Hot Chips 2013 HSA HSAIL Introduction  Hot Chips 2013
HSA HSAIL Introduction Hot Chips 2013 HSA Foundation
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Anant Corporation
 

Ähnlich wie HSA-4024, OpenJDK Sumatra Project: Bringing the GPU to Java, by Eric Caspole (20)

PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
PL-4042, Wholly Graal: Accelerating GPU offload for Java/Sumatra using the Op...
 
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...
 
HC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu DasHC-4017, HSA Compilers Technology, by Debyendu Das
HC-4017, HSA Compilers Technology, by Debyendu Das
 
HSA From A Software Perspective
HSA From A Software Perspective HSA From A Software Perspective
HSA From A Software Perspective
 
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
CC-4000, Characterizing APU Performance in HadoopCL on Heterogeneous Distribu...
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
 
PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...
PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...
PT-4053, Advanced OpenCL - Debugging and Profiling Using AMD CodeXL, by Uri S...
 
Making Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to UseMaking Hardware Accelerator Easier to Use
Making Hardware Accelerator Easier to Use
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
Think evo and use evo
Think evo and use evoThink evo and use evo
Think evo and use evo
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark faster
 
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
 
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
PT-4142, Porting and Optimizing OpenMP applications to APU using CAPS tools, ...
 
Spark 101
Spark 101Spark 101
Spark 101
 
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
PL-4047, Big Data Workload Analysis Using SWAT and Ipython Notebooks, by Moni...
 
Keynote (Dr. Lisa Su) - Developers: The Heart of AMD Innovation - by Dr. Lisa...
Keynote (Dr. Lisa Su) - Developers: The Heart of AMD Innovation - by Dr. Lisa...Keynote (Dr. Lisa Su) - Developers: The Heart of AMD Innovation - by Dr. Lisa...
Keynote (Dr. Lisa Su) - Developers: The Heart of AMD Innovation - by Dr. Lisa...
 
Final lisa opening_keynote_draft_-_v12.1tb
Final lisa opening_keynote_draft_-_v12.1tbFinal lisa opening_keynote_draft_-_v12.1tb
Final lisa opening_keynote_draft_-_v12.1tb
 
A Primer on FPGAs - Field Programmable Gate Arrays
A Primer on FPGAs - Field Programmable Gate ArraysA Primer on FPGAs - Field Programmable Gate Arrays
A Primer on FPGAs - Field Programmable Gate Arrays
 
HSA HSAIL Introduction Hot Chips 2013
HSA HSAIL Introduction  Hot Chips 2013 HSA HSAIL Introduction  Hot Chips 2013
HSA HSAIL Introduction Hot Chips 2013
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 

Mehr von AMD Developer Central

DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsAMD Developer Central
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesAMD Developer Central
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAMD Developer Central
 
Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceAMD Developer Central
 
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...AMD Developer Central
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozAMD Developer Central
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellAMD Developer Central
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonAMD Developer Central
 
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornDirect3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornAMD Developer Central
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevAMD Developer Central
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasAMD Developer Central
 
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...AMD Developer Central
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...AMD Developer Central
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14AMD Developer Central
 

Mehr von AMD Developer Central (20)

DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIsDX12 & Vulkan: Dawn of a New Generation of Graphics APIs
DX12 & Vulkan: Dawn of a New Generation of Graphics APIs
 
Leverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math LibrariesLeverage the Speed of OpenCL™ with AMD Math Libraries
Leverage the Speed of OpenCL™ with AMD Math Libraries
 
Introduction to Node.js
Introduction to Node.jsIntroduction to Node.js
Introduction to Node.js
 
Media SDK Webinar 2014
Media SDK Webinar 2014Media SDK Webinar 2014
Media SDK Webinar 2014
 
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware WebinarAn Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar
 
DirectGMA on AMD’S FirePro™ GPUS
DirectGMA on AMD’S  FirePro™ GPUSDirectGMA on AMD’S  FirePro™ GPUS
DirectGMA on AMD’S FirePro™ GPUS
 
Webinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop IntelligenceWebinar: Whats New in Java 8 with Develop Intelligence
Webinar: Whats New in Java 8 with Develop Intelligence
 
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
The Small Batch (and other) solutions in Mantle API, by Guennadi Riguer, Mant...
 
Inside XBox- One, by Martin Fuller
Inside XBox- One, by Martin FullerInside XBox- One, by Martin Fuller
Inside XBox- One, by Martin Fuller
 
TressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas ThibierozTressFX The Fast and The Furry by Nicolas Thibieroz
TressFX The Fast and The Furry by Nicolas Thibieroz
 
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnellRendering Battlefield 4 with Mantle by Yuriy ODonnell
Rendering Battlefield 4 with Mantle by Yuriy ODonnell
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
 
Gcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodesGcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodes
 
Inside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin FullerInside XBOX ONE by Martin Fuller
Inside XBOX ONE by Martin Fuller
 
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave OldcornDirect3D12 and the Future of Graphics APIs by Dave Oldcorn
Direct3D12 and the Future of Graphics APIs by Dave Oldcorn
 
Introduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan NevraevIntroduction to Direct 3D 12 by Ivan Nevraev
Introduction to Direct 3D 12 by Ivan Nevraev
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
 
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
 
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...Productive OpenCL Programming An Introduction to OpenCL Libraries  with Array...
Productive OpenCL Programming An Introduction to OpenCL Libraries with Array...
 
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
Rendering Battlefield 4 with Mantle by Johan Andersson - AMD at GDC14
 

Kürzlich hochgeladen

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 

Kürzlich hochgeladen (20)

Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

HSA-4024, OpenJDK Sumatra Project: Bringing the GPU to Java, by Eric Caspole

  • 1. AMD’S PROTOTYPE HSAIL-ENABLED JDK8 FOR THE OPENJDK SUMATRA PROJECT
       APU’13
       ERIC CASPOLE – AMD SERVER RUNTIMES TEAM
  • 2. AGENDA
       - Java and Sumatra OpenJDK project
       - GPU workload fundamentals
       - AMD APU and Heterogeneous System Architecture (HSA)
       - AMD HSAIL-enabled offload demo JDK
       - Summary
  • 3. WHY JAVA?
       - Java by the numbers
         ‒ 9 million developers
         ‒ 1 billion Java downloads per year
         ‒ 97% of enterprise desktops run Java
         ‒ 100% of Blu-ray players ship with Java
         http://oracle.com.edgesuite.net/timeline/java/
       - Java 7 language and libraries already include concurrency features
         ‒ primitives (threads, locks, monitors, atomic ops)
         ‒ libraries (fork/join, thread pools, executors, futures)
       - Upcoming Java 8 includes stream processing enhancements
         ‒ support for ‘lambda’ expressions
         ‒ lambda-centric concurrent stream processing libs/APIs (java.util.stream.*)
  • 4. SUMATRA OPENJDK PROJECT
       - Intending to enable Java applications to take advantage of GPU/APU
         ‒ More or less transparently to the application
         ‒ No application native code required
       - Project started by Oracle and AMD shortly before JavaOne 2012
       - GPU/APUs offer a lot of processing power
         ‒ 2000: ASCI Red, Sandia National Laboratories
           ‒ World’s #1 supercomputer
           ‒ http://www.top500.org/system/ranking/4428
           ‒ ~3,200 GFLOPS
         ‒ 2013: AMD Radeon™ HD 7990
           ‒ Released April 2013, about $700 on amazon.com
           ‒ ~8,200 GFLOPS
       - HSA/OpenCL/CUDA standardize how to express both the GPU compute and host programming requirements
         ‒ But not easy to use from Java without a lot of native code and expertise
         ‒ Existing APIs include Aparapi, JOCL, OpenCL4Java, and others
  • 5. IDEALLY, WE CAN TARGET COMPUTE AT THE MOST SUITABLE DEVICE
       CPU excels at sequential, branchy code, I/O interaction, system programming.
       Most Java applications have these characteristics and excel on the CPU.
       GPU excels at data-parallel tasks, image processing, and data analysis.
       Java is used in these areas/domains, but does not exploit the capabilities of the GPU as a compute device.
       [Diagram labels: Graphics Workloads; Serial/Task-parallel Workloads; Other Highly Parallel Workloads]
  • 6. IDEAL DATA PARALLEL ALGORITHMS/WORKLOADS
       - GPU SIMDs are optimized for data-parallel operations
         ‒ Performing the same sequence of operations on different data at the same time
         ‒ Each GPU core gets a unique work item id, often used as an array index
       - Loop bodies are a good place to look for data-parallel opportunities
           // Each loop iteration is independent
           for (int i = 0; i < 100; i++)
               out[i] = in[i] * in[i];
       - As a JDK 8 Stream operation:
         ‒ This is a thread-safe calculation and could be a parallel stream
           IntStream.range(0, in.length).forEach(p -> {
               out[p] = in[p] * in[p];
           });
       - Particularly if we can loop in any order and get the same result
           for (int i = 99; i >= 0; i--)
               out[i] = in[i] * in[i];
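The independence property on this slide can be checked directly on a stock JDK 8 or later. The sketch below is illustrative only: the class and method names (`SquareExample`, `squareLoop`, `squareParallel`) are hypothetical and not from the talk, and it runs on the CPU fork/join pool rather than on a GPU. It compares the sequential loop with a parallel stream over the same index range.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class SquareExample {
    // Sequential baseline: every iteration is independent of the others.
    static int[] squareLoop(int[] in) {
        int[] out = new int[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[i] * in[i];
        }
        return out;
    }

    // Same computation as a parallel stream. Each index p reads in[p] and
    // writes only out[p], so the order of execution does not matter.
    static int[] squareParallel(int[] in) {
        int[] out = new int[in.length];
        IntStream.range(0, in.length).parallel().forEach(p -> out[p] = in[p] * in[p]);
        return out;
    }

    public static void main(String[] args) {
        int[] in = IntStream.range(0, 100).toArray();
        // Both versions produce identical arrays.
        System.out.println(Arrays.equals(squareLoop(in), squareParallel(in)));
    }
}
```

Because no iteration reads a value written by another, the two methods agree regardless of how the runtime schedules the work.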
  • 7. WATCH OUT FOR DEPENDENCIES AND BOTTLENECKS
       - Data dependencies can violate the “in any order” guideline
           // for loop style
           for (int i = 1; i < 100; i++) {
               out[i] = out[i-1] + in[i];
           }
           // stream style
           IntStream.range(1, in.length).forEach(p -> {
               out[p] = out[p-1] + in[p];
           });
       - Mutating shared data can force use of atomic constructs
         ‒ Note that lambdas cannot modify captured local variables
           for (int i = 0; i < 100; i++)
               sum += in[i];
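One way to handle the two patterns on this slide with standard JDK 8 library calls is sketched below. This is not the approach taken by the Sumatra prototype; it only illustrates the point about dependencies. The class and method names are hypothetical, and the running sum here starts from `in[0]` rather than from a pre-existing `out[0]`, which is an assumption made to keep the example self-contained.

```java
import java.util.Arrays;

public class DependencyExample {
    // Loop-carried dependency: out[i] needs out[i-1], so the iterations
    // of this loop must run in order.
    static int[] runningSumLoop(int[] in) {
        int[] out = new int[in.length];
        if (in.length == 0) {
            return out;
        }
        out[0] = in[0];
        for (int i = 1; i < in.length; i++) {
            out[i] = out[i - 1] + in[i];
        }
        return out;
    }

    // Library alternative: Arrays.parallelPrefix computes the same running
    // sum in place and may split the work across threads, relying on the
    // operator being associative.
    static int[] runningSumParallel(int[] in) {
        int[] out = Arrays.copyOf(in, in.length);
        Arrays.parallelPrefix(out, Integer::sum);
        return out;
    }

    // Reduction instead of "sum += in[i]" inside a lambda: the stream
    // combines partial sums, so no shared variable is mutated.
    static int sumParallel(int[] in) {
        return Arrays.stream(in).parallel().sum();
    }
}
```

For the input `{1, 2, 3, 4}`, both running-sum methods return `{1, 3, 6, 10}` and `sumParallel` returns `10`.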
  • 8. MEET HSA AND HSAIL
       - Heterogeneous System Architecture standardizes CPU/GPU functionality
         ‒ Be ISA-agnostic for both CPUs and accelerators
         ‒ Support high-level programming languages
         ‒ Provide the ability to access pageable system memory from the GPU
         ‒ Maintain cache coherency for system memory between CPU and GPU
       - Specifications and simulator from HSA Foundation
         ‒ HSAIL portable ISA is “finalized” to particular hardware ISA at runtime
         ‒ Runtime specification for job launch and control
         ‒ HSAIL simulator for development and testing before hardware availability
  • 9. AMD ACCELERATED PROCESSING UNIT
    - AMD APU
      ‒ CPU/GPU on one integrated chip
      ‒ Various APU models shipping since June 2011
      ‒ The upcoming “Kaveri” APU will be the first to support the HSA software stack
    - HSA makes a great platform for Java offload
      ‒ Direct access to Java heap objects in main memory from GPU cores
      ‒ No extra copying over a bus to a discrete card
      ‒ A pointer is a pointer, whether used from CPU or GPU application code
  • 10. AMD hUMA ARCHITECTURE
    - Upcoming AMD APUs feature heterogeneous Uniform Memory Access (hUMA)
      ‒ Designed to work with HSA
      ‒ A pointer is a pointer, whether used from CPU or GPU application code: no copying over a bus
  • 11. AMD SUMATRA PROTOTYPE: APU MEETS STREAM API
    - Enables HSA APU offload of some JDK 8 parallel stream lambdas
      ‒ Use of parallel() means the developer thinks it’s thread-safe
      ‒ No special API or coding requirements for the application developer
    - We are adding HSAIL support to Graal
      ‒ Basic HSAIL functionality already committed into the Graal project
    - We hook into java.util.stream.ForEachOp to redirect to our HSA offload code
      ‒ ForEach “side effect” operation fits well with the GPU data-parallel model
      ‒ Do math, set field values, but no allocation or synchronization yet
      ‒ Direct access to Java objects in the heap from GPU cores
    - Seamless fallback to regular JDK code if code gen or offload fails
    - This code is available in Graal and as a JDK webrev, to be built together
    - Can be easily run in the open-source HSA simulator on Linux systems without a GPU
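The "seamless fallback" idea can be sketched in plain Java. Everything named here (`OffloadFallbackDemo`, the `Offloader` interface, `tryRun`) is hypothetical and only illustrates the control flow; it is not the prototype's `ForEachOp` hook. The sketch also assumes that a failed offload attempt has not executed any work item, which is what a code generation failure would look like.

```java
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

final class OffloadFallbackDemo {
    /** Hypothetical stand-in for an offload path; not an API from the prototype. */
    interface Offloader {
        /** Returns true only if the body ran on the device for every id in [0, n). */
        boolean tryRun(int n, IntConsumer body);
    }

    // Try the offload path first. If it declines or fails, run the ordinary
    // parallel stream on the CPU. Returns true when the offload path was used.
    static boolean forEach(int n, IntConsumer body, Offloader offloader) {
        try {
            if (offloader.tryRun(n, body)) {
                return true;
            }
        } catch (RuntimeException e) {
            // Assumption: failure happened before any work item executed,
            // so re-running everything on the CPU is safe.
        }
        IntStream.range(0, n).parallel().forEach(body);
        return false;
    }

    public static void main(String[] args) {
        int[] out = new int[8];
        // An offloader that always declines forces the CPU fallback.
        boolean offloaded = forEach(out.length, i -> out[i] = i * 2, (n, body) -> false);
        System.out.println(offloaded + " " + out[7]);
    }
}
```

The caller sees the same result either way, which is the property the slide describes: application code does not need to know which path ran.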
  • 12. AMD SUMATRA PROTOTYPE: DIAGRAM
    - Our JDK uses open-source HSAIL tools
    - OKRA is a layer allowing easy use of the HSAIL tools from Java, included in the GitHub simulator repository
    - HSAIL tools assemble and finalize the HSAIL source emitted by Graal
    - OKRA passes arguments to the HSA Runtime and runs the kernel
    [Diagram: Java Application, IntStream.range(0, 1024).forEach(p -> {/* lambda */}); -> JDK 8 Stream API, modified ForEach -> Graal emits HSAIL -> OKRA JNI -> OKRA finalizes kernel using HSA tools -> HSA Runtime runs the HSA kernel {/* lambda */} on APU or simulator]
  • 13. HOW IT WORKS
    - APUs have hundreds of GPU cores
      ‒ HSA workitem id is used as the array index for each GPU core
      ‒ Each core does one workitem per wavefront
      ‒ Think of it as hundreds of threads, each running one function per invocation
    - This JDK allows IntStream or Object Array/Vector/ArrayList stream offload
      ‒ We added an extra class into java.util.stream to handle our extra stream processing
      ‒ Stream source object array passed as a hidden parameter to HSA
      ‒ Object Stream kernel receives the array ref and uses the work item id as the array index
      ‒ Regular CPU lambda code receives the Object as its parameter
      ‒ IntStream range comes from the HSA workitem id itself
    - Collect the lambda target method at the ForEachOp diversion point
      ‒ Send the lambda method to the Graal HSAIL compiler
      ‒ Graal emits HSAIL text, which is then sent to the HSA Finalizer for kernel creation
      ‒ Kernel is cached for subsequent executions
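The work-item model and the kernel cache can be mimicked on the CPU to make the mechanics concrete. This is a sketch under assumptions: `WorkItemDemo`, `Kernel`, `kernelFor`, `launch`, `Cell`, and the string cache key are all hypothetical names, the "compile" step is just a counter, and the loop in `launch` stands in for hardware that would run the work items concurrently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

final class WorkItemDemo {
    /** Simple mutable element type used as stream source data. */
    static final class Cell {
        int value;
        Cell(int value) { this.value = value; }
    }

    /** Kernel shape: receives the hidden source array plus a work item id. */
    interface Kernel {
        void run(Object[] source, int workItemId);
    }

    static final AtomicInteger COMPILE_COUNT = new AtomicInteger();
    private static final Map<String, Kernel> CACHE = new ConcurrentHashMap<>();

    // "Compile" the lambda into a kernel once per key; later calls hit the cache.
    // The kernel indexes the source array with the work item id and hands the
    // element to the lambda, which is what the CPU path would receive directly.
    static Kernel kernelFor(String methodKey, Consumer<Object> lambda) {
        return CACHE.computeIfAbsent(methodKey, k -> {
            COMPILE_COUNT.incrementAndGet();
            return (source, id) -> lambda.accept(source[id]);
        });
    }

    // Launch one work item per array element (sequentially here; a GPU would
    // run them concurrently).
    static void launch(Kernel kernel, Object[] source) {
        for (int id = 0; id < source.length; id++) {
            kernel.run(source, id);
        }
    }

    public static void main(String[] args) {
        Cell[] cells = { new Cell(1), new Cell(2), new Cell(3) };
        Consumer<Object> doubler = o -> ((Cell) o).value *= 2;
        launch(kernelFor("Demo.lambda$doubler", doubler), cells);
        launch(kernelFor("Demo.lambda$doubler", doubler), cells); // cache hit
        System.out.println(cells[2].value + " compiled=" + COMPILE_COUNT.get());
    }
}
```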
  • 14. RUNNING GRAAL HSAIL USING SIMULATOR IN NETBEANS IDE
    - This code is available now on GitHub and in OpenJDK, and you can have it running in the IDE in 15 minutes
  • 15. MORE DETAILS
    - Lambda arguments are collected from the consumer object created by the stream API
      ‒ Captured args are passed as parameters to the HSA kernel, the same as for CPU code
    - Referenced fields are accessed through memory ops, like CPU-compiled methods
      ‒ Offsets into objects are computed by Graal, the same as for CPU codegen
    - Static fields are accessed through a JNI indirect reference
      ‒ No finalized code patching at this time, so no GC-changeable embedded constants
    - OKRA is a temporary interface to interact with the HSA Runtime
      ‒ Java thread calls our OKRA JNI code and blocks while the kernel runs
      ‒ OKRA is designed to work well with the HSA simulator
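To illustrate "captured args passed as parameters", the sketch below writes out by hand the kind of method a capturing lambda is compiled to: captured values come first, followed by the lambda's own parameter. The names (`CapturedArgsDemo`, `lambdaBody`, `scaleViaLambda`, `scaleViaExplicitArgs`) are illustrative assumptions; the real synthetic method name and exact signature are chosen by javac.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class CapturedArgsDemo {
    // Hand-written analogue of the synthetic method generated for the lambda
    // below: the captured values (in, out, factor) become leading parameters,
    // and the lambda parameter p comes last.
    static void lambdaBody(int[] in, int[] out, int factor, int p) {
        out[p] = in[p] * factor;
    }

    // Ordinary stream code: the lambda captures in, out and factor.
    static int[] scaleViaLambda(int[] in, int factor) {
        int[] out = new int[in.length];
        IntStream.range(0, in.length).parallel().forEach(p -> out[p] = in[p] * factor);
        return out;
    }

    // Kernel-style launch: captured values are passed explicitly on every
    // invocation, and the loop index plays the role of the work item id.
    static int[] scaleViaExplicitArgs(int[] in, int factor) {
        int[] out = new int[in.length];
        for (int id = 0; id < in.length; id++) {
            lambdaBody(in, out, factor, id);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] in = {1, 2, 3};
        System.out.println(Arrays.equals(scaleViaLambda(in, 3), scaleViaExplicitArgs(in, 3)));
    }
}
```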
  • 16. IntStream EXAMPLE
    - Offload baseball statistics using IntStream
      ‒ Player objects have accessors for various stat categories
      ‒ Calculate the batting average for each player
      ‒ The IntStream.forEach lambda code (shown in red on the slide) is converted to an HSA kernel
        Player[] players; // Player array initialization omitted
        IntStream.range(0, players.length).parallel().forEach(n -> {
            Player p = players[n];
            if (p.getAb() > 0) {
                p.setBa((float)p.getHits() / (float)p.getAb());
            } else {
                p.setBa((float) 0.0);
            }
        });
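For readers who want to run the example on a stock JDK 8, here is a self-contained version. The `Player` class is a hypothetical minimal stand-in: the slides only show that it has `getAb()`, `getHits()`, and `setBa()`, so the fields, constructor, and `getBa()` are assumptions. This executes on the CPU; nothing here triggers offload.

```java
import java.util.stream.IntStream;

final class BattingAverageDemo {
    /** Hypothetical minimal Player; only the accessors appear in the slides. */
    static final class Player {
        private final int ab;   // at-bats
        private final int hits;
        private float ba;       // batting average, filled in by the lambda

        Player(int ab, int hits) {
            this.ab = ab;
            this.hits = hits;
        }

        int getAb() { return ab; }
        int getHits() { return hits; }
        float getBa() { return ba; }
        void setBa(float ba) { this.ba = ba; }
    }

    // Same lambda as on the slide: each index touches only its own Player,
    // so the iterations are independent.
    static void computeAverages(Player[] players) {
        IntStream.range(0, players.length).parallel().forEach(n -> {
            Player p = players[n];
            if (p.getAb() > 0) {
                p.setBa((float) p.getHits() / (float) p.getAb());
            } else {
                p.setBa(0.0f);
            }
        });
    }

    public static void main(String[] args) {
        Player[] players = { new Player(4, 1), new Player(0, 0), new Player(10, 3) };
        computeAverages(players);
        for (Player p : players) {
            System.out.println(p.getBa());
        }
    }
}
```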
  • 17. HSAIL FOR IntStream LAMBDA FROM GRAAL
        version 0:95: $full : $large;
        // static method HotSpotMethod<Main.lambda$7(Player[], int)>
        kernel &run (
            kernarg_u64 %_arg0
        ) {
            ld_kernarg_u64 $d6, [%_arg0];     // Captured array ref
            workitemabsid_u32 $s2, 0;         // work item id is a gpu idiom
        @L4:
            ld_global_s32 $s0, [$d6 + 16];    // load array length
            cmp_ge_b1_u32 $c0, $s2, $s0;      // compare length to workitemid
            cbr $c0, @L5;                     // return if greater
        @L6:
            cvt_s64_s32 $d0, $s2;             // convert work item into array index
            mul_s64 $d0, $d0, 8;
            add_u64 $d3, $d6, $d0;
            ld_global_u64 $d0, [$d3 + 24];    // load player object
            mov_b64 $d3, $d0;
            ld_global_s32 $s3, [$d0 + 20];    // this is inlined getAb()
            cmp_lt_b1_s32 $c0, 0, $s3;        // if (p.getAb() > 0)
            cbr $c0, @L7;
        @L8:
            mov_b32 $s16, 0.0f;               // p.setBa((float) 0.0);
            st_global_f32 $s16, [$d0 + 76];
        @L9:
            ret;
        @L7:
            ld_global_s32 $s1, [$d0 + 28];    // inlined getHits()
            cvt_f32_s32 $s16, $s1;            // cast (float)p.getHits()
            cvt_f32_s32 $s17, $s3;            // cast (float)p.getAb()
            div_f32 $s16, $s16, $s17;         // hits / ab
            st_global_f32 $s16, [$d0 + 76];   // inlined setBa()
            brn @L9;
        @L5:
            ret;
        };
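The address arithmetic in the listing can be restated in Java as a worked example. The three constants are read directly off the HSAIL above (`+ 16` for the array length, `+ 24` for the start of the element data, `8` bytes per reference); they reflect the object layout of the JVM that produced this listing and should not be assumed for other configurations. The class and method names are hypothetical.

```java
final class ArrayAddressDemo {
    static final long ARRAY_LENGTH_OFFSET = 16; // from: ld_global_s32 $s0, [$d6 + 16]
    static final long ARRAY_BASE_OFFSET = 24;   // from: ld_global_u64 $d0, [$d3 + 24]
    static final long REF_SIZE = 8;             // from: mul_s64 $d0, $d0, 8

    // Address of the reference stored at index workItemId:
    // arrayRef + id * 8 (the mul/add pair), then + 24 (the load displacement).
    static long elementAddress(long arrayRef, int workItemId) {
        return arrayRef + (long) workItemId * REF_SIZE + ARRAY_BASE_OFFSET;
    }

    // Mirrors the guard cmp_ge_b1_u32 / cbr @L5: the kernel returns early when
    // the work item id, compared as unsigned, is >= the array length.
    static boolean inBounds(int workItemId, int arrayLength) {
        return Integer.compareUnsigned(workItemId, arrayLength) < 0;
    }

    public static void main(String[] args) {
        long arrayRef = 0x1000L; // made-up base address, for illustration only
        System.out.println(elementAddress(arrayRef, 3));
        System.out.println(inBounds(3, 4) + " " + inBounds(4, 4));
    }
}
```

The guard exists because the launch grid may contain more work items than the array has elements, so each work item checks its own id before touching memory.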
  • 18. SMALL OBJECT STREAM EXAMPLE
    - Same example written as an Object Stream
      ‒ The Stream.forEach lambda is converted to an HSA kernel
      ‒ Stream source array is passed as a hidden parameter to the kernel
        Stream<Player> s = Arrays.stream(allHitters).parallel();
        s.forEach(p -> {
            if (p.getAb() > 0) {
                p.setBa((float)p.getHits() / (float)p.getAb());
            } else {
                p.setBa((float)0.0);
            }
        });
  • 19. HSAIL FOR OBJECT STREAM LAMBDA
        version 0:95: $full : $large;
        // static method HotSpotMethod<Main.lambda$3(Player)>
        kernel &run (
            kernarg_u64 %_arg0
        ) {
            ld_kernarg_u64 $d6, [%_arg0];     // Hidden stream source array ref
            workitemabsid_u32 $s2, 0;
            cvt_u64_s32 $d2, $s2;             // Convert work item id to long
            mul_u64 $d2, $d2, 8;              // Adjust index for sizeof ref
            add_u64 $d2, $d2, 24;             // Adjust for actual elements data start
            add_u64 $d2, $d2, $d6;            // Add to array ref ptr
            ld_global_u64 $d6, [$d2];         // Load from array element into parameter reg
        @L0:
            ld_global_s32 $s0, [$d6 + 20];    // inlined getAb()
            cmp_lt_b1_s32 $c0, 0, $s0;        // if (p.getAb() > 0)
            cbr $c0, @L1;
        @L2:
            mov_b32 $s16, 0.0f;
            st_global_f32 $s16, [$d6 + 76];   // p.setBa((float)0.0);
        @L3:
            ret;
        @L1:
            ld_global_s32 $s3, [$d6 + 28];    // load p.getHits()
            cvt_f32_s32 $s16, $s3;            // (float) p.getHits()
            cvt_f32_s32 $s17, $s0;            // (float) p.getAb()
            div_f32 $s16, $s16, $s17;
            st_global_f32 $s16, [$d6 + 76];   // inlined setBa()
            brn @L3;
        };
  • 20. CURRENT LIMITATIONS OF HSAIL OFFLOAD DEMO JDK
    - Currently not allowed in an offloaded kernel
      ‒ No heap allocation
      ‒ No exception handling or try/catch inside a kernel
      ‒ No calling methods that would be a JNI or runtime call
      ‒ No synchronization in kernels
      ‒ No method handles in target lambda methods
    - Kernels are called by JNI code using JNI Critical
      ‒ So no GC during kernel execution
      ‒ Finalized kernels cannot support a GC at this time
  • 21. FEATURES WE HOPE TO STANDARDIZE IN SUMATRA
    - What is the heuristic or coding model for offloading?
      ‒ We chose parallel streams based on our experience with Aparapi and GPUs
      ‒ This model does not require developers to learn a new API, etc.
    - GC interaction?
      ‒ Is it possible or worthwhile to have safepoints during kernel execution?
    - What runtime calls or allocation from a kernel can be supported?
      ‒ Runtime calls imply pausing the GPU kernel and resuming on the CPU
    - Exception handling?
      ‒ Throw inside a kernel with its own try-catch block handling it
      ‒ Throw causing kernel abort and handled in the runtime on the CPU
    - What synchronization can be supported in kernels?
      ‒ Between GPU cores
      ‒ Between CPU and GPU
  • 22. FEATURES WE HOPE TO STANDARDIZE IN SUMATRA
    - Details of HSA versus discrete card offload?
      ‒ Copying/replacing buffers to the card vs. direct heap access in HSA
      ‒ Any difference in interaction with the JVM runtime?
    - How to detect and configure various offload runtime systems from Java?
      ‒ HSAIL/BRIG, PTX, etc.
      ‒ Select offload GPU(s) if more than one is available
  • 23. SUMMARY
    - We can offload simple JDK 8 Stream API forEach lambdas to HSA systems
      ‒ Seamlessly offload normal JDK 8 code
      ‒ No special coding or API required
    - Basic HSAIL code generation is now in the Graal repository
    - HSAIL simulator is available and our HSAIL demo JDK uses it
      ‒ Detailed check-out and build instructions on the Sumatra wiki: https://wiki.openjdk.java.net/display/Sumatra/Main
    - GPU offload for Java is here
      ‒ GPUs offer unprecedented performance for the appropriate workload
      ‒ Don’t assume everything can/should execute on the GPU
      ‒ Look for “islands of parallel in a sea of sequential”
    - Lots of work remains!
  • 24. LINKS AND REFERENCES
    - Sumatra OpenJDK GPU/APU offload project
      ‒ Project home page: http://openjdk.java.net/projects/sumatra/
      ‒ Wiki: https://wiki.openjdk.java.net/display/Sumatra/Main
    - Graal JIT compiler and runtime project
      ‒ Project home page: http://openjdk.java.net/projects/graal/
    - HSA Foundation
      ‒ Home page: http://hsafoundation.com/
      ‒ Specifications at http://hsafoundation.com/standards/
    - “Kaveri” APU Overview
      ‒ http://www.theregister.co.uk/2013/05/01/amd_huma/
  • 25. DISCLAIMER & ATTRIBUTION
    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
    The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
    AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
    AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
    ATTRIBUTION
    © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC).
    Other names are for informational purposes only and may be trademarks of their respective owners.