Understanding Android
Benchmarks
“freedom” koan-sin tan	

freedom@computer.org	

OSDC.tw,Taipei	

Apr 11th, 2014
1
disclaimers
• many of the materials used in this slide
deck are from the Internet and textbooks,
e.g., many of the following materials are
from “Computer Architecture: A
Quantitative Approach,” 1st ~ 5th ed	

• opinions expressed here are my personal ones and don't reflect my employer's views
2
who am i
• did some networking and security research before	

• working for a SoC company, recently on	

• big.LITTLE scheduling and related stuff	

• parallel construct evaluation	

• run benchmarking from time to time	

• for improving performance of our products, and	

• for knowing our colleagues' progress
3
• Focusing on CPU and memory parts of
benchmarks	

• let’s ignore graphics (2d, 3d), storage I/O,
etc.
4
Blackbox
!
• google image search "benchmark", and you can find that many of the results are Android-related benchmarks

• Similar to the recent Cross-Strait Trade in Services Agreement (TiSA), most benchmarks on the Android platform are kind of a black box
5
Is Apple A7 good?
• When Apple released the new iPhone 5s, you saw many technical blogs show benchmarks in the reviews they came up with

• commonly used ones:	

• GeekBench	

• JavaScript benchmarks	

• Some graphics benchmarks	

• Why? Are they right ones? etc.
e.g., http://www.anandtech.com/show/7335/the-iphone-5s-review
6
open blackbox
7
Android Benchmarks
8
No, this is not the way to improve
9
http://www.anandtech.com/show/7384/state-of-cheating-in-android-benchmarks
Assuming there is no cheating, what can we do?
Outline
• Performance benchmark review
• Some Android benchmarks	

• What we did and what still can be done	

• Future
11
To quote what Prof. Raj Jain quoted
• Benchmark v. trans.
To subject (a system)
to a series of tests in
order to obtain
prearranged results
not available on
competitive systems
From:“The Devil’s DP
Dictionary” S. Kelly-Bootle
12
Why benchmarking
• We did something good; let's check if we did it right

• comparing with our own previous results to see if we broke anything

• We want to know how good our
colleagues in other places are
13
What to report?
• Usually, what we mean by “benchmarking”
is to measure performance	

• What to report?	

• intuitive answer: how many things we do in a certain period of time

• yes, time. E.g., MIPS, MFLOPS, MiB/s, bps
14
MIPS and MFLOPS
• MIPS (Million Instructions per Second), MFLOPS (Million Floating-Point Operations per Second)
• All instructions are not created equal
– CISC machine instructions usually accomplish a lot more than those of RISC machines; comparing the instructions of a CISC machine and a RISC machine is similar to comparing Latin and Greek
15
MIPS and what's wrong with it
• MIPS is instruction-set dependent, making it difficult to compare the MIPS of computers with different ISAs
• MIPS varies between programs on the same computer; and, most importantly,
• MIPS can vary inversely to performance
– with hardware FP, generally, MIPS is smaller
16
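For reference (not on the original slide), the textbook definitions behind the MIPS bullets above, with IC the instruction count and CPI the average cycles per instruction:

\[
\text{MIPS} = \frac{\text{IC}}{\text{Execution time}\times 10^{6}} = \frac{\text{Clock rate}}{\text{CPI}\times 10^{6}}
\]

This is why MIPS can move opposite to performance: a machine without hardware floating point emulates each FP operation with many simple integer instructions, so IC and MIPS go up even though the program takes longer to finish.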
MFLOPS and what's wrong with it
• Applies only to programs with floating-point operations
• Operations instead of instructions, but still
– floating-point instructions are different on machines with different ISAs
– fast and slow floating-point operations
• Possible solution: weighting and a source-code-level count
– ADD, SUB, COMPARE: 1
– DIVIDE, SQRT: 2
– EXP, SIN: 4
17
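To make the weighting above concrete, here is a minimal sketch (not from the textbook; the operation counts and run time are hypothetical placeholders) of how a normalized-MFLOPS figure would be computed from source-level operation counts:

#include <stdio.h>

int main(void)
{
    /* hypothetical source-level FP operation counts for one program */
    double add_sub_cmp = 1.0e9;   /* weight 1 */
    double div_sqrt    = 1.0e8;   /* weight 2 */
    double exp_sin     = 1.0e7;   /* weight 4 */
    double seconds     = 2.5;     /* measured run time, also hypothetical */

    double normalized_flops = 1.0 * add_sub_cmp
                            + 2.0 * div_sqrt
                            + 4.0 * exp_sin;
    printf("normalized MFLOPS: %.1f\n", normalized_flops / seconds / 1.0e6);
    return 0;
}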
• The best choice of benchmarks to measure performance is real applications
18
Problematic benchmarks
• Kernel: small, key pieces of real applications, e.g., linpack
• Toy programs: 100-line programs from beginning programming assignments, e.g., quicksort
• Synthetic benchmarks: fake programs invented to try to match the profile and behavior of real applications, e.g., Dhrystone
19
Why are they disreputable?
• Small, fit in cache
• Obsolete instruction mix
• Uncontrolled source code
• Prone to compiler tricks
• Short runtimes on modern machines
• Single-number performance characterization with a single benchmark
• Difficult to reproduce results (short runtime and low-precision UNIX timer)
20
Dhrystone
• Source
– http://homepages.cwi.nl/~steven/dry.c
• < 1000 LoC
– Size of CA15 binary compiled with bionic
• Instructions: ~ 14 KiB
text data bss dec
13918 467 10266 24660
21
Whetstone
• Dhrystone is a pun on Whetstone
• Source code: http://www.netlib.org/benchmark/whetstone.c
Test MFLOPS MOPS ms
N1 float 119.78 0.16
N2 float 171.98 0.78
N3 if 154.25 0.67
N4 fixpt 397.48 0.79
N5 cos 19.08 4.36
N6 float 84.22 6.41
N7 equal 86.84 2.13
N8 exp 5.95 6.26
MWIPS 463.97 21.55
22
More on Synthetic benchmarks
• The best known examples of synthetic benchmarks are Whetstone and Dhrystone
• Problems:
– Compiler and hardware optimizations can artificially inflate performance of these benchmarks but not of real programs
– The other side of the coin is that because these benchmarks are not natural programs, they don't reward optimizations of behaviors that occur in real programs
• Examples:
– Optimizing compilers can discard 25% of the Dhrystone code; examples include loops that are only executed once, making the loop overhead instructions unnecessary
– Most Whetstone floating-point loops execute small numbers of times or include calls inside the loop. These characteristics are different from many real programs
– Some more discussion in the 1st edition of the textbook
23
LINPACK
• LINPACK: a floating-point benchmark from the manual of the LINPACK library
• Source
– http://www.netlib.org/benchmark/linpackc
– http://www.netlib.org/benchmark/linpackc.new
• 883 LoC
– Size of CA15 binary compiled with bionic
• Instructions: ~ 13 KiB
text data bss dec
12670 408 0 13086
24
25
CoreMark (1/2)
• CoreMark is a benchmark that aims to measure the performance of central processing units (CPUs) used in embedded systems. It was developed in 2009 by Shay Gal-On at EEMBC and is intended to become an industry standard, replacing the antiquated Dhrystone benchmark
• The code is written in C and contains implementations of the following algorithms:
– Linked list processing,
– Matrix (mathematics) manipulation (common matrix operations),
– state machine (determine if an input stream contains valid numbers), and
– CRC
• from Wikipedia
26
CoreMark (2/2)
name LoC
core_list_join.c 496
core_matrix.c 308
core_stat.c 277
core_util.c 210
• CoreMark vs. Dhrystone
– Reporting rules
– Use of library calls, e.g., malloc(), is avoided
– CRC to make sure data are correct
• However, CoreMark is a kernel + synthetic benchmark, still with quite a small footprint
text data bss dec
18632 456 20 19108
27
So?
• To overcome the danger of placing all eggs in one basket, collections of benchmark applications, called benchmark suites, are a popular measure of the performance of processors with a variety of applications
• Standard Performance Evaluation Corporation (SPEC)
28
29
Why CPU2000 in the 2010s?
• Why ARM sticks with SPEC CPU2000 instead of CPU2006
– 1999 Q4 results, the earliest available CPU2000 results (http://www.spec.org/cpu2000/results/res1999q4/)
• CINT2000 base: 133 – 424
• CFP2000 base: 126 – 514
– 2005 Opteron 144, 1.8 GHz
• 1,440 (CA15 at 1.9 GHz reported by nVidia is 1,168)
– CPU2006 requires much more DRAM; 1 GiB DRAM is not enough
name           CA9   CA7   CA15  Krait
SPECint 2000   356   320   537   326
SPECfp 2000    298   236   567   350
All normalized to 1.0 GHz
30
SPEC numbers from the Quantitative Approach, 5th Edition
31
How long does SPEC CPU2000 take?
• About 1 hr to compile
• Runtime: sum of base runtimes multiplied by 3
– E.g., 1.7 GHz CA15: (2256 + 3229) x 3 = 16,455 s ~= 4.57 hr
– For 1.0 GHz: 4.57 x 1.7 = 7.77 hr
– For CA7, assuming it is twice as slow: 7.77 x 2 = 15.54 hr
Benchmark          Reference Time  Base Runtime  Base Ratio
164.gzip           1400            215           652
175.vpr            1400            198           707
176.gcc            1100            94.8          1161
181.mcf            1800            266           677
186.crafty         1000            118           850
197.parser         1800            291           619
252.eon            1300            87.8          1480
253.perlbmk        1800            172           1045
254.gap            1100            107           1026
255.vortex         1900            211           899
256.bzip2          1500            203           740
300.twolf          3000            399           752
SPECint_base2000                   2256          854

Benchmark          Reference Time  Base Runtime  Base Ratio
168.wupwise        1600            162           991
171.swim           3100            389           797
172.mgrid          1800            339           532
173.applu          2100            241           870
177.mesa           1400            112           1254
178.galgel         2900            201           1444
179.art            2600            195           1332
183.equake         1300            157           828
187.facerec        1900            183           1036
188.ammp           2200            353           623
189.lucas          2000            134           1491
191.fma3d          2100            212           988
200.sixtrack       1100            241           456
301.apsi           2600            310           839
SPECfp_base2000                    3229          909.6
32
Figure 1.16 SPEC2006 programs and the evolution of the SPEC benchmarks over time, with integer programs above the line and floating-point programs below the line. Of the 12 SPEC2006 integer programs, 9 are written in C, and the rest in C++. For the floating-point programs, the split is 6 in Fortran, 4 in C++, 3 in C, and 4 in mixed C and Fortran. The figure shows all 70 of the programs in the 1989, 1992, 1995, 2000, and 2006 releases. The benchmark descriptions on the left are for SPEC2006 only and do not apply to earlier versions. Programs in the same row from different generations of SPEC are generally not related; for example, fpppp is not a CFD code like bwaves. Gcc is the senior citizen of the group. Only 3 integer programs and 3 floating-point programs survived three or more generations. Note that all the floating-point programs are new for SPEC2006. Although a few are carried over from generation to generation, the version of the program changes and either the input or the size of the benchmark is often changed to increase its running time and to avoid perturbation in measurement or domination of the execution time by some factor other than CPU time.
33
EEMBC
• Embedded Microprocessor Benchmark Consortium (EEMBC): 41 kernels used to predict performance of different embedded applications:
– Automotive/industrial
– Consumer
– Networking
– Office automation
– Telecommunication
• 3rd edition showed some EEMBC results, 4th edition changed its mind
• Unmodified performance and "full-fury" performance
• Kernel, reporting options
– Not a good predictor of relative performance of different embedded computers
34
Report benchmark results
• Reproducible
– Machine configuration (hardware, software (OS, compiler, etc.))
• Summarizing results
– You should not add different numbers
• Some use weighted averages
– Ratios, compared with a reference machine
• Geometric mean of ratios
– The geometric mean of the ratios is the same as the ratio of the geometric means
– The ratio of the geometric means is equal to the geometric mean of the performance ratios
35
Geometric mean
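The slide presumably showed the formula as an image; written out (with GM the geometric mean and t_{A,i}, t_{B,i} the runtimes of benchmark i on machines A and B), the property quoted on the previous slide is:

\[
\operatorname{GM}(x_1,\ldots,x_n)=\Bigl(\prod_{i=1}^{n} x_i\Bigr)^{1/n},
\qquad
\frac{\operatorname{GM}(t_{A,1},\ldots,t_{A,n})}{\operatorname{GM}(t_{B,1},\ldots,t_{B,n})}
=\operatorname{GM}\!\Bigl(\frac{t_{A,1}}{t_{B,1}},\ldots,\frac{t_{A,n}}{t_{B,n}}\Bigr)
\]

so the relative ranking of two machines does not depend on which one is chosen as the reference.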
36
• Fallacy: benchmarks remain valid indefinitely
– Ability to resist "benchmark engineering" or "benchmarketing"
– gcc is the only survivor from SPEC89
• Almost 70% of all programs from SPEC2000 or earlier were dropped from the next release
37
Other benchmarks
• Stream
– To test memory bandwidth
– It also tests floating-point performance
– Operates on floating-point (double, 8 bytes) arrays
• copy, scale, add, triad
• lmbench
– Micro-benchmarks to measure software/hardware overhead from a software perspective
– lmbench paper (1996), http://www.bitmover.com/lmbench/lmbench-usenix.pdf
name kernel bytes/iter FLOPS/iter
COPY a(i) = b(i) 16 0
SCALE a(i) = q*b(i) 16 1
SUM a(i) = b(i) + c(i) 24 1
TRIAD a(i) = b(i) + q*c(i) 24 2
38
Stream 5.10
for (k=0; k<NTIMES; k++)
    {
    times[0][k] = mysecond();
    for (j=0; j<STREAM_ARRAY_SIZE; j++)        /* Copy:  c = a       */
        c[j] = a[j];
    times[0][k] = mysecond() - times[0][k];

    times[1][k] = mysecond();
    for (j=0; j<STREAM_ARRAY_SIZE; j++)        /* Scale: b = q*c     */
        b[j] = scalar*c[j];
    times[1][k] = mysecond() - times[1][k];

    times[2][k] = mysecond();
    for (j=0; j<STREAM_ARRAY_SIZE; j++)        /* Add:   c = a + b   */
        c[j] = a[j]+b[j];
    times[2][k] = mysecond() - times[2][k];

    times[3][k] = mysecond();
    for (j=0; j<STREAM_ARRAY_SIZE; j++)        /* Triad: a = b + q*c */
        a[j] = b[j]+scalar*c[j];
    times[3][k] = mysecond() - times[3][k];
    }
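For completeness, a condensed sketch (assuming the stream.c context above, i.e., NTIMES, STREAM_ARRAY_SIZE, and the recorded times[][] array; not verbatim from stream.c) of how version 5.10 turns those timings into the MB/s figures it reports: bytes moved per kernel divided by the best time, with the first iteration skipped as warm-up.

#include <float.h>
#include <stdio.h>
#include <stddef.h>

/* Report bandwidth like stream.c: best (minimum) time per kernel,
 * skipping the first iteration; Copy/Scale move 2 arrays, Add/Triad 3. */
static void report(double times[4][NTIMES], size_t n /* STREAM_ARRAY_SIZE */)
{
    static const char *label[4] = { "Copy", "Scale", "Add", "Triad" };
    const double bytes[4] = { 2.0 * sizeof(double) * n, 2.0 * sizeof(double) * n,
                              3.0 * sizeof(double) * n, 3.0 * sizeof(double) * n };
    for (int j = 0; j < 4; j++) {
        double mintime = FLT_MAX;
        for (int k = 1; k < NTIMES; k++)       /* skip iteration 0 (warm-up) */
            if (times[j][k] < mintime)
                mintime = times[j][k];
        printf("%-6s %12.1f MB/s\n", label[j], 1.0e-6 * bytes[j] / mintime);
    }
}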
39
lmbench
• lmbench is a micro-benchmark suite designed to focus attention on the basic building blocks of many common system applications, such as databases, simulations, software development, and networking
40
Parallel? Let's look at other SPEC benchmarks
• SPECapc for 3ds Max™ 2011, performance evaluation software for systems running Autodesk 3ds Max 2011.
• SPECapcSM for Lightwave 3D 9.6, performance evaluation software for systems running NewTek LightWave 3D v9.6 software.
• SPECjbb2005, evaluates the performance of server-side Java by emulating a three-tier client/server system (with emphasis on the middle tier).
• SPECjEnterprise2010, a multi-tier benchmark for measuring the performance of Java 2 Enterprise Edition (J2EE) technology-based application servers.
• SPECjms2007, Java Message Service performance
• SPECjvm2008, measuring basic Java performance of a Java Runtime Environment on a wide variety of both client and server systems.
• SPECapc, performance of several 3D-intensive popular applications on a given system
• SPEC MPI2007, for evaluating performance of parallel systems using MPI (Message Passing Interface) applications.
• SPEC OMP2001 V3.2, for evaluating performance of parallel systems using OpenMP (http://www.openmp.org) applications.
• SPECpower_ssj2008, evaluates the energy efficiency of server systems.
• SPECsfs2008, file server throughput and response time supporting both NFS and CIFS protocol access
• SPECsip_Infrastructure2011, SIP server performance
• SPECviewperf 11, performance of an OpenGL 3D graphics system, tested with various rendering tasks from real applications
• SPECvirt_sc2010 ("SPECvirt"), evaluates the performance of datacenter servers used in virtualized server consolidation
41
PARSEC
• The Princeton Application Repository for Shared-Memory Computers (PARSEC) is a benchmark suite composed of multithreaded programs. The suite focuses on emerging workloads and was designed to be representative of next-generation shared-memory programs for chip-multiprocessors
• Didn't really use it yet
• http://parsec.cs.princeton.edu/
Workload        Parallelization model
                Pthreads  OpenMP  Intel TBB
blackscholes    Yes       Yes     Yes
bodytrack       Yes       Yes     Yes
canneal         Yes       No      No
dedup           Yes       No      No
facesim         Yes       No      No
ferret          Yes       No      No
fluidanimate    Yes       No      Yes
freqmine        No        Yes     No
raytrace        Yes       No      No
streamcluster   Yes       No      Yes
swaptions       Yes       No      Yes
vips            Yes       No      No
x264            Yes       No      No
42
Are Dhrystone-like benchmarks useful?
• Yes, if you know their limitations

• Don't do marketing as if those benchmarks reflected real user-perceived performance
43
A7 Dhrystone
DMIPS/MHz: iPhone 5s 7.47, iPhone 5s (32-bit) 5.70, CA15 2.71, CA7 1.67, Krait 400 2.46
44
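For context (a common convention rather than something defined on the slide): DMIPS normalizes the raw Dhrystone result to the VAX 11/780, which scored 1757 Dhrystones per second, and the chart divides once more by clock frequency:

\[
\text{DMIPS}=\frac{\text{Dhrystones per second}}{1757},\qquad
\text{DMIPS/MHz}=\frac{\text{DMIPS}}{f_{\text{clock}}\ [\text{MHz}]}
\]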
A7 linpack MFLOPS
MFLOPS/GHz: iPhone 5s 722, iPhone 5s (32-bit) 723, CA15 449, CA7 119, Krait 400 299
45
A7 CoreMark
CoreMark/MHz: iPhone 5s 5.72, iPhone 5s (32-bit) 4.45, CA15 3.67, CA7 2.46, Krait 400 3.30
46
Different items
• Example, GeekBench 3	

• Arithmetic mean with different weight?
How?	

• Good properties of geometric mean
47
Source code
• So far, everything we've talked about is software with source code available, either publicly/freely, e.g., Dhrystone, or for a small amount of $, e.g., SPEC CPU
48
• Benchmark scores/results usually depend on the compiler, compiler flags, processors, and systems
49
Outline
• Performance benchmark review	

• Some Android benchmarks
• What we did and what still can be done	

• Future
50
Back to Android
• What kinds of Benchmarks are available, or used to
compare performance	

• Apps with native benchmarks: Antutu, GeekBench

• Java apps, e.g., Quadrant

• Hybrid: with both native and Java, e.g., AndEBench and CF-Bench

• We also use SPEC CPU2000 and other stuff
internally
51
Ars Technica List
arrayOfPackageInfo[0] = new PackageInfo("com.aurorasoftworks.quadrant.ui.standard", false);
arrayOfPackageInfo[1] = new PackageInfo("com.aurorasoftworks.quadrant.ui.advanced", false);
arrayOfPackageInfo[2] = new PackageInfo("com.aurorasoftworks.quadrant.ui.professional", false);
arrayOfPackageInfo[3] = new PackageInfo("com.redlicense.benchmark.sqlite", false);
arrayOfPackageInfo[4] = new PackageInfo("com.antutu.ABenchMark", false);
arrayOfPackageInfo[5] = new PackageInfo("com.greenecomputing.linpack", false);
arrayOfPackageInfo[6] = new PackageInfo("com.greenecomputing.linpackpro", false);
arrayOfPackageInfo[7] = new PackageInfo("com.glbenchmark.glbenchmark27", false);
arrayOfPackageInfo[8] = new PackageInfo("com.glbenchmark.glbenchmark25", false);
arrayOfPackageInfo[9] = new PackageInfo("com.glbenchmark.glbenchmark21", false);
arrayOfPackageInfo[10] = new PackageInfo("ca.primatelabs.geekbench2", false);
arrayOfPackageInfo[11] = new PackageInfo("com.eembc.coremark", false);
arrayOfPackageInfo[12] = new PackageInfo("com.flexycore.caffeinemark", false);
arrayOfPackageInfo[13] = new PackageInfo("eu.chainfire.cfbench", false);
arrayOfPackageInfo[14] = new PackageInfo("gr.androiddev.BenchmarkPi", false);
arrayOfPackageInfo[15] = new PackageInfo("com.smartbench.twelve", false);
arrayOfPackageInfo[16] = new PackageInfo("com.passmark.pt_mobile", false);
arrayOfPackageInfo[17] = new PackageInfo("se.nena.nenamark2", false);
arrayOfPackageInfo[18] = new PackageInfo("com.samsung.benchmarks", false);
arrayOfPackageInfo[19] = new PackageInfo("com.samsung.benchmarks:db", false);
arrayOfPackageInfo[20] = new PackageInfo("com.samsung.benchmarks:es1", false);
arrayOfPackageInfo[21] = new PackageInfo("com.samsung.benchmarks:es2", false);
arrayOfPackageInfo[22] = new PackageInfo("com.samsung.benchmarks:g2d", false);
arrayOfPackageInfo[23] = new PackageInfo("com.samsung.benchmarks:fs", false);
arrayOfPackageInfo[24] = new PackageInfo("com.samsung.benchmarks:ks", false);
arrayOfPackageInfo[25] = new PackageInfo("com.samsung.benchmarks:cpu
CPU and Memory related: Quadrant, Antutu,
linpack, GeekBench, AndEBench (coremark),
CaffeineMark, Pi, PassMark, Samsung’s benchmark
52
Antutu 3.x
• CPU: integer, floating point	

• memory: RAM	

• Graphics: 2D, 3D	

• I/O: Database, SD read, SD write	

!
!
• What are you benchmarking	

• What's your workload

• How to calculate scores
53
What on earth are they
doing?
• Actually, no publicly available information

• But, with good enough background
knowledge and proper tools (we’ll talk
about these later), we can figure it out	

• It turns out most of them are from the
BYTE nbench (http://en.wikipedia.org/wiki/
NBench)
54
AnTuTu	
  3.x	
  CPU	
  and	
  Memory	
  Tests
nbench item        Used by Antutu  Antutu part  % on progress bar  Order  nbench category
NUMERIC SORT       yes             Integer      27%                4      integer
STRING SORT        yes             RAM          1%                 1      memory
BITFIELD           yes             RAM          1%                 2      memory
FP EMULATION       no
FOURIER            yes             floating     47%                7      floating point
ASSIGNMENT         yes             RAM          8%                 3      memory
IDEA               yes             Integer      27%                5      integer
HUFFMAN            yes             Integer      34%                6      integer
NEURAL NET         no
LU DECOMPOSITION   no
55
A closer look
▪ RAM
– String sort:
• string Heap sort: StrHeapSort()
• MoveMemory() -> memmove()
– Bit Field:
• Bit field test: DoBitops()
– Assignment:
• Task Assignment test: DoAssignment()
▪ Integer
– Numeric sort:
• Numeric heap sort: NumHeapSort()
– IDEA:
• IDEA encryption and decryption: cipher_idea()
– Huffman:
• Huffman encoding
▪ Floating point:
– Fourier:
• Fourier transform: pow(), sin(), cos()
56
for(i=top; i>0; --i)
{
    strsift(optrarray,strarray,numstrings,0,i);

    /* temp = string[0] */
    tlen=*strarray;
    MoveMemory((farvoid *)&temp[0],          /* Perform exchange */
        (farvoid *)strarray,
        (unsigned long)(tlen+1));

    /* string[0]=string[i] */
    tlen=*(strarray+*(optrarray+i));
    stradjust(optrarray,strarray,numstrings,0,tlen);
    MoveMemory((farvoid *)strarray,
        (farvoid *)(strarray+*(optrarray+i)),
        (unsigned long)(tlen+1));

    /* string[i]=temp */
    tlen=temp[0];
    stradjust(optrarray,strarray,numstrings,i,tlen);
    MoveMemory((farvoid *)(strarray+*(optrarray+i)),
        (farvoid *)&temp[0],
        (unsigned long)(tlen+1));
}
String Sort in NBench
• Sorts an array of strings
of arbitrary length	

• Test memory movement
performance	

• Non-sequential
performance of cache,
with added burden that
moves are byte-wide and
can occur on odd
address boundaries
57
Bit field in NBench
• Executes 3 bit manipulation functions
• Exercises "bit twiddling“ performance. Travels through
memory bit-by-bit in a sequential fashion; different from sorts
in that data is merely altered in place
• Operations:
• Set: OR 1
• Clear: AND 0
• Toggle: XOR
• Set, clear: ToggleBitRun()
• Toggle: FlipBitRun()
static void ToggleBitRun(farulong *bitmap, /* Bitmap */
ulong bit_addr, /* Address of bits to set */
ulong nbits, /* # of bits to set/clr */
uint val) /* 1 or 0 */
{
unsigned long bindex; /* Index into array */
unsigned long bitnumb; /* Bit number */
!
while(nbits--)
{
#ifdef LONG64
bindex=bit_addr>>6; /* Index is number /64 */
bitnumb=bit_addr % 64; /* Bit number in word */
#else
bindex=bit_addr>>5; /* Index is number /32 */
bitnumb=bit_addr % 32; /* bit number in word */
#endif
if(val)
bitmap[bindex]|=(1L<<bitnumb);
else
bitmap[bindex]&=~(1L<<bitnumb);
bit_addr++;
}
return;
}
58
Assignment in NBench
• The test moves through
large integer arrays in both
row-wise and column-wise
fashion. Cache/memory
with good sequential
performance should see a
boost (memory is altered in
place -- no moving as in a
sort operation)	

• Yes, basically, sequential
array assignment with some
kind of table look-ups
/*
** Step through rows. For each one that is not currently
** assigned, see if the row has only one zero in it. If so,
** mark that as an assigned row/col. Eliminate other zeros
** in the same column.
*/
for(i=0;i<ASSIGNROWS;i++)
{ numzeros=0;
for(j=0;j<ASSIGNCOLS;j++)
if(tableau[i][j]==0L)
if(assignedtableau[i][j]==0)
{ numzeros++;
selected=j;
}
if(numzeros==1)
{ numassigns++;
totnumassigns++;
assignedtableau[i][selected]=1;
for(k=0;k<ASSIGNROWS;k++)
if((k!=i) &&
(tableau[k][selected]==0))
assignedtableau[k][selected]=2;
}
}
59
Numeric Sort in NBench
• Sorts an array of long
integers with heap sort	

• Generic integer
performance. Should
exercise non-sequential
performance of cache
(or memory if cache is
less than 8K). Moves 32-
bit longs at a time, so
16-bit processors will be
at a disadvantage
static void NumHeapSort(farlong *array,
    ulong bottom,            /* Lower bound */
    ulong top)               /* Upper bound */
{
    ulong temp;              /* Used to exchange elements */
    ulong i;                 /* Loop index */

    /*
    ** First, build a heap in the array
    */
    for(i=(top/2L); i>0; --i)
        NumSift(array,i,top);

    /*
    ** Repeatedly extract maximum from heap and place it at the
    ** end of the array. When we get done, we'll have a sorted
    ** array.
    */
    for(i=top; i>0; --i)
    {
        NumSift(array,bottom,i);
        temp=*array;         /* Perform exchange */
        *array=*(array+i);
        *(array+i)=temp;
    }
    return;
}
60
static void cipher_idea(u16 in[4],
    u16 out[4],
    register IDEAkey Z)
{
    register u16 x1, x2, x3, x4, t1, t2;
    /* register u16 t16;
       register u16 t32; */
    int r=ROUNDS;

    x1=*in++;
    x2=*in++;
    x3=*in++;
    x4=*in;

    do {
        MUL(x1,*Z++);
        x2+=*Z++;
        x3+=*Z++;
        MUL(x4,*Z++);

        t2=x1^x3;
        MUL(t2,*Z++);
        t1=t2+(x2^x4);
        MUL(t1,*Z++);
        t2=t1+t2;

        x1^=t1;
        x4^=t2;

        t2^=x2;
        x2=x3^t1;
        x3=t2;
    } while(--r);
    MUL(x1,*Z++);
    *out++=x1;
    *out++=x3+*Z++;
    *out++=x2+*Z++;
    MUL(x4,*Z);
    *out=x4;
    return;
}
IDEA Encryption in NBench
• IDEA: a new block
cipher when nbench was
in development	

• Moves through data
sequentially in 16-bit
chunks
61
Huffman in NBench
• Everybody knows Huffman code, right?	

• A combination of byte operations, bit twiddling, and overall integer
manipulation
.....
/*
** Huffman tree built...compress the plaintext
*/
bitoffset=0L; /* Initialize bit offset */
for(i=0;i<arraysize;i++)
{
c=(int)plaintext[i]; /* Fetch character */
/*
** Build a bit string for byte c
*/
bitstringlen=0;
while(hufftree[c].parent!=-2)
{ if(hufftree[hufftree[c].parent].left==c)
bitstring[bitstringlen]='0';
else
bitstring[bitstringlen]='1';
c=hufftree[c].parent;
bitstringlen++;
}
.....
62
Fourier in NBench
• No, not FFT, 	

• Good measure of transcendental and trigonometric performance of FPU. Little array
activity, so this test should not be dependent of cache or memory architecture
static double thefunction(double x,      /* Independent variable */
    double omegan,                       /* Omega * term */
    int select)                          /* Choose term */
{
    /*
    ** Use select to pick which function we call.
    */
    switch(select)
    {
        case 0: return(pow(x+(double)1.0,x));
        case 1: return(pow(x+(double)1.0,x) * cos(omegan * x));
        case 2: return(pow(x+(double)1.0,x) * sin(omegan * x));
    }
63
Neural Net in NBench
• A robust algorithm for
solving linear equations	

• Small-array floating-point
test heavily dependent
on the exponential
function; less dependent
on overall FPU
performance
64
LU Decomposition in NBench
• LU Decomposition	

• Yes, the LU decomposition
you learned in linear
algebra	

• A floating-point test that
moves through arrays in
both row-wise and
column-wise fashion.
Exercises only fundamental
math operations (+, -, *, /)
65
GeekBench
• A cross-platform one	

• The only publicly available one we could use to compare
Android, iOS, and other platforms	

• Quite clearly described test items	

• http://support.primatelabs.com/kb/geekbench/geekbench-3-
benchmarks	

• Explaining how to interpret results	

• http://support.primatelabs.com/kb/geekbench/interpreting-
geekbench-3-scores	

• Source code available if you pay
66
Vellamo
• HTML5	

• Metal: Dhrystone, Linpack, Branch-K, Stream
5.9, RamJam, Storage	

• some are well-known; some are written
by Quic?	

• Anyway, all of them are described at http://
www.quicinc.com/vellamo/test-descriptions/
67
CFBench
• Used by some people, 'cause

• it tests both Java and native versions

• its author is quite active in the xda-developers forum

• Some problems

• no good description of tests

• some code is wrong, e.g.,

• its Native Memory Read test is not testing memory read, 'cause the malloc()ed array is not initialized
68
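A minimal sketch of the pitfall in the last bullet (hypothetical sizes; not CF-Bench's actual code): on Linux, a large malloc()ed block is typically backed by untouched, copy-on-write zero pages until it is written, so a read-only loop over an uninitialized buffer can keep hitting the same zero page (and the cache) instead of streaming from DRAM.

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    size_t n = 64u * 1024 * 1024;              /* 64 MiB, hypothetical size */
    unsigned int *buf = malloc(n);             /* never written/initialized */
    unsigned long long sum = 0;
    if (!buf) return 1;
    for (size_t i = 0; i < n / sizeof *buf; i++)
        sum += buf[i];                         /* "memory read" loop        */
    printf("%llu\n", sum);                     /* keep the loop from being  */
    return 0;                                  /* optimized away            */
}

Writing (for example, memset()ing) the buffer before the timed loop forces real pages to be allocated, so the loop measures what it claims to.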
Outline
• Performance benchmark review	

• Some Android benchmarks	

• What we did and what still can be done
• Future
69
How do we improve
benchmark
performance
70
• In the good old days, we had source code; we compiled and ran
benchmark programs

• In current Android ecosystem	

• Usually we don’t have source	

• Profiling: oprofile, perf, DS-5	

• profiling sometimes doesn’t report real bottleneck
function, e.g., static functions usually are inlined and don’t
have symbol in shipped binaries	

• binutils: nm, readelf, objdump, gdb	

• Improving libraries, e.g., libc and libm, and runtime system, e.g.,
JIT of Dalvik, used by those benchmarks
71
Antutu 3.x
• memmove() in bionic --> bcopy() in C	

• rewrite with NEON assembly code	

• pow(), sin(), cos() in C	

• rewrite them with assembly
72
bcopy() in bionic
• MoveMemory() in nbench
-> memmove() in bionic -
> bcopy() in bionic	

• memcpy() assembly in
bionic and there are
processor specific ones
(CA9, CA15, Krait).
NEON (vector load/
store) helps	

• not for bcopy()
in bionic/libc/bionic/memmove.c
!
void *memmove(void *dst, const void *src, size_t n)
{
const char *p = src;
char *q = dst;
/* We can use the optimized memcpy if the source and destination
* don't overlap.
*/
if (__builtin_expect(((q < p) && ((size_t)(p - q) >= n))
|| ((p < q) && ((size_t)(q - p) >= n)), 1)) {
return memcpy(dst, src, n);
} else {
bcopy(src, dst, n);
return dst;
}
}
in bionic/libc/string/bcopy.c
/*
* Copy a block of memory, handling overlap.
* This is the routine that actually implements
* (the portable versions of) bcopy, memcpy, and memmove.
*/
#ifdef MEMCOPY
void *
memcpy(void *dst0, const void *src0, size_t length)
#else
#ifdef MEMMOVE
void *
memmove(void *dst0, const void *src0, size_t length)
#else
void
bcopy(const void *src0, void *dst0, size_t length)
#endif
#endif
{
.....
73
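To illustrate what "rewrite with NEON assembly code" means here, a minimal sketch (not the actual bionic or vendor code) of a NEON-accelerated forward copy inner loop using the standard ARM NEON intrinsics; a real memmove() replacement would still need the overlap handling shown above.

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inner loop: copy 16 bytes per iteration with NEON vector
 * load/store; the tail is handled byte by byte. Forward copy only.      */
static void copy_neon(uint8_t *dst, const uint8_t *src, size_t n)
{
    while (n >= 16) {
        uint8x16_t v = vld1q_u8(src);   /* 128-bit vector load  */
        vst1q_u8(dst, v);               /* 128-bit vector store */
        src += 16;
        dst += 16;
        n   -= 16;
    }
    while (n--)
        *dst++ = *src++;
}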
Antutu 3.x
• For people with source code	

• Selection of toolchain and compiler options
may cause huge difference, e.g., bit field	

• Some versions of the x86 binary for Antutu 3.x were compiled with Intel's compiler; bit-by-bit operations turned into word-wide (32-bit) operations, and the speedup is about 70x
74
Stream copy is usually turned into memcpy()
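A minimal illustration of the point (the function name is made up): with optimization enabled, compilers such as GCC can recognize a plain element-copy loop as a block copy and replace it with a call to memcpy(), so the "Copy" kernel may end up measuring the platform's memcpy() implementation rather than the loop the source seems to describe.

/* With -O2/-O3, GCC's loop-pattern recognition may turn this whole loop
 * into a single memcpy(c, a, n * sizeof(double)) call.                  */
void stream_copy(double *restrict c, const double *restrict a, long n)
{
    for (long j = 0; j < n; j++)
        c[j] = a[j];
}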
75
remote gdb
1. get /system/bin/app_process and /system/bin/linker of the target system and necessary shared libraries, e.g., /data/data/eu.chainfire.cfbench/lib/libCFBench.so

• adb pull /system/bin/app_process
• adb pull /system/bin/linker lib/armeabi-v7a/
• adb pull /data/data/eu.chainfire.cfbench/lib/libCFBench.so lib/armeabi-v7a/
2. arm-linux-gnueabi-gdb ./app_process	

3. on the target device, attach gdbserver to the running process you wanna debug	

• ./gdbserver --attach :5039 3484	

4. set shared library search path	

• (gdb) set solib-search-path /Users/freedom/tmp/cfbench/lib/armeabi-v7a	

5. ‘adb forward tcp:5039 tcp:5039’ and set remote target	

• (gdb) target remote :5039	

6. you can set breakpoints, print backtrace, disassemble, etc.
76
• (gdb) b Java_eu_chainfire_cfbench_BenchNative_benchMemReadAligned	

• (gdb) disassemble
Dump of assembler code for function Java_eu_chainfire_cfbench_BenchNative_benchMemReadAligned:	

0x74b65848 <+0>: stmdb sp!, {r4, r5, r6, r7, r8, r9, r10, lr}	

=> 0x74b6584c <+4>: bl 0x74b654ac <loadLib>	

0x74b65850 <+8>: mov.w r0, #1048576 ; 0x100000	

0x74b65854 <+12>: blx 0x74b65358	

0x74b65858 <+16>: movs r6, #0	

0x74b6585a <+18>: movw r9, #9999 ; 0x270f	

0x74b6585e <+22>: mov r8, r0	

0x74b65860 <+24>: bl 0x74b6547c <getTickCount>	

0x74b65864 <+28>: add.w r5, r8, #1048576 ; 0x100000	

0x74b65868 <+32>: mov r10, r0	

0x74b6586a <+34>: mov r3, r8	

0x74b6586c <+36>: ldr.w r2, [r3], #4	

0x74b65870 <+40>: cmp r3, r5	

0x74b65872 <+42>: add r4, r2	

0x74b65874 <+44>: bne.n 0x74b6586c <Java_eu_chainfire_cfbench_BenchNative_benchMemReadAligned+36>	

0x74b65876 <+46>: bl 0x74b6547c <getTickCount>	

0x74b6587a <+50>: adds r6, #1	

0x74b6587c <+52>: rsb r7, r10, r0	

0x74b65880 <+56>: cmp r7, r9	

77
Quadrant
• Written in Java	

• CPU: Not really testing CPU	

• Memory: profiling shows that memcpy() is heavily in use

• What can we do	

• optimize the JIT part of the DVM
78
What other possible
ways?
• binary translation during	

• installation time	

• run time
79
Wrap-up
• Popular CPU and Memory benchmarks on
Android mostly don’t reflect real CPU
performance	

• We know CPU performance != System
performance != user-perceived
performance	

• There is always room for improvement
80
So?
81
Recent progress
• EEMBC’s AndEBench 2.0 is under development (http://
www.eembc.org/press/pressrelease/130128.html)	

• Qualcomm asked BDTi to develop new benchmark
(http://www.qualcomm.com/media/blog/2013/08/16/
mobile-benchmarking-turning-corner-user-
experience).	

• Samsung with other vendors launched MobileBench
Consortium last year	

• Antutu is still growing
82
Thanks!
Advertisement
• MediaTek joined
linaro.org last month	

• linaro.org is an NPO
working on open source
Linux/Android related
stuff for ARM-based
SoCs	

• So MTK is getting more
open recently	

• And, it’s looking for
open source engineers	

• Talk to guys at MTK
booth or me	

• There are more non-
open source jobs
84
backup
85
Some References to Understand
Performance Benchmark
• Raj Jain, "The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling", Wiley, 1991

• Quantitative Approach

• A good SPEC introduction article, http://mrob.com/pub/comp/benchmarks/spec.html

• Kaivalya M. Dixit, "Overview of the SPEC Benchmarks," http://people.cs.uchicago.edu/~chliu/doc/benchmark/chapter9.pdf
86
Basic system parameters	
  
------------------------------------------------------------------------------	
  
Host OS Description Mhz tlb cache mem scal	
  
pages line par load	
  
bytes 	
  
--------- ------------- ----------------------- ---- ----- ----- ------ ----	
  
localhost Linux 3.4.5-g armv7l-linux-gnu 1696 7 64 4.4700 1	
  
!
Processor, Processes - times in microseconds - smaller is better	
  
------------------------------------------------------------------------------	
  
Host OS Mhz null null open slct sig sig fork exec sh 	
  
call I/O stat clos TCP inst hndl proc proc proc	
  
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----	
  
localhost Linux 3.4.5-g 1696 0.49 0.67 2.54 5.95 8.52 0.67 5.05 876. 1668 4654	
  
!
Basic integer operations - times in nanoseconds - smaller is better	
  
-------------------------------------------------------------------	
  
Host OS intgr intgr intgr intgr intgr 	
  
bit add mul div mod 	
  
--------- ------------- ------ ------ ------ ------ ------ 	
  
localhost Linux 3.4.5-g 1.0700 0.1100 3.4000 90.5 14.8	
  
!
Basic float operations - times in nanoseconds - smaller is better	
  
-----------------------------------------------------------------	
  
87
Context switching - times in microseconds - smaller is better	
  
-------------------------------------------------------------------------	
  
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K	
  
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw	
  
--------- ------------- ------ ------ ------ ------ ------ ------- -------	
  
localhost Linux 3.4.5-g 8.9700 4.9000 6.1400 12.3 7.68000 57.6	
  
!
*Local* Communication latencies in microseconds - smaller is better	
  
---------------------------------------------------------------------	
  
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP	
  
ctxsw UNIX UDP TCP conn	
  
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----	
  
localhost Linux 3.4.5-g 8.970 17.6 23.9 47.5 71.3 357.	
  
!
File & VM system latencies in microseconds - smaller is better	
  
-------------------------------------------------------------------------------	
  
Host OS 0K File 10K File Mmap Prot Page 100fd	
  
Create Delete Create Delete Latency Fault Fault selct	
  
--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----	
  
localhost Linux 3.4.5-g 700.0 1.259 2.55270 3.048	
  
!
*Local* Communication bandwidths in MB/s - bigger is better	
  
-----------------------------------------------------------------------------	
  
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem	
  
88
PARSEC content
• Blackscholes: This application is an Intel RMS benchmark. It calculates the prices for a portfolio of European options analytically with the Black-Scholes partial differential equation (PDE). There is no closed-form expression for the Black-Scholes equation and as such it must be computed numerically.
• Bodytrack: This computer vision application is an Intel RMS workload which tracks a human body with multiple cameras through an image sequence. This benchmark was included due to the increasing significance of computer vision algorithms in areas such as video surveillance, character animation and computer interfaces.
• Canneal: This kernel was developed by Princeton University. It uses cache-aware simulated annealing (SA) to minimize the routing cost of a chip design. Canneal uses fine-grained parallelism with a lock-free algorithm and a very aggressive synchronization strategy that is based on data race recovery instead of avoidance.
• Dedup: This kernel was developed by Princeton University. It compresses a data stream with a combination of global and local compression that is called 'deduplication'. The kernel uses a pipelined programming model to mimic real-world implementations. The reason for the inclusion of this kernel is that deduplication has become a mainstream method for new-generation backup storage systems.
• Facesim: This Intel RMS application was originally developed by Stanford University. It computes a visually realistic animation of the modeled face by simulating the underlying physics. The workload was included in the benchmark suite because an increasing number of animations employ physical simulation to create more realistic effects.
• Ferret: This application is based on the Ferret toolkit which is used for content-based similarity search. It was developed by Princeton University. The reason for the inclusion in the benchmark suite is that it represents emerging next-generation search engines for non-text document data types. In the benchmark, we have configured the Ferret toolkit for image similarity search. Ferret is parallelized using the pipeline model.
89
PARSEC content
• Fluidanimate: This Intel RMS application uses an extension of the Smoothed Particle Hydrodynamics (SPH) method to simulate an incompressible fluid for interactive animation purposes. It was included in the PARSEC benchmark suite because of the increasing significance of physics simulations for animations.
• Freqmine: This application employs an array-based version of the FP-growth (Frequent Pattern-growth) method for Frequent Itemset Mining (FIMI). It is an Intel RMS benchmark which was originally developed by Concordia University. Freqmine was included in the PARSEC benchmark suite because of the increasing use of data mining techniques.
• Raytrace: This Intel RMS application uses a version of the raytracing method that would typically be employed for real-time animations such as computer games. It is optimized for speed rather than realism. The computational complexity of the algorithm depends on the resolution of the output image and the scene.
• Streamcluster: This RMS kernel was developed by Princeton University and solves the online clustering problem. Streamcluster was included in the PARSEC benchmark suite because of the importance of data mining algorithms and the prevalence of problems with streaming characteristics.
• Swaptions: The application is an Intel RMS workload which uses the Heath-Jarrow-Morton (HJM) framework to price a portfolio of swaptions. Swaptions employs Monte Carlo (MC) simulation to compute the prices.
• Vips: This application is based on the VASARI Image Processing System (VIPS) which was originally developed through several projects funded by European Union (EU) grants. The benchmark version is derived from a print on demand service that is offered at the National Gallery of London, which is also the current maintainer of the system. The benchmark includes fundamental image operations such as an affine transformation and a convolution.
• X264
90

Weitere ähnliche Inhalte

Was ist angesagt?

Android Internals at Linaro Connect Asia 2013
Android Internals at Linaro Connect Asia 2013Android Internals at Linaro Connect Asia 2013
Android Internals at Linaro Connect Asia 2013Opersys inc.
 
ARM IoT Firmware Emulation Workshop
ARM IoT Firmware Emulation WorkshopARM IoT Firmware Emulation Workshop
ARM IoT Firmware Emulation WorkshopSaumil Shah
 
U boot porting guide for SoC
U boot porting guide for SoCU boot porting guide for SoC
U boot porting guide for SoCMacpaul Lin
 
Scheduling in Android
Scheduling in AndroidScheduling in Android
Scheduling in AndroidOpersys inc.
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceBrendan Gregg
 
(알도개) GraalVM – 자바를 넘어선 새로운 시작의 서막
(알도개) GraalVM – 자바를 넘어선 새로운 시작의 서막(알도개) GraalVM – 자바를 넘어선 새로운 시작의 서막
(알도개) GraalVM – 자바를 넘어선 새로운 시작의 서막Jay Park
 
[232] 성능어디까지쥐어짜봤니 송태웅
[232] 성능어디까지쥐어짜봤니 송태웅[232] 성능어디까지쥐어짜봤니 송태웅
[232] 성능어디까지쥐어짜봤니 송태웅NAVER D2
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf toolsBrendan Gregg
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsBrendan Gregg
 
IPC in Microkernel Systems, Capabilities
IPC in Microkernel Systems, CapabilitiesIPC in Microkernel Systems, Capabilities
IPC in Microkernel Systems, CapabilitiesMartin Děcký
 
Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Brendan Gregg
 
An Overview of [Linux] Kernel Lock Improvements -- Linuxcon NA 2014
An Overview of [Linux] Kernel Lock Improvements -- Linuxcon NA 2014An Overview of [Linux] Kernel Lock Improvements -- Linuxcon NA 2014
An Overview of [Linux] Kernel Lock Improvements -- Linuxcon NA 2014Davidlohr Bueso
 
Android Boot Time Optimization
Android Boot Time OptimizationAndroid Boot Time Optimization
Android Boot Time OptimizationKan-Ru Chen
 
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014Amazon Web Services
 
RedHat OpenStack Platform Overview
RedHat OpenStack Platform OverviewRedHat OpenStack Platform Overview
RedHat OpenStack Platform Overviewindevlab
 

Was ist angesagt? (20)

BusyBox for Embedded Linux
BusyBox for Embedded LinuxBusyBox for Embedded Linux
BusyBox for Embedded Linux
 
Android Internals at Linaro Connect Asia 2013
Android Internals at Linaro Connect Asia 2013Android Internals at Linaro Connect Asia 2013
Android Internals at Linaro Connect Asia 2013
 
ARM IoT Firmware Emulation Workshop
ARM IoT Firmware Emulation WorkshopARM IoT Firmware Emulation Workshop
ARM IoT Firmware Emulation Workshop
 
U boot porting guide for SoC
U boot porting guide for SoCU boot porting guide for SoC
U boot porting guide for SoC
 
Scheduling in Android
Scheduling in AndroidScheduling in Android
Scheduling in Android
 
Split lock
Split lockSplit lock
Split lock
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
 
(알도개) GraalVM – 자바를 넘어선 새로운 시작의 서막
(알도개) GraalVM – 자바를 넘어선 새로운 시작의 서막(알도개) GraalVM – 자바를 넘어선 새로운 시작의 서막
(알도개) GraalVM – 자바를 넘어선 새로운 시작의 서막
 
Android Internals
Android InternalsAndroid Internals
Android Internals
 
[232] 성능어디까지쥐어짜봤니 송태웅
[232] 성능어디까지쥐어짜봤니 송태웅[232] 성능어디까지쥐어짜봤니 송태웅
[232] 성능어디까지쥐어짜봤니 송태웅
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf tools
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
 
IPC in Microkernel Systems, Capabilities
IPC in Microkernel Systems, CapabilitiesIPC in Microkernel Systems, Capabilities
IPC in Microkernel Systems, Capabilities
 
Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016Broken Linux Performance Tools 2016
Broken Linux Performance Tools 2016
 
Embedded Android : System Development - Part IV
Embedded Android : System Development - Part IVEmbedded Android : System Development - Part IV
Embedded Android : System Development - Part IV
 
An Overview of [Linux] Kernel Lock Improvements -- Linuxcon NA 2014
An Overview of [Linux] Kernel Lock Improvements -- Linuxcon NA 2014An Overview of [Linux] Kernel Lock Improvements -- Linuxcon NA 2014
An Overview of [Linux] Kernel Lock Improvements -- Linuxcon NA 2014
 
Android Boot Time Optimization
Android Boot Time OptimizationAndroid Boot Time Optimization
Android Boot Time Optimization
 
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
 
Firmadyne
FirmadyneFirmadyne
Firmadyne
 
RedHat OpenStack Platform Overview
RedHat OpenStack Platform OverviewRedHat OpenStack Platform Overview
RedHat OpenStack Platform Overview
 

Andere mochten auch

Chip Ex2010 Gert Goossens
Chip Ex2010 Gert GoossensChip Ex2010 Gert Goossens
Chip Ex2010 Gert GoossensAlona Gradman
 
BKK16-410 SoC Idling & CPU Cluster PM
BKK16-410 SoC Idling & CPU Cluster PMBKK16-410 SoC Idling & CPU Cluster PM
BKK16-410 SoC Idling & CPU Cluster PMLinaro
 
SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016Koan-Sin Tan
 
Learning&Teaching Systems Ppt
Learning&Teaching Systems PptLearning&Teaching Systems Ppt
Learning&Teaching Systems PptKyle
 
Hcs Topic 2 Computer Structure V2
Hcs Topic 2  Computer Structure V2Hcs Topic 2  Computer Structure V2
Hcs Topic 2 Computer Structure V2Kyle
 
Harvard architecture
Harvard architectureHarvard architecture
Harvard architectureCarmen Ugay
 
Von Neumann vs Harvard Architecture
Von Neumann vs Harvard ArchitectureVon Neumann vs Harvard Architecture
Von Neumann vs Harvard ArchitectureOLSON MATUNGA
 
RISC Vs CISC, Harvard v/s Van Neumann
RISC Vs CISC, Harvard v/s Van NeumannRISC Vs CISC, Harvard v/s Van Neumann
RISC Vs CISC, Harvard v/s Van NeumannRavikumar Tiwari
 

Andere mochten auch (12)

Chip Ex2010 Gert Goossens
Chip Ex2010 Gert GoossensChip Ex2010 Gert Goossens
Chip Ex2010 Gert Goossens
 
BKK16-410 SoC Idling & CPU Cluster PM
BKK16-410 SoC Idling & CPU Cluster PMBKK16-410 SoC Idling & CPU Cluster PM
BKK16-410 SoC Idling & CPU Cluster PM
 
SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016SoC Idling for unconf COSCUP 2016
SoC Idling for unconf COSCUP 2016
 
Learning&Teaching Systems Ppt
Learning&Teaching Systems PptLearning&Teaching Systems Ppt
Learning&Teaching Systems Ppt
 
Hcs Topic 2 Computer Structure V2
Hcs Topic 2  Computer Structure V2Hcs Topic 2  Computer Structure V2
Hcs Topic 2 Computer Structure V2
 
Computer Measures of Performance
Computer Measures of PerformanceComputer Measures of Performance
Computer Measures of Performance
 
RISC AND CISC PROCESSOR
RISC AND CISC PROCESSORRISC AND CISC PROCESSOR
RISC AND CISC PROCESSOR
 
Harvard architecture
Harvard architectureHarvard architecture
Harvard architecture
 
Von Neumann vs Harvard Architecture
Von Neumann vs Harvard ArchitectureVon Neumann vs Harvard Architecture
Von Neumann vs Harvard Architecture
 
CISC & RISC Architecture
CISC & RISC Architecture CISC & RISC Architecture
CISC & RISC Architecture
 
RISC Vs CISC, Harvard v/s Van Neumann
RISC Vs CISC, Harvard v/s Van NeumannRISC Vs CISC, Harvard v/s Van Neumann
RISC Vs CISC, Harvard v/s Van Neumann
 
performance measures
performance measuresperformance measures
performance measures
 

Ähnlich wie Understanding Android Benchmarks

What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performancePiotr Przymus
 
Cognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & TricksCognos Performance Tuning Tips & Tricks
Cognos Performance Tuning Tips & TricksSenturus
 
Application Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance CenterApplication Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance Centerinside-BigData.com
 
The NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit Ebner
The NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit EbnerThe NRB Group mainframe day 2021 - DevOps on Z - Jerome Klimm - Benoit Ebner
Understanding Android Benchmarks

  • 17. MFLOPS and what's wrong with them • Applied only to programs with floating-point operations • Operations instead of instructions, but still – floating-point instructions are different on machines with different ISAs – fast and slow floating-point operations • Possible solution: weights and source-code-level counts – ADD, SUB, COMPARE: 1 – DIVIDE, SQRT: 2 – EXP, SIN: 4 17
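To make the weighting idea concrete, here is a minimal sketch (my own illustration, not taken from any benchmark) that turns source-level operation counts into a weighted MFLOPS figure using exactly the weights on the slide; the operation counts and the elapsed time are made-up example inputs.

#include <stdio.h>

/* Weighted MFLOPS sketch: weights follow the slide
 * (ADD/SUB/COMPARE = 1, DIVIDE/SQRT = 2, EXP/SIN = 4).
 * Operation counts and elapsed time are example inputs only. */
int main(void)
{
    double add_sub_cmp = 8.0e6;   /* counted at source level */
    double div_sqrt    = 1.0e6;
    double exp_sin     = 0.5e6;
    double seconds     = 0.25;    /* measured elapsed time   */

    double weighted = add_sub_cmp * 1.0 + div_sqrt * 2.0 + exp_sin * 4.0;
    printf("normalized MFLOPS = %.1f\n", weighted / seconds / 1.0e6);
    return 0;
}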
  • 18. • The best choice of benchmarks to measure performance is real applications 18
  • 19. Problematic benchmarks • Kernels: small, key pieces of real applications, e.g., LINPACK • Toy programs: 100-line programs from beginning programming assignments, e.g., quicksort • Synthetic benchmarks: fake programs invented to try to match the profile and behavior of real applications, e.g., Dhrystone 19
  • 20. Why are they disreputed? • Small, fit in cache • Obsolete instruction mix • Uncontrolled source code • Prone to compiler tricks • Short runtimes on modern machines • Single-number performance characterization with a single benchmark • Difficult to reproduce results (short runtime and low-precision UNIX timer) 20
  • 21. Dhrystone • Source – http://homepages.cwi.nl/~steven/dry.c • < 1000 LoC – Size of CA15 binary compiled with bionic • Instructions: ~14 KiB (text 13918, data 467, bss 10266, dec 24660) 21
  • 22. Whetstone • Dhrystone is a pun on Whetstone • Source code: http://www.netlib.org/benchmark/whetstone.c • Sample results (Test / MFLOPS or MOPS / ms): N1 float 119.78 0.16, N2 float 171.98 0.78, N3 if 154.25 0.67, N4 fixpt 397.48 0.79, N5 cos 19.08 4.36, N6 float 84.22 6.41, N7 equal 86.84 2.13, N8 exp 5.95 6.26, MWIPS 463.97 21.55 22
  • 23. More on synthetic benchmarks • The best known examples of synthetic benchmarks are Whetstone and Dhrystone • Problems: – Compiler and hardware optimizations can artificially inflate performance of these benchmarks but not of real programs – The other side of the coin is that because these benchmarks are not natural programs, they don't reward optimizations of behaviors that occur in real programs • Examples: – Optimizing compilers can discard 25% of the Dhrystone code; examples include loops that are only executed once, making the loop overhead instructions unnecessary – Most Whetstone floating-point loops execute small numbers of times or include calls inside the loop. These characteristics are different from many real programs – Some more discussion in the 1st edition of the textbook 23
  • 24. LINPACK • LINPACK: a floating-point benchmark from the manual of the LINPACK library • Source – http://www.netlib.org/benchmark/linpackc – http://www.netlib.org/benchmark/linpackc.new • 883 LoC – Size of CA15 binary compiled with bionic • Instructions: ~13 KiB (text 12670, data 408, bss 0, dec 13086) 24
  • 25. 25
  • 26. CoreMark (1/2) • CoreMark is a benchmark that aims to measure the performance of central processing units (CPU) used in embedded systems. It was developed in 2009 by Shay Gal-On at EEMBC and is intended to become an industry standard, replacing the antiquated Dhrystone benchmark • The code is written in C and contains implementations of the following algorithms: – linked list processing, – matrix (mathematics) manipulation (common matrix operations), – state machine (determine if an input stream contains valid numbers), and – CRC • from Wikipedia 26
  • 27. CoreMark (2/2) • Source files and LoC: core_list_join.c 496, core_matrix.c 308, core_state.c 277, core_util.c 210 • CoreMark vs. Dhrystone – reporting rules – use of library calls, e.g., malloc(), is avoided – CRC to make sure the data are correct • However, CoreMark is a kernel + synthetic benchmark, still with quite a small footprint (text 18632, data 456, bss 20, dec 19108) 27
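To illustrate the "checksum the results" idea, here is a generic bitwise CRC-16 sketch of my own; CoreMark ships its own crcu8()/crcu16() helpers in core_util.c, which differ in detail, and the polynomial and example data below are assumptions for illustration only.

#include <stdint.h>
#include <stdio.h>

/* Generic MSB-first CRC-16 with the CCITT polynomial 0x1021, used here
 * only to show how a benchmark can validate that its workload produced
 * the expected data (compare the final CRC to a known-good value). */
static uint16_t crc16_update(uint16_t crc, uint8_t byte)
{
    crc ^= (uint16_t)byte << 8;
    for (int i = 0; i < 8; i++)
        crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                             : (uint16_t)(crc << 1);
    return crc;
}

int main(void)
{
    uint8_t results[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };  /* pretend benchmark output */
    uint16_t crc = 0;
    for (size_t i = 0; i < sizeof results; i++)
        crc = crc16_update(crc, results[i]);
    printf("result CRC = 0x%04x\n", crc);
    return 0;
}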
  • 28. So? • To overcome the danger of placing all eggs in one basket, collections of benchmark applications, called benchmark suites, are a popular measure of the performance of processors with a variety of applications • Standard Performance Evaluation Corporation (SPEC) 28
  • 29. 29
  • 30. Why CPU2000 in the 2010s? • Why ARM sticks with SPEC CPU2000 instead of CPU2006 – 1999 Q4 results, the earliest available CPU2000 results (http://www.spec.org/cpu2000/results/res1999q4/) • CINT2000 base: 133 – 424 • CFP2000 base: 126 – 514 – 2005 Opteron 144, 1.8 GHz • 1,440 (the CA15 at 1.9 GHz reported by NVIDIA is 1,168) – CPU2006 requires much more DRAM; 1 GiB of DRAM is not enough • SPECint2000 / SPECfp2000, all normalized to 1.0 GHz: CA9 356 / 298, CA7 320 / 236, CA15 537 / 567, Krait 326 / 350 30
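The per-GHz numbers in that table are simply score divided by clock. A trivial sketch of the normalization, using the Opteron figure from the slide as the example input (the helper name is mine):

#include <stdio.h>

/* Normalize a SPEC score by clock frequency, as in the per-GHz table
 * on the slide.  Example: Opteron 144 at 1.8 GHz, CINT2000 ~1440. */
static double per_ghz(double score, double freq_ghz)
{
    return score / freq_ghz;
}

int main(void)
{
    printf("Opteron 144: %.0f SPECint2000 per GHz\n", per_ghz(1440.0, 1.8));
    return 0;
}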
  • 31. SPEC numbers from the Quantitative Approach, 5th Edition 31
  • 32. How long does SPEC CPU2000 take? • About 1 hour to compile • Runtime: sum of base runtimes multiplied by 3 – E.g., 1.7 GHz CA15: (2256 + 3229) x 3 = 16,455 s ~= 4.57 hr – For 1.0 GHz: 4.57 x 1.7 = 7.77 hr – For CA7, assuming it is twice as slow: 7.77 x 2 = 15.54 hr • CINT2000 (Benchmark / Reference Time / Base Runtime / Base Ratio): 164.gzip 1400 215 652, 175.vpr 1400 198 707, 176.gcc 1100 94.8 1161, 181.mcf 1800 266 677, 186.crafty 1000 118 850, 197.parser 1800 291 619, 252.eon 1300 87.8 1480, 253.perlbmk 1800 172 1045, 254.gap 1100 107 1026, 255.vortex 1900 211 899, 256.bzip2 1500 203 740, 300.twolf 3000 399 752, SPECint_base2000 2256 854 • CFP2000: 168.wupwise 1600 162 991, 171.swim 3100 389 797, 172.mgrid 1800 339 532, 173.applu 2100 241 870, 177.mesa 1400 112 1254, 178.galgel 2900 201 1444, 179.art 2600 195 1332, 183.equake 1300 157 828, 187.facerec 1900 183 1036, 188.ammp 2200 353 623, 189.lucas 2000 134 1491, 191.fma3d 2100 212 988, 200.sixtrack 1100 241 456, 301.apsi 2600 310 839, SPECfp_base2000 3229 909.6 32
  • 33. Figure 1.16 SPEC2006 programs and the evolution of the SPEC benchmarks over time, with integer programs above the line and floating-point programs below the line. Of the 12 SPEC2006 integer programs, 9 are written in C, and the rest in C++. For the floating-point programs, the split is 6 in Fortran, 4 in C++, 3 in C, and 4 in mixed C and Fortran. The figure shows all 70 of the programs in the 1989, 1992, 1995, 2000, and 2006 releases. The benchmark descriptions on the left are for SPEC2006 only and do not apply to earlier versions. Programs in the same row from different generations of SPEC are generally not related; for example, fpppp is not a CFD code like bwaves. Gcc is the senior citizen of the group. Only 3 integer programs and 3 floating-point programs survived three or more generations. Note that all the floating-point programs are new for SPEC2006. Although a few are carried over from generation to generation, the version of the program changes and either the input or the size of the benchmark is often changed to increase its running time and to avoid perturbation in measurement or domination of the execution time by some factor other than CPU time. 33
  • 34. EEMBC • Embedded Microprocessor Benchmark Consortium (EEMBC): 41 kernels used to predict performance of different embedded applications: – Automotive/industrial – Consumer – Networking – Office automation – Telecommunication • The 3rd edition of the textbook showed some EEMBC results; the 4th edition changed its mind • Unmodified performance and "full-fury" performance • Kernels, reporting options – Not a good predictor of the relative performance of different embedded computers 34
  • 35. Report benchmark results • Reproducible – machine configuration (hardware, software (OS, compiler, etc.)) • Summarizing results – you should not add up different kinds of numbers • Some use a weighted average – ratio, compared with a reference machine • Geometric ratio – the geometric mean of the ratios is the same as the ratio of the geometric means – the ratio of the geometric means is equal to the geometric mean of the performance ratios 35
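As a small illustration of that property (my own sketch, not SPEC's tools), the geometric mean of the per-benchmark ratios equals the ratio of the two machines' geometric means, so the choice of reference machine does not change the comparison; the runtimes below are made up.

#include <math.h>
#include <stdio.h>

/* Geometric mean of n positive values, computed via logs for stability. */
static double geomean(const double *v, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += log(v[i]);
    return exp(s / n);
}

int main(void)
{
    /* Made-up runtimes (seconds) for 4 benchmarks on machines A and B. */
    double a[4] = { 200.0, 150.0, 300.0, 90.0 };
    double b[4] = { 100.0, 180.0, 120.0, 45.0 };
    double ratio[4];
    for (int i = 0; i < 4; i++)
        ratio[i] = a[i] / b[i];

    printf("geomean of ratios     = %.4f\n", geomean(ratio, 4));
    printf("ratio of the geomeans = %.4f\n", geomean(a, 4) / geomean(b, 4));
    return 0;   /* the two printed values are identical */
}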
  • 37. • Fallacy: benchmarks remain valid indefinitely – ability to resist "benchmark engineering" or "benchmarketing" – gcc is the only survivor from SPEC89 • Almost 70% of all programs from SPEC2000 or earlier were dropped from the next release 37
  • 38. Other benchmarks • Stream – to test memory bandwidth – it also tests floating-point performance – operations on floating-point (double, 8 bytes) arrays • copy, scale, add, triad • lmbench – micro-benchmarks to measure software/hardware overhead from a software perspective – lmbench paper (1996), http://www.bitmover.com/lmbench/lmbench-usenix.pdf • Kernels (name / kernel / bytes per iteration / FLOPS per iteration): COPY a(i) = b(i), 16, 0; SCALE a(i) = q*b(i), 16, 1; SUM a(i) = b(i) + c(i), 24, 1; TRIAD a(i) = b(i) + q*c(i), 24, 2 38
  • 39. Stream 5.10 for (k=0; k<NTIMES; k++) { times[0][k] = mysecond(); for (j=0; j<STREAM_ARRAY_SIZE; j++) c[j] = a[j]; times[0][k] = mysecond() - times[0][k]; times[1][k] = mysecond(); for (j=0; j<STREAM_ARRAY_SIZE; j++) b[j] = scalar*c[j]; times[1][k] = mysecond() - times[1][k]; times[2][k] = mysecond(); for (j=0; j<STREAM_ARRAY_SIZE; j++) c[j] = a[j]+b[j]; times[2][k] = mysecond() - times[2][k]; times[3][k] = mysecond(); for (j=0; j<STREAM_ARRAY_SIZE; j++) a[j] = b[j]+scalar*c[j]; times[3][k] = mysecond() - times[3][k]; } 39
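STREAM turns the timings collected above into bandwidth figures. A minimal sketch of that step, assuming the bytes-per-iteration counts from the earlier table (16 for copy/scale, 24 for sum/triad) and the best time across repetitions, which is roughly what the official stream.c reports; the array size and timings below are example inputs.

#include <stdio.h>

/* Convert a kernel's best (minimum) elapsed time into MB/s, using the
 * bytes-per-iteration counts from the slide.  Inputs are examples only. */
static double mbps(double bytes_per_iter, long n, double seconds)
{
    return 1.0e-6 * bytes_per_iter * (double)n / seconds;
}

int main(void)
{
    long n = 10000000;           /* STREAM_ARRAY_SIZE */
    double best_copy  = 0.021;   /* best-of-NTIMES time for the copy loop, s  */
    double best_triad = 0.034;   /* best-of-NTIMES time for the triad loop, s */

    printf("copy : %8.1f MB/s\n", mbps(16.0, n, best_copy));
    printf("triad: %8.1f MB/s\n", mbps(24.0, n, best_triad));
    return 0;
}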
  • 40. lmbench • lmbench is a micro-benchmark suite designed to focus attention on the basic building blocks of many common system applications, such as databases, simulations, software development, and networking 40
  • 41. Parallel? Let's look at other SPEC benchmarks • SPECapc for 3ds Max™ 2011, performance evaluation software for systems running Autodesk 3ds Max 2011. • SPECapcSM for Lightwave 3D 9.6, performance evaluation software for systems running NewTek LightWave 3D v9.6 software. • SPECjbb2005, evaluates the performance of server-side Java by emulating a three-tier client/server system (with emphasis on the middle tier). • SPECjEnterprise2010, a multi-tier benchmark for measuring the performance of Java 2 Enterprise Edition (J2EE) technology-based application servers. • SPECjms2007, Java Message Service performance • SPECjvm2008, measuring basic Java performance of a Java Runtime Environment on a wide variety of both client and server systems. • SPECapc, performance of several 3D-intensive popular applications on a given system • SPEC MPI2007, for evaluating performance of parallel systems using MPI (Message Passing Interface) applications. • SPEC OMP2001 V3.2, for evaluating performance of parallel systems using OpenMP (http://www.openmp.org) applications. • SPECpower_ssj2008, evaluates the energy efficiency of server systems. • SPECsfs2008, file server throughput and response time supporting both NFS and CIFS protocol access • SPECsip_Infrastructure2011, SIP server performance • SPECviewperf 11, performance of an OpenGL 3D graphics system, tested with various rendering tasks from real applications • SPECvirt_sc2010 ("SPECvirt"), evaluates the performance of datacenter servers used in virtualized server consolidation 41
  • 42. PARSEC • The Princeton Application Repository for Shared-Memory Computers (PARSEC) is a benchmark suite composed of multithreaded programs. The suite focuses on emerging workloads and was designed to be representative of next-generation shared-memory programs for chip-multiprocessors • Didn't really use it yet • http://parsec.cs.princeton.edu/ • Workloads and parallelization models (Pthreads / OpenMP / Intel TBB): blackscholes Yes/Yes/Yes, bodytrack Yes/Yes/Yes, canneal Yes/No/No, dedup Yes/No/No, facesim Yes/No/No, ferret Yes/No/No, fluidanimate Yes/No/Yes, freqmine No/Yes/No, raytrace Yes/No/No, streamcluster Yes/No/Yes, swaptions Yes/No/Yes, vips Yes/No/No, x264 Yes/No/No 42
  • 43. Are Dhrystone and the like useful? • Yes, if you know their limitations • Don't do marketing as if those benchmarks meant real user-perceived performance 43
  • 44. A7 Dhrystone, DMIPS/MHz: iPhone 5s 7.47, iPhone 5s (32-bit) 5.70, CA15 2.71, CA7 1.67, Krait 400 2.46 44
  • 45. A7 linpack, MFLOPS/GHz: iPhone 5s 722, iPhone 5s (32-bit) 723, CA15 449, CA7 119, Krait 400 299 45
  • 46. A7 CoreMark, CoreMark/MHz: iPhone 5s 5.72, iPhone 5s (32-bit) 4.45, CA15 3.67, CA7 2.46, Krait 400 3.30 46
  • 47. Different items • Example: GeekBench 3 • Arithmetic mean with different weights? How? • Good properties of the geometric mean 47
  • 48. Source code • So far, everything we have talked about is software with source code available, either publicly/freely (e.g., Dhrystone) or for a small amount of money (e.g., SPEC CPU) 48
  • 49. • Benchmark scores/results usually depend on the compiler, compiler flags, processors, and systems 49
  • 50. Outline • Performance benchmark review • Some Android benchmarks • What we did and what still can be done • Future 50
  • 51. Back to Android • What kinds of benchmarks are available or used to compare performance? • Apps with native benchmarks: Antutu, GeekBench • Java apps, e.g., Quadrant • Hybrid, with both native and Java parts, e.g., AndEBench and CF-Bench • We also use SPEC CPU2000 and other stuff internally 51
  • 52. Ars Technica List arrayOfPackageInfo[0]  =  new  PackageInfo("com.aurorasoftworks.quadrant.ui.standard",  false);   arrayOfPackageInfo[1]  =  new  PackageInfo("com.aurorasoftworks.quadrant.ui.advanced",  false);   arrayOfPackageInfo[2]  =  new  PackageInfo("com.aurorasoftworks.quadrant.ui.professional",  false);   arrayOfPackageInfo[3]  =  new  PackageInfo("com.redlicense.benchmark.sqlite",  false);   arrayOfPackageInfo[4]  =  new  PackageInfo("com.antutu.ABenchMark",  false);   arrayOfPackageInfo[5]  =  new  PackageInfo("com.greenecomputing.linpack",  false);   arrayOfPackageInfo[6]  =  new  PackageInfo("com.greenecomputing.linpackpro",  false);   arrayOfPackageInfo[7]  =  new  PackageInfo("com.glbenchmark.glbenchmark27",  false);   arrayOfPackageInfo[8]  =  new  PackageInfo("com.glbenchmark.glbenchmark25",  false);   arrayOfPackageInfo[9]  =  new  PackageInfo("com.glbenchmark.glbenchmark21",  false);   arrayOfPackageInfo[10]  =  new  PackageInfo("ca.primatelabs.geekbench2",  false);   arrayOfPackageInfo[11]  =  new  PackageInfo("com.eembc.coremark",  false);   arrayOfPackageInfo[12]  =  new  PackageInfo("com.flexycore.caffeinemark",  false);   arrayOfPackageInfo[13]  =  new  PackageInfo("eu.chainfire.cfbench",  false);   arrayOfPackageInfo[14]  =  new  PackageInfo("gr.androiddev.BenchmarkPi",  false);   arrayOfPackageInfo[15]  =  new  PackageInfo("com.smartbench.twelve",  false);   arrayOfPackageInfo[16]  =  new  PackageInfo("com.passmark.pt_mobile",  false);   arrayOfPackageInfo[17]  =  new  PackageInfo("se.nena.nenamark2",  false);   arrayOfPackageInfo[18]  =  new  PackageInfo("com.samsung.benchmarks",  false);   arrayOfPackageInfo[19]  =  new  PackageInfo("com.samsung.benchmarks:db",  false);   arrayOfPackageInfo[20]  =  new  PackageInfo("com.samsung.benchmarks:es1",  false);   arrayOfPackageInfo[21]  =  new  PackageInfo("com.samsung.benchmarks:es2",  false);   arrayOfPackageInfo[22]  =  new  PackageInfo("com.samsung.benchmarks:g2d",  false);   arrayOfPackageInfo[23]  =  new  PackageInfo("com.samsung.benchmarks:fs",  false);   arrayOfPackageInfo[24]  =  new  PackageInfo("com.samsung.benchmarks:ks",  false);   arrayOfPackageInfo[25]  =  new  PackageInfo("com.samsung.benchmarks:cpu   ! ! CPU and Memory related: Quadrant, Antutu, linpack, GeekBench, AndEBench (coremark), CaffeineMark, Pi, PassMark, Samsung’s benchmark 52
  • 53. Antutu 3.x • CPU: integer, floating point • Memory: RAM • Graphics: 2D, 3D • I/O: database, SD read, SD write • What are you benchmarking? • What's your workload? • How are the scores calculated? 53
  • 54. What on earth are they doing? • Actually, no publicly available information • But, with good enough background knowledge and proper tools (we'll talk about these later), we can figure it out • It turns out most of them are from the BYTE nbench (http://en.wikipedia.org/wiki/NBench) 54
  • 55. AnTuTu 3.x CPU and memory tests (nbench item / used by Antutu / Antutu part / Antutu percentage on progress bar / order / nbench category): NUMERIC SORT yes, Integer, 27%, 4, integer; STRING SORT yes, RAM, 1%, 1, memory; BITFIELD yes, RAM, 1%, 2, memory; FP EMULATION no; FOURIER yes, floating, 47%, 7, floating point; ASSIGNMENT yes, RAM, 8%, 3, memory; IDEA yes, Integer, 27%, 5, integer; HUFFMAN yes, Integer, 34%, 6, integer; NEURAL NET no; LU DECOMPOSITION no 55
  • 56. A closer look ▪ RAM – String sort: • string heap sort: StrHeapSort() • MoveMemory() -> memmove() – Bit field: • bit field test: DoBitops() – Assignment: • task assignment test: DoAssignment() ▪ Integer – Numeric sort: • numeric heap sort: NumHeapSort() – IDEA: • IDEA encryption and decryption: cipher_idea() – Huffman: • Huffman encoding ▪ Floating point: – Fourier: • Fourier transform: pow(), sin(), cos() 56
  • 57. for(i=top; i>0; --i) { strsift(optrarray,strarray,numstrings,0,i); /* temp = string[0] */ tlen=*strarray; MoveMemory((farvoid *)&temp[0], /* Perform exchange */ (farvoid *)strarray, (unsigned long)(tlen+1)); /* string[0]=string[i] */ tlen=*(strarray+*(optrarray+i)); stradjust(optrarray,strarray,numstrings,0,tlen); MoveMemory((farvoid *)strarray, (farvoid *)(strarray+*(optrarray+i)), (unsigned long)(tlen+1)); /* string[i]=temp */ tlen=temp[0]; stradjust(optrarray,strarray,numstrings,i,tlen); MoveMemory((farvoid *)(strarray+*(optrarray+i)), (farvoid *)&temp[0], (unsigned long)(tlen+1)); } String Sort in NBench • Sorts an array of strings of arbitrary length • Tests memory movement performance • Non-sequential performance of cache, with the added burden that moves are byte-wide and can occur on odd address boundaries 57
  • 58. Bit field in NBench • Executes 3 bit manipulation functions • Exercises "bit twiddling“ performance. Travels through memory bit-by-bit in a sequential fashion; different from sorts in that data is merely altered in place • Operations: • Set: OR 1 • Clear: AND 0 • Toggle: XOR • Set, clear: ToggleBitRun() • Toggle: FlipBitRun() static void ToggleBitRun(farulong *bitmap, /* Bitmap */ ulong bit_addr, /* Address of bits to set */ ulong nbits, /* # of bits to set/clr */ uint val) /* 1 or 0 */ { unsigned long bindex; /* Index into array */ unsigned long bitnumb; /* Bit number */ ! while(nbits--) { #ifdef LONG64 bindex=bit_addr>>6; /* Index is number /64 */ bitnumb=bit_addr % 64; /* Bit number in word */ #else bindex=bit_addr>>5; /* Index is number /32 */ bitnumb=bit_addr % 32; /* bit number in word */ #endif if(val) bitmap[bindex]|=(1L<<bitnumb); else bitmap[bindex]&=~(1L<<bitnumb); bit_addr++; } return; } 58
  • 59. Assignment in NBench • The test moves through large integer arrays in both row-wise and column-wise fashion. Cache/memory with good sequential performance should see a boost (memory is altered in place -- no moving as in a sort operation) • Yes, basically, sequential array assignment with some kind of table look-ups /* ** Step through rows. For each one that is not currently ** assigned, see if the row has only one zero in it. If so, ** mark that as an assigned row/col. Eliminate other zeros ** in the same column. */ for(i=0;i<ASSIGNROWS;i++) { numzeros=0; for(j=0;j<ASSIGNCOLS;j++) if(tableau[i][j]==0L) if(assignedtableau[i][j]==0) { numzeros++; selected=j; } if(numzeros==1) { numassigns++; totnumassigns++; assignedtableau[i][selected]=1; for(k=0;k<ASSIGNROWS;k++) if((k!=i) && (tableau[k][selected]==0)) assignedtableau[k][selected]=2; } } 59
  • 60. Numeric Sort in NBench • Sorts an array of long integers with heap sort • Generic integer performance. Should exercise non-sequential performance of cache (or memory if cache is less than 8K). Moves 32-bit longs at a time, so 16-bit processors will be at a disadvantage: static void NumHeapSort(farlong *array, ulong bottom, /* Lower bound */ ulong top) /* Upper bound */ { ulong temp; /* Used to exchange elements */ ulong i; /* Loop index */ /* ** First, build a heap in the array */ for(i=(top/2L); i>0; --i) NumSift(array,i,top); /* ** Repeatedly extract maximum from heap and place it at the ** end of the array. When we get done, we'll have a sorted ** array. */ for(i=top; i>0; --i) { NumSift(array,bottom,i); temp=*array; /* Perform exchange */ *array=*(array+i); *(array+i)=temp; } return; 60
  • 61. static void cipher_idea(u16 in[4], u16 out[4], register IDEAkey Z) { register u16 x1, x2, x3, x4, t1, t2; /* register u16 t16; register u16 t32; */ int r=ROUNDS; x1=*in++; x2=*in++; x3=*in++; x4=*in; do { MUL(x1,*Z++); x2+=*Z++; x3+=*Z++; MUL(x4,*Z++); t2=x1^x3; MUL(t2,*Z++); t1=t2+(x2^x4); MUL(t1,*Z++); t2=t1+t2; x1^=t1; x4^=t2; t2^=x2; x2=x3^t1; x3=t2; } while(--r); MUL(x1,*Z++); *out++=x1; *out++=x3+*Z++; *out++=x2+*Z++; MUL(x4,*Z); *out=x4; return; } IDEA Encryption in NBench • IDEA: a new block cipher when nbench was in development • Moves through data sequentially in 16-bit chunks 61
  • 62. Huffman in NBench • Everybody knows Huffman code, right? • A combination of byte operations, bit twiddling, and overall integer manipulation ..... /* ** Huffman tree built...compress the plaintext */ bitoffset=0L; /* Initialize bit offset */ for(i=0;i<arraysize;i++) { c=(int)plaintext[i]; /* Fetch character */ /* ** Build a bit string for byte c */ bitstringlen=0; while(hufftree[c].parent!=-2) { if(hufftree[hufftree[c].parent].left==c) bitstring[bitstringlen]='0'; else bitstring[bitstringlen]='1'; c=hufftree[c].parent; bitstringlen++; } ..... 62
  • 63. Fourier in NBench • No, not FFT • Good measure of transcendental and trigonometric performance of the FPU. Little array activity, so this test should not be dependent on cache or memory architecture: static double thefunction(double x, /* Independent variable */ double omegan, /* Omega * term */ int select) /* Choose term */ { /* ** Use select to pick which function we call. */ switch(select) { case 0: return(pow(x+(double)1.0,x)); case 1: return(pow(x+(double)1.0,x) * cos(omegan * x)); case 2: return(pow(x+(double)1.0,x) * sin(omegan * x)); } 63
  • 64. Neural Net in NBench • A small back-propagation neural network simulator • Small-array floating-point test heavily dependent on the exponential function; less dependent on overall FPU performance 64
  • 65. LU Decomposition in NBench • LU Decomposition • Yes, the LU decomposition you learned in linear algebra • A floating-point test that moves through arrays in both row-wise and column-wise fashion. Exercises only fundamental math operations (+, -, *, /) 65
  • 66. GeekBench • A cross-platform one • The only publicly available one we could use to compare Android, iOS, and other platforms • Quite clearly described test items • http://support.primatelabs.com/kb/geekbench/geekbench-3-benchmarks • Explaining how to interpret results • http://support.primatelabs.com/kb/geekbench/interpreting-geekbench-3-scores • Source code available if you pay 66
  • 67. Vellamo • HTML5 • Metal: Dhrystone, Linpack, Branch-K, Stream 5.9, RamJam, Storage • some are well-known; some are written by Quic? • Anyway, all of them are described at http://www.quicinc.com/vellamo/test-descriptions/ 67
  • 68. CFBench • Used by some people because • it tests both the Java and the native side • its author is quite active in the xda-developers forum • Some problems • no good description of the tests • some code is wrong, e.g., • its Native Memory Read test is not really testing memory reads, because the malloc()ed array is never initialized 68
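Why an uninitialized buffer is a problem: on Linux/Android, untouched anonymous pages are typically all backed by the shared zero page until their first write, so a "read" loop over a freshly malloc()ed array may never stream distinct data from DRAM the way a real read test should. A minimal sketch of the fix (my own illustration, not CFBench's code): write the buffer before timing the reads.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sum a buffer as a crude "memory read" kernel.  The important part is
 * the memset() before the timed loop: it forces every page to be
 * allocated and written, so the loop really reads distinct physical
 * memory instead of the shared zero page. */
int main(void)
{
    size_t n = 64u * 1024u * 1024u;          /* 64 MiB */
    uint32_t *buf = malloc(n);
    if (!buf) return 1;

    memset(buf, 0xA5, n);                    /* touch every page first */

    uint64_t sum = 0;
    for (size_t i = 0; i < n / sizeof *buf; i++)
        sum += buf[i];                       /* this is the timed region in a real test */

    printf("checksum: %llu\n", (unsigned long long)sum);
    free(buf);
    return 0;
}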
  • 69. Outline • Performance benchmark review • Some Android benchmarks • What we did and what still can be done • Future 69
  • 70. How do we improve benchmark performance 70
  • 71. • In the good old days, we had source code: we compiled and ran the benchmark programs • In the current Android ecosystem • we usually don't have the source • Profiling: oprofile, perf, DS-5 • profiling sometimes doesn't report the real bottleneck function, e.g., static functions are usually inlined and don't have symbols in shipped binaries • binutils: nm, readelf, objdump, gdb • Improving the libraries (e.g., libc and libm) and the runtime system (e.g., the JIT of Dalvik) used by those benchmarks 71
  • 72. Antutu 3.x • memmove() in bionic --> bcopy() in C • rewrite with NEON assembly code • pow(), sin(), cos() in C • rewrite them with assembly 72
  • 73. bcopy() in bionic • MoveMemory() in nbench -> memmove() in bionic -> bcopy() in bionic • there is memcpy() assembly in bionic, with processor-specific versions (CA9, CA15, Krait); NEON (vector load/store) helps • but not for bcopy(). In bionic/libc/bionic/memmove.c: void *memmove(void *dst, const void *src, size_t n) { const char *p = src; char *q = dst; /* We can use the optimized memcpy if the source and destination * don't overlap. */ if (__builtin_expect(((q < p) && ((size_t)(p - q) >= n)) || ((p < q) && ((size_t)(q - p) >= n)), 1)) { return memcpy(dst, src, n); } else { bcopy(src, dst, n); return dst; } } In bionic/libc/string/bcopy.c: /* * Copy a block of memory, handling overlap. * This is the routine that actually implements * (the portable versions of) bcopy, memcpy, and memmove. */ #ifdef MEMCOPY void * memcpy(void *dst0, const void *src0, size_t length) #else #ifdef MEMMOVE void * memmove(void *dst0, const void *src0, size_t length) #else void bcopy(const void *src0, void *dst0, size_t length) #endif #endif { ..... 73
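For reference, here is a minimal sketch (mine, not bionic's code) of what an overlap-safe copy that moves more than one byte at a time can look like; real NEON memcpy()/memmove() routines copy much larger chunks per iteration with vld1/vst1, but the idea of avoiding a byte-at-a-time fallback is the same. The function name my_memmove is an illustrative choice.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Overlap-aware copy that moves 8-byte chunks instead of single bytes.
 * memcpy() of a small fixed size is used as an unaligned-safe load/store
 * that compilers turn into plain word accesses. */
void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uint64_t tmp;

    if (d == s || n == 0)
        return dst;

    if (d < s) {                         /* forward copy is safe here */
        while (n >= sizeof tmp) {
            memcpy(&tmp, s, sizeof tmp);
            memcpy(d, &tmp, sizeof tmp);
            d += sizeof tmp; s += sizeof tmp; n -= sizeof tmp;
        }
        while (n--) *d++ = *s++;
    } else {                             /* copy backward to handle overlap */
        d += n; s += n;
        while (n >= sizeof tmp) {
            d -= sizeof tmp; s -= sizeof tmp; n -= sizeof tmp;
            memcpy(&tmp, s, sizeof tmp);
            memcpy(d, &tmp, sizeof tmp);
        }
        while (n--) *--d = *--s;
    }
    return dst;
}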
  • 74. Antutu 3.x • For people with source code • The selection of toolchain and compiler options may cause huge differences, e.g., for bit field • Some x86 binary versions of Antutu 3.x were compiled with Intel's compiler: bit-by-bit operations turned into word-wide (32-bit) operations, and the speedup is about 70x 74
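To show what "bit-by-bit turned into word-wide" means, here is a hedged sketch of my own: the first function sets a run of bits one at a time, in the spirit of nbench's ToggleBitRun(); the second builds a mask for the whole run inside each 32-bit word. This only illustrates the kind of transformation a compiler or a hand rewrite can do; it is not Antutu's or ICC's actual output, and the function names are mine.

#include <stdint.h>
#include <stdio.h>

/* Bit-by-bit: one read-modify-write per bit. */
static void set_run_bitwise(uint32_t *bitmap, unsigned long bit_addr,
                            unsigned long nbits)
{
    while (nbits--) {
        bitmap[bit_addr >> 5] |= 1u << (bit_addr % 32);
        bit_addr++;
    }
}

/* Word-wide: one read-modify-write per 32-bit word the run touches. */
static void set_run_wordwise(uint32_t *bitmap, unsigned long bit_addr,
                             unsigned long nbits)
{
    while (nbits) {
        unsigned long word = bit_addr >> 5;
        unsigned long off  = bit_addr % 32;
        unsigned long take = 32 - off;
        if (take > nbits) take = nbits;
        uint32_t mask = (take == 32) ? 0xFFFFFFFFu
                                     : (((1u << take) - 1u) << off);
        bitmap[word] |= mask;
        bit_addr += take;
        nbits    -= take;
    }
}

int main(void)
{
    uint32_t a[4] = {0}, b[4] = {0};
    set_run_bitwise(a, 5, 70);
    set_run_wordwise(b, 5, 70);
    printf("same result: %d\n",
           a[0] == b[0] && a[1] == b[1] && a[2] == b[2] && a[3] == b[3]);
    return 0;
}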
  • 75. The Stream copy kernel is usually turned into memcpy() by the compiler 75
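GCC's loop-distribution pass can recognize the plain element-by-element copy loop and replace it with a call to the library memcpy(), so the "copy" score then partly measures libc rather than the compiled loop; -fno-tree-loop-distribute-patterns (or -fno-builtin) is the usual way to keep the loop as written. A tiny sketch of the pattern in question, with array names following the STREAM code above; the array size is an arbitrary example.

#include <stddef.h>

#define N 1000000
static double a[N], c[N];

/* With optimization enabled, GCC may emit a memcpy() call for this loop
 * instead of the load/store sequence; build with
 * -fno-tree-loop-distribute-patterns to keep the loop as written. */
void stream_copy(void)
{
    for (size_t j = 0; j < N; j++)
        c[j] = a[j];
}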
  • 76. remote gdb 1. get /system/bin/app_process and /system/bin/linker of the target system and necessary shared libraries, e.g., /data/data/eu.chainfire.cfbench/lib/libCFBench.so • adb pull /system/bin/app_process! • adb pull /system/bin/linker lib/armeabi-v7a/! • adb pull /data/data/eu.chainfire.cfbench/lib/libCFBench.so lib/ armeabi-v7a/! 2. arm-linux-gnueabi-gdb ./app_process 3. on the target device, attach gdbserver to the running process you wanna debug • ./gdbserver --attach :5039 3484 4. set shared library search path • (gdb) set solib-search-path /Users/freedom/tmp/cfbench/lib/armeabi-v7a 5. ‘adb forward tcp:5039 tcp:5039’ and set remote target • (gdb) target remote :5039 6. you can set breakpoints, print backtrace, disassemble, etc. 76
  • 77. • (gdb) b Java_eu_chainfire_cfbench_BenchNative_benchMemReadAligned • (gdb) disassemble Dump of assembler code for function Java_eu_chainfire_cfbench_BenchNative_benchMemReadAligned: 0x74b65848 <+0>: stmdb sp!, {r4, r5, r6, r7, r8, r9, r10, lr} => 0x74b6584c <+4>: bl 0x74b654ac <loadLib> 0x74b65850 <+8>: mov.w r0, #1048576 ; 0x100000 0x74b65854 <+12>: blx 0x74b65358 0x74b65858 <+16>: movs r6, #0 0x74b6585a <+18>: movw r9, #9999 ; 0x270f 0x74b6585e <+22>: mov r8, r0 0x74b65860 <+24>: bl 0x74b6547c <getTickCount> 0x74b65864 <+28>: add.w r5, r8, #1048576 ; 0x100000 0x74b65868 <+32>: mov r10, r0 0x74b6586a <+34>: mov r3, r8 0x74b6586c <+36>: ldr.w r2, [r3], #4 0x74b65870 <+40>: cmp r3, r5 0x74b65872 <+42>: add r4, r2 0x74b65874 <+44>: bne.n 0x74b6586c <Java_eu_chainfire_cfbench_BenchNative_benchMemReadAligned+36> 0x74b65876 <+46>: bl 0x74b6547c <getTickCount> 0x74b6587a <+50>: adds r6, #1 0x74b6587c <+52>: rsb r7, r10, r0 0x74b65880 <+56>: cmp r7, r9 77
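From the disassembly above one can reconstruct roughly what the native "aligned memory read" test does: allocate a 1 MiB buffer, then repeatedly sum it word by word between two getTickCount() calls until about 10,000 ms have elapsed. A hedged C sketch of that structure follows; the names and the timing helper are assumptions based on the symbols shown, not CFBench's source, and, like the original, the buffer is never initialized (the problem called out on slide 68).

#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>

/* Millisecond tick counter, standing in for the getTickCount() symbol
 * seen in the disassembly. */
static uint32_t tick_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (uint32_t)((uint64_t)tv.tv_sec * 1000u + tv.tv_usec / 1000);
}

/* Roughly what the r3/r5 loop in the listing does: walk a 1 MiB buffer
 * 4 bytes at a time, accumulate into a register, and count how many
 * passes fit into ~10,000 ms. */
unsigned long bench_mem_read_aligned(void)
{
    const size_t bytes = 1024 * 1024;            /* mov r0, #1048576   */
    uint32_t *buf = malloc(bytes);
    volatile uint32_t sink = 0;                  /* keep the reads alive */
    unsigned long passes = 0;

    if (!buf) return 0;
    uint32_t start = tick_ms();
    do {
        uint32_t acc = 0;
        for (size_t i = 0; i < bytes / 4; i++)
            acc += buf[i];                       /* ldr ...,[r3],#4 ; add */
        sink += acc;
        passes++;
    } while (tick_ms() - start <= 9999);         /* movw r9, #9999     */

    free(buf);
    return passes;                               /* the score scales with passes */
}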
  • 78. Quadrant • Written in Java • CPU: not really testing the CPU • Memory: profiling shows that memcpy() is heavily in use • What can we do? • optimize the JIT part of the DVM 78
  • 79. What other possible ways are there? • binary translation at • installation time • run time 79
  • 80. Wrap-up • Popular CPU and Memory benchmarks on Android mostly don’t reflect real CPU performance • We know CPU performance != System performance != user-perceived performance • There is always room for improvement 80
  • 82. Recent progress • EEMBC's AndEBench 2.0 is under development (http://www.eembc.org/press/pressrelease/130128.html) • Qualcomm asked BDTI to develop a new benchmark (http://www.qualcomm.com/media/blog/2013/08/16/mobile-benchmarking-turning-corner-user-experience) • Samsung and other vendors launched the MobileBench consortium last year • Antutu is still growing 82
  • 84. Advertisement (廣告) • MediaTek joined linaro.org last month • linaro.org is an NPO working on open-source Linux/Android-related stuff for ARM-based SoCs • So MTK is getting more open recently • And it's looking for open-source engineers • Talk to the guys at the MTK booth, or to me • There are also more non-open-source jobs 84
  • 86. Some references to understand performance benchmarking • Raj Jain, "The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling", Wiley, 1991 • "Computer Architecture: A Quantitative Approach" (the textbook quoted earlier) • A good SPEC introduction article: http://mrob.com/pub/comp/benchmarks/spec.html • Kaivalya M. Dixit, "Overview of the SPEC Benchmarks," http://people.cs.uchicago.edu/~chliu/doc/benchmark/chapter9.pdf 86
  • 87. Basic system parameters   ------------------------------------------------------------------------------   Host OS Description Mhz tlb cache mem scal   pages line par load   bytes   --------- ------------- ----------------------- ---- ----- ----- ------ ----   localhost Linux 3.4.5-g armv7l-linux-gnu 1696 7 64 4.4700 1   ! Processor, Processes - times in microseconds - smaller is better   ------------------------------------------------------------------------------   Host OS Mhz null null open slct sig sig fork exec sh   call I/O stat clos TCP inst hndl proc proc proc   --------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----   localhost Linux 3.4.5-g 1696 0.49 0.67 2.54 5.95 8.52 0.67 5.05 876. 1668 4654   ! Basic integer operations - times in nanoseconds - smaller is better   -------------------------------------------------------------------   Host OS intgr intgr intgr intgr intgr   bit add mul div mod   --------- ------------- ------ ------ ------ ------ ------   localhost Linux 3.4.5-g 1.0700 0.1100 3.4000 90.5 14.8   ! Basic float operations - times in nanoseconds - smaller is better   -----------------------------------------------------------------   87
  • 88. Context switching - times in microseconds - smaller is better   -------------------------------------------------------------------------   Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K   ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw   --------- ------------- ------ ------ ------ ------ ------ ------- -------   localhost Linux 3.4.5-g 8.9700 4.9000 6.1400 12.3 7.68000 57.6   ! *Local* Communication latencies in microseconds - smaller is better   ---------------------------------------------------------------------   Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP   ctxsw UNIX UDP TCP conn   --------- ------------- ----- ----- ---- ----- ----- ----- ----- ----   localhost Linux 3.4.5-g 8.970 17.6 23.9 47.5 71.3 357.   ! File & VM system latencies in microseconds - smaller is better   -------------------------------------------------------------------------------   Host OS 0K File 10K File Mmap Prot Page 100fd   Create Delete Create Delete Latency Fault Fault selct   --------- ------------- ------ ------ ------ ------ ------- ----- ------- -----   localhost Linux 3.4.5-g 700.0 1.259 2.55270 3.048   ! *Local* Communication bandwidths in MB/s - bigger is better   -----------------------------------------------------------------------------   Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem   88
  • 89. PARSEC content • Blackscholes: This application is an Intel RMS benchmark. It calculates the prices for a portfolio of European options analytically with the Black-Scholes partial differential equation (PDE). There is no closed-form expression for the Black-Scholes equation and as such it must be computed numerically. • Bodytrack: This computer vision application is an Intel RMS workload which tracks a human body with multiple cameras through an image sequence. This benchmark was included due to the increasing significance of computer vision algorithms in areas such as video surveillance, character animation and computer interfaces. • Canneal: This kernel was developed by Princeton University. It uses cache-aware simulated annealing (SA) to minimize the routing cost of a chip design. Canneal uses fine-grained parallelism with a lock-free algorithm and a very aggressive synchronization strategy that is based on data race recovery instead of avoidance. • Dedup: This kernel was developed by Princeton University. It compresses a data stream with a combination of global and local compression that is called 'deduplication'. The kernel uses a pipelined programming model to mimic real-world implementations. The reason for the inclusion of this kernel is that deduplication has become a mainstream method for new-generation backup storage systems. • Facesim: This Intel RMS application was originally developed by Stanford University. It computes a visually realistic animation of the modeled face by simulating the underlying physics. The workload was included in the benchmark suite because an increasing number of animations employ physical simulation to create more realistic effects. • Ferret: This application is based on the Ferret toolkit which is used for content-based similarity search. It was developed by Princeton University. The reason for the inclusion in the benchmark suite is that it represents emerging next-generation search engines for non-text document data types. In the benchmark, we have configured the Ferret toolkit for image similarity search. Ferret is parallelized using the pipeline model. 89
  • 90. PARSEC content • Fluidanimate: This Intel RMS application uses an extension of the Smoothed Particle Hydrodynamics (SPH) method to simulate an incompressible fluid for interactive animation purposes. It was included in the PARSEC benchmark suite because of the increasing significance of physics simulations for animations. • Freqmine: This application employs an array-based version of the FP-growth (Frequent Pattern-growth) method for Frequent Itemset Mining (FIMI). It is an Intel RMS benchmark which was originally developed by Concordia University. Freqmine was included in the PARSEC benchmark suite because of the increasing use of data mining techniques. • Raytrace: The Intel RMS application uses a version of the raytracing method that would typically be employed for real-time animations such as computer games. It is optimized for speed rather than realism. The computational complexity of the algorithm depends on the resolution of the output image and the scene. • Streamcluster: This RMS kernel was developed by Princeton University and solves the online clustering problem. Streamcluster was included in the PARSEC benchmark suite because of the importance of data mining algorithms and the prevalence of problems with streaming characteristics. • Swaptions: The application is an Intel RMS workload which uses the Heath-Jarrow-Morton (HJM) framework to price a portfolio of swaptions. Swaptions employs Monte Carlo (MC) simulation to compute the prices. • Vips: This application is based on the VASARI Image Processing System (VIPS) which was originally developed through several projects funded by European Union (EU) grants. The benchmark version is derived from a print-on-demand service that is offered at the National Gallery of London, which is also the current maintainer of the system. The benchmark includes fundamental image operations such as an affine transformation and a convolution. • X264 90