2. disclaimers
• many of the materials used in this slide deck are from the Internet and textbooks, e.g., many of the following materials are from “Computer Architecture: A Quantitative Approach,” 1st through 5th editions
• opinions expressed here are my own and don’t reflect my employer’s views
3. who am i
• did some networking and security research before
• working for an SoC company, recently on
• big.LITTLE scheduling and related stuff
• parallel construct evaluation
• run benchmarks from time to time
• to improve the performance of our products, and
• to know how our colleagues are progressing
4. • Focusing on CPU and memory parts of
benchmarks
• let’s ignore graphics (2d, 3d), storage I/O,
etc.
5. Blackbox
• Google image search for “benchmark”, and you will find that many of the results are Android-related benchmarks
• Similar to the recent Cross-Strait Trade in Services Agreement (TiSA), most benchmarks on the Android platform are kind of a black box
6. Is Apple A7 good?
• When Apple released the new iPhone 5s, many technical blogs showed benchmark results in their reviews
• commonly used ones:
• GeekBench
• JavaScript benchmarks
• Some graphics benchmarks
• Why? Are they the right ones? etc.
e.g., http://www.anandtech.com/show/7335/the-iphone-5s-review
12. To quote what Prof. Raj Jain quoted
• Benchmark v. trans.
To subject (a system)
to a series of tests in
order to obtain
prearranged results
not available on
competitive systems
From:“The Devil’s DP
Dictionary” S. Kelly-Bootle
13. Why benchmarking
• We did something good; let’s check if we did it right
• comparing with our own previous results to see if we broke anything
• We want to know how good our colleagues in other places are
14. What to report?
• Usually, what we mean by “benchmarking”
is to measure performance
• What to report?
• intuitive answer: how many things we do in a certain period of time
• yes, time. E.g., MIPS, MFLOPS, MiB/s, bps
15. MIPS and MFLOPS
• MIPS (Million Instructions per Second), MFLOPS (Million Floating-Point Operations per Second)
• All instructions are not created equal
– CISC machine instructions usually accomplish a lot more than those of RISC machines; comparing the instructions of a CISC machine and a RISC machine is similar to comparing Latin and Greek
16. MIPS and what’s wrong with them
• MIPS is instruction set dependent, making it difficult to compare the MIPS of computers with different ISAs
• MIPS varies between programs on the same computer; and most importantly,
• MIPS can vary inversely to performance
– with hardware FP, MIPS is generally smaller
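To see how MIPS can move in the opposite direction from performance, here is a minimal sketch. The instruction counts and run times are made-up illustration values, not measurements: the same program needs many simple instructions with software floating point and far fewer (but slower) instructions with hardware floating point.

```c
#include <assert.h>

/* MIPS = instruction count / (execution time in seconds * 10^6). */
double mips(double instructions, double seconds)
{
    assert(seconds > 0.0);
    return instructions / (seconds * 1e6);
}

/*
 * Hypothetical numbers for one program:
 *   software FP: 200 million instructions in 2.0 s -> 100.0 MIPS
 *   hardware FP:  50 million instructions in 0.8 s ->  62.5 MIPS
 * The hardware-FP run finishes 2.5x sooner yet reports the lower MIPS.
 */
```

Calling `mips(200e6, 2.0)` and `mips(50e6, 0.8)` reproduces the two figures in the comment; the faster machine has the smaller MIPS rating.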
17. MFLOPS and what’s wrong with them
• Applies only to programs with floating-point operations
• Operations instead of instructions, but still
– floating-point instructions differ between machines with different ISAs
– there are fast and slow floating-point operations
• Possible solution: weighted, source-code-level operation counts
– ADD, SUB, COMPARE: 1
– DIVIDE, SQRT: 2
– EXP, SIN: 4
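The weighting idea can be sketched as follows, using the weights listed on this slide (1, 2, 4). The struct and function names are hypothetical, and the operation counts are assumed to come from counting operations in the source code, not from hardware counters.

```c
#include <assert.h>

/* Source-level floating-point operation counts for one benchmark run. */
struct fp_op_counts {
    double add_sub_cmp; /* ADD, SUB, COMPARE: weight 1 */
    double div_sqrt;    /* DIVIDE, SQRT:      weight 2 */
    double exp_sin;     /* EXP, SIN:          weight 4 */
};

/* Weighted ("normalized") MFLOPS: weighted operation count / (seconds * 10^6). */
double normalized_mflops(const struct fp_op_counts *c, double seconds)
{
    assert(seconds > 0.0);
    double weighted = 1.0 * c->add_sub_cmp
                    + 2.0 * c->div_sqrt
                    + 4.0 * c->exp_sin;
    return weighted / (seconds * 1e6);
}
```

For example, 6 million adds, 1 million divides, and 0.5 million sines in one second count as 10 million weighted operations, i.e. 10 normalized MFLOPS, regardless of how many machine instructions each operation took.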
18. • The best choice of benchmarks to measure performance is real applications
19. Problematic benchmarks
• Kernel: small, key pieces of real applications, e.g., linpack
• Toy programs: 100-line programs from beginning programming assignments, e.g., quicksort
• Synthetic benchmarks: fake programs invented to try to match the profile and behavior of real applications, e.g., Dhrystone
20. Why are they discredited?
• Small, fit in cache
• Obsolete instruction mix
• Uncontrolled source code
• Prone to compiler tricks
• Short runtimes on modern machines
• Single-number performance characterization with a single benchmark
• Difficult to reproduce results (short runtime and low-precision UNIX timer)
22. Whetstone
• Dhrystone is a pun on Whetstone
• Source code: http://www.netlib.org/benchmark/whetstone.c
Test       MFLOPS   MOPS     ms
N1 float   119.78            0.16
N2 float   171.98            0.78
N3 if               154.25   0.67
N4 fixpt            397.48   0.79
N5 cos              19.08    4.36
N6 float   84.22             6.41
N7 equal            86.84    2.13
N8 exp              5.95     6.26
MWIPS: 463.97 (total 21.55 ms)
23. More on Synthetic benchmarks
• The best known examples of synthetic benchmarks are Whetstone and Dhrystone
• Problems:
– Compiler and hardware optimizations can artificially inflate performance of these benchmarks but not of real programs
– The other side of the coin is that because these benchmarks are not natural programs, they don’t reward optimizations of behaviors that occur in real programs
• Examples:
– Optimizing compilers can discard 25% of the Dhrystone code; examples include loops that are only executed once, making the loop overhead instructions unnecessary
– Most Whetstone floating-point loops execute small numbers of times or include calls inside the loop. These characteristics are different from many real programs
– Some more discussion in 1st edition of the textbook
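A small sketch of why synthetic loops are prone to compiler tricks. This is not Dhrystone code; `work()`, `bench_discardable()`, `bench_kept()`, and `sink` are hypothetical names. When the result of a loop is never used, an optimizing compiler is allowed to delete the loop, so the "benchmark" measures nothing; storing the result in a `volatile` variable is one common way to keep the work alive.

```c
/* Some arithmetic standing in for a benchmark kernel. */
unsigned work(unsigned x)
{
    return x * x + 1u;
}

/* The results are thrown away, so an optimizing compiler may remove
 * the calls, and possibly the whole loop, as dead code. */
void bench_discardable(unsigned iters)
{
    for (unsigned i = 0; i < iters; i++)
        (void)work(i);
}

/* Every result is stored to a volatile object; the stores are
 * observable side effects, so the compiler has to perform them. */
volatile unsigned sink;

void bench_kept(unsigned iters)
{
    for (unsigned i = 0; i < iters; i++)
        sink = work(i);
}
```

Timing `bench_discardable()` at a high optimization level can report a near-zero runtime that says more about the compiler than about the CPU.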
24. LINPACK
• LINPACK: a floating point benchmark from the manual of LINPACK library
• Source
– http://www.netlib.org/benchmark/linpackc
– http://www.netlib.org/benchmark/linpackc.new
• 883 LoC
– Size of CA15 binary compiled with bionic
• Instructions: ~ 13 KiB
text   data  bss  dec
12670  408   0    13086
26. CoreMark (1/2)
• CoreMark is a benchmark that aims to measure the performance of central processing units (CPU) used in embedded systems. It was developed in 2009 by Shay Gal-On at EEMBC and is intended to become an industry standard, replacing the antiquated Dhrystone benchmark
• The code is written in C and contains implementations of the following algorithms:
– Linked list processing,
– Matrix (mathematics) manipulation (common matrix operations),
– state machine (determine if an input stream contains valid numbers), and
– CRC
• from Wikipedia
27. CoreMark (2/2)
name              LoC
core_list_join.c  496
core_matrix.c     308
core_state.c      277
core_util.c       210
• CoreMark vs. Dhrystone
– Reporting rule
– Use of library calls, e.g., malloc(), is avoided
– CRC to make sure data are correct
• However, CoreMark is a kernel + synthetic benchmark, still with quite a small footprint
text   data  bss  dec
18632  456   20   19108
28. So?
• To overcome the danger of placing eggs in one basket, collections of benchmark applications, called benchmark suites, are a popular measure of performance of processors with a variety of applications
• Standard Performance Evaluation Corporation (SPEC)
32. How long does SPEC CPU2000 take?
• About 1 hr to compile
• Runtime: sum of base runtimes multiplied by 3
– E.g., 1.7 GHz CA15: (2256 + 3229) x 3 = 16,455 s ~= 4.57 hr
– For 1.0 GHz: 4.57 x 1.7 = 7.77 hr
– For CA7, assuming it is twice as slow: 7.77 x 2 = 15.54 hr
Benchmark  Reference Time  Base Runtime  Base Ratio
164.gzip 1400 215 652
175.vpr 1400 198 707
176.gcc 1100 94.8 1161
181.mcf 1800 266 677
186.crafty 1000 118 850
197.parser 1800 291 619
252.eon 1300 87.8 1480
253.perlbmk 1800 172 1045
254.gap 1100 107 1026
255.vortex 1900 211 899
256.bzip2 1500 203 740
300.twolf 3000 399 752
SPECint_base2000 2256 854
Benchmark  Reference Time  Base Runtime  Base Ratio
168.wupwise 1600 162 991
171.swim 3100 389 797
172.mgrid 1800 339 532
173.applu 2100 241 870
177.mesa 1400 112 1254
178.galgel 2900 201 1444
179.art 2600 195 1332
183.equake 1300 157 828
187.facerec 1900 183 1036
188.ammp 2200 353 623
189.lucas 2000 134 1491
191.fma3d 2100 212 988
200.sixtrack 1100 241 456
301.apsi 2600 310 839
SPECfp_base2000 435 3229 909.6
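The back-of-the-envelope estimate on this slide can be written down as a short sketch. Function names are hypothetical, and the clock scaling assumes runtime is inversely proportional to CPU frequency, which is only a rough approximation because memory speed does not scale with the core clock.

```c
#include <assert.h>

/* Total wall time in hours: (integer + floating-point base runtimes)
 * multiplied by the number of iterations (3 in the slide). */
double spec_run_hours(double int_base_s, double fp_base_s, int iterations)
{
    assert(iterations > 0);
    return (int_base_s + fp_base_s) * iterations / 3600.0;
}

/* Rough rescaling to another clock frequency, assuming runtime is
 * inversely proportional to frequency (an assumption, not a law). */
double scale_for_clock(double hours, double measured_ghz, double target_ghz)
{
    assert(target_ghz > 0.0);
    return hours * measured_ghz / target_ghz;
}
```

With the slide's numbers, `spec_run_hours(2256, 3229, 3)` gives about 4.57 hours, and `scale_for_clock(4.57, 1.7, 1.0)` gives about 7.77 hours.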
33. Figure 1.16 SPEC2006 programs and the evolution of the SPEC benchmarks over time, with integer programs above the line and floating-point programs below the line. Of the 12 SPEC2006 integer programs, 9 are written in C, and the rest in C++. For the floating-point programs, the split is 6 in Fortran, 4 in C++, 3 in C, and 4 in mixed C and Fortran. The figure shows all 70 of the programs in the 1989, 1992, 1995, 2000, and 2006 releases. The benchmark descriptions on the left are for SPEC2006 only and do not apply to earlier versions. Programs in the same row from different generations of SPEC are generally not related; for example, fpppp is not a CFD code like bwaves. Gcc is the senior citizen of the group. Only 3 integer programs and 3 floating-point programs survived three or more generations. Note that all the floating-point programs are new for SPEC2006. Although a few are carried over from generation to generation, the version of the program changes and either the input or the size of the benchmark is often changed to increase its running time and to avoid perturbation in measurement or domination of the execution time by some factor other than CPU time.
34. EEMBC
• Embedded Microprocessor Benchmark Consortium (EEMBC): 41 kernels used to predict performance of different embedded applications:
– Automotive/industrial
– Consumer
– Networking
– Office automation
– Telecommunication
• The 3rd edition showed some EEMBC results; the 4th edition changed its mind
• Unmodified performance and “full-fury” performance
• Kernels, reporting options
– Not a good predictor of relative performance of different embedded computers
35. Report benchmark results
• Reproducible
– Machine configuration (hardware, software (OS, compiler, etc.))
• Summarizing results
– You should not add different numbers
• Some use weighted average
– Ratio, compare with a reference machine
• Geometric mean of ratios
– The geometric mean of the ratios is the same as the ratio of the geometric means
– The ratio of the geometric means is equal to the geometric mean of the performance ratios
37. • Fallacy: Benchmarks remain valid indefinitely
– Ability to resist “benchmark engineering” or “benchmarketing”
– gcc is the only survivor from SPEC89
• Almost 70% of all programs from SPEC2000 or earlier were dropped from the next release
38. Other benchmarks
• Stream
– To test memory bandwidth
– It also tests floating-point performance
– Options on floating-point (double, 8 bytes) arrays
• copy, scale, add, triad
• lmbench
– Micro benchmark to measure software/hardware overhead from software perspective
– lmbench paper (1996), http://www.bitmover.com/lmbench/lmbench-usenix.pdf
name   kernel                bytes/iter  FLOPS/iter
COPY   a(i) = b(i)           16          0
SCALE  a(i) = q*b(i)         16          1
SUM    a(i) = b(i) + c(i)    24          1
TRIAD  a(i) = b(i) + q*c(i)  24          2
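The four STREAM kernels in the table map directly onto simple loops. The sketch below is not the STREAM source; function names are hypothetical, and a real run would use arrays much larger than the caches and time repeated passes. It only shows what each kernel does and how bytes per iteration are counted.

```c
#include <stddef.h>

/* COPY: a(i) = b(i), 2 arrays touched, 0 FLOPs per element. */
void stream_copy(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i];
}

/* SCALE: a(i) = q*b(i), 2 arrays touched, 1 FLOP per element. */
void stream_scale(double *a, const double *b, double q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = q * b[i];
}

/* SUM (add): a(i) = b(i) + c(i), 3 arrays touched, 1 FLOP per element. */
void stream_add(double *a, const double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* TRIAD: a(i) = b(i) + q*c(i), 3 arrays touched, 2 FLOPs per element. */
void stream_triad(double *a, const double *b, const double *c,
                  double q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}

/* Bytes moved by one pass: elements * 8 bytes * arrays touched. */
double stream_bytes_moved(size_t n, unsigned arrays)
{
    return (double)n * (double)sizeof(double) * (double)arrays;
}
```

Bandwidth for a kernel is then `stream_bytes_moved(n, arrays)` divided by the measured time of one pass, with `arrays` equal to 2 for COPY and SCALE and 3 for SUM and TRIAD, matching the 16 and 24 bytes/iter columns.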
40. lmbench
• lmbench is a micro-benchmark suite designed to focus attention on the basic building blocks of many common system applications, such as databases, simulations, software development, and networking
41. Parallel? Let’s look at other SPEC benchmarks
• SPECapc for 3ds Max™ 2011, performance evaluation software for systems running Autodesk 3ds Max 2011.
• SPECapc(SM) for Lightwave 3D 9.6, performance evaluation software for systems running NewTek LightWave 3D v9.6 software.
• SPECjbb2005, evaluates the performance of server side Java by emulating a three-tier client/server system (with emphasis on the middle tier).
• SPECjEnterprise2010, a multi-tier benchmark for measuring the performance of Java 2 Enterprise Edition (J2EE) technology-based application servers.
• SPECjms2007, Java Message Service performance
• SPECjvm2008, measuring basic Java performance of a Java Runtime Environment on a wide variety of both client and server systems.
• SPECapc, performance of several 3D-intensive popular applications on a given system
• SPEC MPI2007, for evaluating performance of parallel systems using MPI (Message Passing Interface) applications.
• SPEC OMP2001 V3.2, for evaluating performance of parallel systems using OpenMP (http://www.openmp.org) applications.
• SPECpower_ssj2008, evaluates the energy efficiency of server systems.
• SPECsfs2008, File server throughput and response time supporting both NFS and CIFS protocol access
• SPECsip_Infrastructure2011, SIP server performance
• SPECviewperf 11, performance of an OpenGL 3D graphics system, tested with various rendering tasks from real applications
• SPECvirt_sc2010 ("SPECvirt"), evaluates the performance of datacenter servers used in virtualized server consolidation
42. PARSEC
• The Princeton Application Repository for Shared-Memory Computers (PARSEC) is a benchmark suite composed of multithreaded programs. The suite focuses on emerging workloads and was designed to be representative of next-generation shared-memory programs for chip-multiprocessors
• Haven’t really used it yet
• http://parsec.cs.princeton.edu/
               Parallelization Model
Workload       Pthreads  OpenMP  Intel TBB
blackscholes   Yes       Yes     Yes
bodytrack      Yes       Yes     Yes
canneal        Yes       No      No
dedup          Yes       No      No
facesim        Yes       No      No
ferret         Yes       No      No
fluidanimate   Yes       No      Yes
freqmine       No        Yes     No
raytrace       Yes       No      No
streamcluster  Yes       No      Yes
swaptions      Yes       No      Yes
vips           Yes       No      No
x264           Yes       No      No
43. Is Dhrystone useful?
• Yes, if you know the limitations of such benchmarks
• Don’t market those benchmark scores as if they meant real user-perceived performance
47. Different items
• Example, GeekBench 3
• Arithmetic mean with different weights? How?
• Good properties of geometric mean
48. Source code
• So far, everything we have talked about is software with source code available, either publicly/freely, e.g., Dhrystone, or for a small amount of $, e.g., SPEC CPU
51. Back to Android
• What kinds of benchmarks are available or used to compare performance?
• Apps with native benchmarks: Antutu, GeekBench
• Java apps, e.g., Quadrant
• Hybrid: with both native and Java, e.g., AndEBench and CF-Bench
• We also use SPEC CPU2000 and other stuff internally
52. Ars Technica List
arrayOfPackageInfo[0] = new PackageInfo("com.aurorasoftworks.quadrant.ui.standard", false);
arrayOfPackageInfo[1] = new PackageInfo("com.aurorasoftworks.quadrant.ui.advanced", false);
arrayOfPackageInfo[2] = new PackageInfo("com.aurorasoftworks.quadrant.ui.professional", false);
arrayOfPackageInfo[3] = new PackageInfo("com.redlicense.benchmark.sqlite", false);
arrayOfPackageInfo[4] = new PackageInfo("com.antutu.ABenchMark", false);
arrayOfPackageInfo[5] = new PackageInfo("com.greenecomputing.linpack", false);
arrayOfPackageInfo[6] = new PackageInfo("com.greenecomputing.linpackpro", false);
arrayOfPackageInfo[7] = new PackageInfo("com.glbenchmark.glbenchmark27", false);
arrayOfPackageInfo[8] = new PackageInfo("com.glbenchmark.glbenchmark25", false);
arrayOfPackageInfo[9] = new PackageInfo("com.glbenchmark.glbenchmark21", false);
arrayOfPackageInfo[10] = new PackageInfo("ca.primatelabs.geekbench2", false);
arrayOfPackageInfo[11] = new PackageInfo("com.eembc.coremark", false);
arrayOfPackageInfo[12] = new PackageInfo("com.flexycore.caffeinemark", false);
arrayOfPackageInfo[13] = new PackageInfo("eu.chainfire.cfbench", false);
arrayOfPackageInfo[14] = new PackageInfo("gr.androiddev.BenchmarkPi", false);
arrayOfPackageInfo[15] = new PackageInfo("com.smartbench.twelve", false);
arrayOfPackageInfo[16] = new PackageInfo("com.passmark.pt_mobile", false);
arrayOfPackageInfo[17] = new PackageInfo("se.nena.nenamark2", false);
arrayOfPackageInfo[18] = new PackageInfo("com.samsung.benchmarks", false);
arrayOfPackageInfo[19] = new PackageInfo("com.samsung.benchmarks:db", false);
arrayOfPackageInfo[20] = new PackageInfo("com.samsung.benchmarks:es1", false);
arrayOfPackageInfo[21] = new PackageInfo("com.samsung.benchmarks:es2", false);
arrayOfPackageInfo[22] = new PackageInfo("com.samsung.benchmarks:g2d", false);
arrayOfPackageInfo[23] = new PackageInfo("com.samsung.benchmarks:fs", false);
arrayOfPackageInfo[24] = new PackageInfo("com.samsung.benchmarks:ks", false);
arrayOfPackageInfo[25] = new PackageInfo("com.samsung.benchmarks:cpu
CPU and Memory related: Quadrant, Antutu, linpack, GeekBench, AndEBench (coremark), CaffeineMark, Pi, PassMark, Samsung’s benchmark
53. Antutu 3.x
• CPU: integer, floating point
• memory: RAM
• Graphics: 2D, 3D
• I/O: Database, SD read, SD write
• What are you benchmarking?
• What’s your workload?
• How are scores calculated?
54. What on earth are they
doing?
• Actually, there is no publicly available information
• But with good enough background knowledge and proper tools (we’ll talk about these later), we can figure it out
• It turns out most of them are from the BYTE nbench (http://en.wikipedia.org/wiki/NBench)
55. AnTuTu 3.x CPU and Memory Tests
nbench item       Used by Antutu  Antutu part  Antutu percentage on progress bar  Order  nbench category
NUMERIC SORT      yes             Integer      27%                                4      integer
STRING SORT       yes             RAM          1%                                 1      memory
BITFIELD          yes             RAM          1%                                 2      memory
FP EMULATION      no
FOURIER           yes             floating     47%                                7      floating point
ASSIGNMENT        yes             RAM          8%                                 3      memory
IDEA              yes             Integer      27%                                5      integer
HUFFMAN           yes             Integer      34%                                6      integer
NEURAL NET        no
LU DECOMPOSITION  no
56. A closer look
▪ RAM
– String sort:
• string Heap sort: StrHeapSort()
• MoveMemory() → memmove()
– Bit Field:
• Bit field test: DoBitops()
– Assignment:
• Task Assignment test: DoAssignment()
▪ Integer
– Numeric sort:
• Numeric heap sort: NumHeapSort()
– IDEA:
• IDEA encryption and decryption: cipher_idea()
– Huffman:
• Huffman encoding
▪ Floating point:
– Fourier:
• Fourier transform: pow(), sin(), cos()
57. String Sort in NBench
• Sorts an array of strings of arbitrary length
• Tests memory movement performance
• Non-sequential performance of cache, with added burden that moves are byte-wide and can occur on odd address boundaries
for(i=top; i>0; --i)
{
    strsift(optrarray,strarray,numstrings,0,i);

    /* temp = string[0] */
    tlen=*strarray;
    MoveMemory((farvoid *)&temp[0], /* Perform exchange */
        (farvoid *)strarray,
        (unsigned long)(tlen+1));

    /* string[0]=string[i] */
    tlen=*(strarray+*(optrarray+i));
    stradjust(optrarray,strarray,numstrings,0,tlen);
    MoveMemory((farvoid *)strarray,
        (farvoid *)(strarray+*(optrarray+i)),
        (unsigned long)(tlen+1));

    /* string[i]=temp */
    tlen=temp[0];
    stradjust(optrarray,strarray,numstrings,i,tlen);
    MoveMemory((farvoid *)(strarray+*(optrarray+i)),
        (farvoid *)&temp[0],
        (unsigned long)(tlen+1));

}
58. Bit field in NBench
• Executes 3 bit manipulation functions
• Exercises "bit twiddling" performance. Travels through
memory bit-by-bit in a sequential fashion; different from sorts
in that data is merely altered in place
• Operations:
• Set: OR 1
• Clear: AND 0
• Toggle: XOR
• Set, clear: ToggleBitRun()
• Toggle: FlipBitRun()
static void ToggleBitRun(farulong *bitmap, /* Bitmap */
ulong bit_addr, /* Address of bits to set */
ulong nbits, /* # of bits to set/clr */
uint val) /* 1 or 0 */
{
unsigned long bindex; /* Index into array */
unsigned long bitnumb; /* Bit number */
while(nbits--)
{
#ifdef LONG64
bindex=bit_addr>>6; /* Index is number /64 */
bitnumb=bit_addr % 64; /* Bit number in word */
#else
bindex=bit_addr>>5; /* Index is number /32 */
bitnumb=bit_addr % 32; /* bit number in word */
#endif
if(val)
bitmap[bindex]|=(1L<<bitnumb);
else
bitmap[bindex]&=~(1L<<bitnumb);
bit_addr++;
}
return;
}
59. Assignment in NBench
• The test moves through
large integer arrays in both
row-wise and column-wise
fashion. Cache/memory
with good sequential
performance should see a
boost (memory is altered in
place -- no moving as in a
sort operation)
• Yes, basically, sequential
array assignment with some
kind of table look-ups
/*
** Step through rows. For each one that is not currently
** assigned, see if the row has only one zero in it. If so,
** mark that as an assigned row/col. Eliminate other zeros
** in the same column.
*/
for(i=0;i<ASSIGNROWS;i++)
{ numzeros=0;
for(j=0;j<ASSIGNCOLS;j++)
if(tableau[i][j]==0L)
if(assignedtableau[i][j]==0)
{ numzeros++;
selected=j;
}
if(numzeros==1)
{ numassigns++;
totnumassigns++;
assignedtableau[i][selected]=1;
for(k=0;k<ASSIGNROWS;k++)
if((k!=i) &&
(tableau[k][selected]==0))
assignedtableau[k][selected]=2;
}
}
60. Numeric Sort in NBench
• Sorts an array of long integers with heap sort
• Generic integer performance. Should exercise non-sequential performance of cache (or memory if cache is less than 8K). Moves 32-bit longs at a time, so 16-bit processors will be at a disadvantage
static void NumHeapSort(farlong *array,
        ulong bottom, /* Lower bound */
        ulong top) /* Upper bound */
{
ulong temp; /* Used to exchange elements */
ulong i; /* Loop index */

/*
** First, build a heap in the array
*/
for(i=(top/2L); i>0; --i)
    NumSift(array,i,top);

/*
** Repeatedly extract maximum from heap and place it at the
** end of the array. When we get done, we'll have a sorted
** array.
*/
for(i=top; i>0; --i)
{   NumSift(array,bottom,i);
    temp=*array; /* Perform exchange */
    *array=*(array+i);
    *(array+i)=temp;
}
return;
}
61. IDEA Encryption in NBench
• IDEA: a new block cipher when nbench was in development
• Moves through data sequentially in 16-bit chunks
static void cipher_idea(u16 in[4],
        u16 out[4],
        register IDEAkey Z)
{
register u16 x1, x2, x3, x4, t1, t2;
/* register u16 t16;
register u16 t32; */
int r=ROUNDS;

x1=*in++;
x2=*in++;
x3=*in++;
x4=*in;

do {
    MUL(x1,*Z++);
    x2+=*Z++;
    x3+=*Z++;
    MUL(x4,*Z++);

    t2=x1^x3;
    MUL(t2,*Z++);
    t1=t2+(x2^x4);
    MUL(t1,*Z++);
    t2=t1+t2;

    x1^=t1;
    x4^=t2;

    t2^=x2;
    x2=x3^t1;
    x3=t2;
} while(--r);
MUL(x1,*Z++);
*out++=x1;
*out++=x3+*Z++;
*out++=x2+*Z++;
MUL(x4,*Z);
*out=x4;
return;
}
62. Huffman in NBench
• Everybody knows Huffman code, right?
• A combination of byte operations, bit twiddling, and overall integer
manipulation
.....
/*
** Huffman tree built...compress the plaintext
*/
bitoffset=0L; /* Initialize bit offset */
for(i=0;i<arraysize;i++)
{
c=(int)plaintext[i]; /* Fetch character */
/*
** Build a bit string for byte c
*/
bitstringlen=0;
while(hufftree[c].parent!=-2)
{ if(hufftree[hufftree[c].parent].left==c)
bitstring[bitstringlen]='0';
else
bitstring[bitstringlen]='1';
c=hufftree[c].parent;
bitstringlen++;
}
.....
63. Fourier in NBench
• No, not FFT
• Good measure of transcendental and trigonometric performance of FPU. Little array activity, so this test should not be dependent on cache or memory architecture
static double thefunction(double x, /* Independent variable */
        double omegan, /* Omega * term */
        int select) /* Choose term */
{
/*
** Use select to pick which function we call.
*/
switch(select)
{
    case 0: return(pow(x+(double)1.0,x));
    case 1: return(pow(x+(double)1.0,x) * cos(omegan * x));
    case 2: return(pow(x+(double)1.0,x) * sin(omegan * x));
}
.....
64. Neural Net in NBench
• A small but functional back-propagation network simulator
• Small-array floating-point test heavily dependent on the exponential function; less dependent on overall FPU performance
65. LU Decomposition in NBench
• LU Decomposition
• Yes, the LU decomposition
you learned in linear
algebra
• A floating-point test that
moves through arrays in
both row-wise and
column-wise fashion.
Exercises only fundamental
math operations (+, -, *, /)
66. GeekBench
• A cross-platform one
• The only publicly available one we could use to compare Android, iOS, and other platforms
• Quite clearly described test items
• http://support.primatelabs.com/kb/geekbench/geekbench-3-benchmarks
• Explaining how to interpret results
• http://support.primatelabs.com/kb/geekbench/interpreting-geekbench-3-scores
• Source code available if you pay
67. Vellamo
• HTML5
• Metal: Dhrystone, Linpack, Branch-K, Stream 5.9, RamJam, Storage
• some are well-known; some are written by Quic?
• Anyway, all of them are described at http://www.quicinc.com/vellamo/test-descriptions/
68. CFBench
• Used by some people, ’cause
• it tests both Java and native versions
• its author is quite active in the xda-developers forum
• Some problems
• no good description of tests
• some code is wrong, e.g.,
• its Native Memory Read test is not testing memory read, ’cause the malloc()ed array is not initialized
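Why an uninitialized buffer spoils a memory read test: on Linux, freshly allocated pages are typically not backed by distinct physical memory until they are first written, so reading them may not exercise DRAM at all (and in C the values read are indeterminate anyway). The sketch below is not CF-Bench code; the function names are hypothetical. It shows the usual fix: write every byte before the timed read, and consume the data so the compiler cannot drop the reads.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Allocate n bytes and write every one of them, so that each page is
 * really backed by memory before any timed read starts. */
unsigned char *alloc_initialized(size_t n, unsigned char fill)
{
    unsigned char *buf = malloc(n);
    if (buf != NULL)
        memset(buf, fill, n);
    return buf;
}

/* Read the whole buffer; returning the sum keeps the loads live. */
unsigned long read_buffer(const unsigned char *buf, size_t n)
{
    unsigned long sum = 0;
    assert(buf != NULL);
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}
```

A read test would call `alloc_initialized()` once outside the timed region, then time repeated calls to `read_buffer()` on a buffer larger than the last-level cache.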
71. • In the good old days, we had source code; we compiled and ran benchmark programs
• In the current Android ecosystem
• Usually we don’t have source
• Profiling: oprofile, perf, DS-5
• profiling sometimes doesn’t report the real bottleneck function, e.g., static functions are usually inlined and don’t have symbols in shipped binaries
• binutils: nm, readelf, objdump, gdb
• Improving libraries, e.g., libc and libm, and runtime systems, e.g., the JIT of Dalvik, used by those benchmarks
72. Antutu 3.x
• memmove() in bionic --> bcopy() in C
• rewrite with NEON assembly code
• pow(), sin(), cos() in C
• rewrite them with assembly
73. bcopy() in bionic
• MoveMemory() in nbench -> memmove() in bionic -> bcopy() in bionic
• memcpy() is in assembly in bionic, and there are processor-specific ones (CA9, CA15, Krait). NEON (vector load/store) helps
• not for bcopy()
in bionic/libc/bionic/memmove.c
void *memmove(void *dst, const void *src, size_t n)
{
const char *p = src;
char *q = dst;
/* We can use the optimized memcpy if the source and destination
* don't overlap.
*/
if (__builtin_expect(((q < p) && ((size_t)(p - q) >= n))
|| ((p < q) && ((size_t)(q - p) >= n)), 1)) {
return memcpy(dst, src, n);
} else {
bcopy(src, dst, n);
return dst;
}
}
in bionic/libc/string/bcopy.c
/*
* Copy a block of memory, handling overlap.
* This is the routine that actually implements
* (the portable versions of) bcopy, memcpy, and memmove.
*/
#ifdef MEMCOPY
void *
memcpy(void *dst0, const void *src0, size_t length)
#else
#ifdef MEMMOVE
void *
memmove(void *dst0, const void *src0, size_t length)
#else
void
bcopy(const void *src0, void *dst0, size_t length)
#endif
#endif
{
.....
74. Antutu 3.x
• For people with source code
• Selection of toolchain and compiler options may make a huge difference, e.g., bit field
• Some version of the x86 binary for Antutu 3.x was compiled with the Intel compiler; bit-by-bit operations turned into word-wide (32-bit) operations, and the speedup is about 70x
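To illustrate the kind of transformation described above, here is a sketch that sets a run of bits two ways. It is written for this note, not taken from nbench or from any compiler output, and the function names are hypothetical. The first version touches one bit per iteration, like nbench's ToggleBitRun() with val set to 1; the second builds one mask per word, so a long run needs only one OR per word.

```c
#include <limits.h>
#include <stddef.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Bit-by-bit: one read-modify-write per bit. */
void set_run_bitwise(unsigned long *bitmap, size_t bit_addr, size_t nbits)
{
    while (nbits--) {
        bitmap[bit_addr / WORD_BITS] |= 1UL << (bit_addr % WORD_BITS);
        bit_addr++;
    }
}

/* Word-wide: one read-modify-write per word covered by the run. */
void set_run_wordwise(unsigned long *bitmap, size_t bit_addr, size_t nbits)
{
    while (nbits > 0) {
        size_t offset = bit_addr % WORD_BITS;      /* first bit in this word */
        size_t room = WORD_BITS - offset;          /* bits left in this word */
        size_t take = nbits < room ? nbits : room; /* bits set this round */
        /* A full-word shift would be undefined, so handle it separately. */
        unsigned long mask = (take == WORD_BITS)
            ? ~0UL
            : ((1UL << take) - 1UL) << offset;
        bitmap[bit_addr / WORD_BITS] |= mask;
        bit_addr += take;
        nbits -= take;
    }
}
```

Both functions produce the same bitmap; the word-wide one does far fewer memory operations on long runs, which is consistent with the large speedup reported on this slide, although the exact code the compiler generated is not shown in the deck.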
76. remote gdb
1. get /system/bin/app_process and /system/bin/linker of the target system and necessary shared libraries, e.g., /data/data/eu.chainfire.cfbench/lib/libCFBench.so
• adb pull /system/bin/app_process
• adb pull /system/bin/linker lib/armeabi-v7a/
• adb pull /data/data/eu.chainfire.cfbench/lib/libCFBench.so lib/armeabi-v7a/
2. arm-linux-gnueabi-gdb ./app_process
3. on the target device, attach gdbserver to the running process you want to debug
• ./gdbserver --attach :5039 3484
4. set shared library search path
• (gdb) set solib-search-path /Users/freedom/tmp/cfbench/lib/armeabi-v7a
5. ‘adb forward tcp:5039 tcp:5039’ and set remote target
• (gdb) target remote :5039
6. you can set breakpoints, print backtrace, disassemble, etc.
78. Quadrant
• Written in Java
• CPU: Not really testing CPU
• Memory: profiling shows that memcpy() is heavily used
• What can we do
• optimize the JIT part of the DVM
80. Wrap-up
• Popular CPU and Memory benchmarks on
Android mostly don’t reflect real CPU
performance
• We know CPU performance != System
performance != user-perceived
performance
• There is always room for improvement
82. Recent progress
• EEMBC’s AndEBench 2.0 is under development (http://www.eembc.org/press/pressrelease/130128.html)
• Qualcomm asked BDTi to develop a new benchmark (http://www.qualcomm.com/media/blog/2013/08/16/mobile-benchmarking-turning-corner-user-experience).
• Samsung with other vendors launched MobileBench Consortium last year
• Antutu is still growing
84. Advertisement
• MediaTek joined linaro.org last month
• linaro.org is an NPO working on open source Linux/Android related stuff for ARM-based SoCs
• So MTK is getting more open recently
• And, it’s looking for open source engineers
• Talk to guys at MTK booth or me
• There are more non-open source jobs
86. Some References to Understand Performance Benchmark
• Raj Jain, “The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling”, Wiley, 1991
• “Computer Architecture: A Quantitative Approach”
• A good SPEC introduction article, http://mrob.com/pub/comp/benchmarks/spec.html
• Kaivalya M. Dixit, “Overview of the SPEC Benchmarks,” http://people.cs.uchicago.edu/~chliu/doc/benchmark/chapter9.pdf
87. Basic system parameters
------------------------------------------------------------------------------
Host OS Description Mhz tlb cache mem scal
pages line par load
bytes
--------- ------------- ----------------------- ---- ----- ----- ------ ----
localhost Linux 3.4.5-g armv7l-linux-gnu 1696 7 64 4.4700 1
Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host OS Mhz null null open slct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
localhost Linux 3.4.5-g 1696 0.49 0.67 2.54 5.95 8.52 0.67 5.05 876. 1668 4654
Basic integer operations - times in nanoseconds - smaller is better
-------------------------------------------------------------------
Host OS intgr intgr intgr intgr intgr
bit add mul div mod
--------- ------------- ------ ------ ------ ------ ------
localhost Linux 3.4.5-g 1.0700 0.1100 3.4000 90.5 14.8
Basic float operations - times in nanoseconds - smaller is better
-----------------------------------------------------------------
88. Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
localhost Linux 3.4.5-g 8.9700 4.9000 6.1400 12.3 7.68000 57.6
*Local* Communication latencies in microseconds - smaller is better
---------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
localhost Linux 3.4.5-g 8.970 17.6 23.9 47.5 71.3 357.
File & VM system latencies in microseconds - smaller is better
-------------------------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page 100fd
Create Delete Create Delete Latency Fault Fault selct
--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
localhost Linux 3.4.5-g 700.0 1.259 2.55270 3.048
*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
89. PARSEC content
• Blackscholes
This application is an Intel RMS benchmark. It calculates the prices for a portfolio of European options analytically with the Black-Scholes partial differential equation (PDE). There is no closed-form expression for the Black-Scholes equation and as such it must be computed numerically.
• Bodytrack
This computer vision application is an Intel RMS workload which tracks a human body with multiple cameras through an image sequence. This benchmark was included due to the increasing significance of computer vision algorithms in areas such as video surveillance, character animation and computer interfaces.
• Canneal
This kernel was developed by Princeton University. It uses cache-aware simulated annealing (SA) to minimize the routing cost of a chip design. Canneal uses fine-grained parallelism with a lock-free algorithm and a very aggressive synchronization strategy that is based on data race recovery instead of avoidance.
• Dedup
This kernel was developed by Princeton University. It compresses a data stream with a combination of global and local compression that is called 'deduplication'. The kernel uses a pipelined programming model to mimic real-world implementations. The reason for the inclusion of this kernel is that deduplication has become a mainstream method for new-generation backup storage systems.
• Facesim
This Intel RMS application was originally developed by Stanford University. It computes a visually realistic animation of the modeled face by simulating the underlying physics. The workload was included in the benchmark suite because an increasing number of animations employ physical simulation to create more realistic effects.
• Ferret
This application is based on the Ferret toolkit which is used for content-based similarity search. It was developed by Princeton University. The reason for the inclusion in the benchmark suite is that it represents emerging next-generation search engines for non-text document data types. In the benchmark, we have configured the Ferret toolkit for image similarity search. Ferret is parallelized using the pipeline model.
90. PARSEC content
• Fluidanimate
This Intel RMS application uses an extension of the Smoothed Particle Hydrodynamics (SPH) method to simulate an incompressible fluid for interactive animation purposes. It was included in the PARSEC benchmark suite because of the increasing significance of physics simulations for animations.
• Freqmine
This application employs an array-based version of the FP-growth (Frequent Pattern-growth) method for Frequent Itemset Mining (FIMI). It is an Intel RMS benchmark which was originally developed by Concordia University. Freqmine was included in the PARSEC benchmark suite because of the increasing use of data mining techniques.
• Raytrace
The Intel RMS application uses a version of the raytracing method that would typically be employed for real-time animations such as computer games. It is optimized for speed rather than realism. The computational complexity of the algorithm depends on the resolution of the output image and the scene.
• Streamcluster
This RMS kernel was developed by Princeton University and solves the online clustering problem. Streamcluster was included in the PARSEC benchmark suite because of the importance of data mining algorithms and the prevalence of problems with streaming characteristics.
• Swaptions
The application is an Intel RMS workload which uses the Heath-Jarrow-Morton (HJM) framework to price a portfolio of swaptions. Swaptions employs Monte Carlo (MC) simulation to compute the prices.
• Vips
This application is based on the VASARI Image Processing System (VIPS) which was originally developed through several projects funded by European Union (EU) grants. The benchmark version is derived from a print on demand service that is offered at the National Gallery of London, which is also the current maintainer of the system. The benchmark includes fundamental image operations such as an affine transformation and a convolution.
• X264