Understanding Android Benchmarks

“freedom” koan-sin tan

freedom@computer.org

OSDC.tw,Taipei

Apr 11th, 2014
1

disclaimers
• many of the materials used in this slide
deck are from the Internet and textbooks,
e.g., many of the following materials are
from “Computer Architecture: A
Quantitative Approach,” 1st ~ 5th ed

• opinions expressed here are my own and
don't reflect my employer's views
2

who am i
• did some networking and security research before

• working for a SoC company, recently on

• big.LITTLE scheduling and related stuff

• parallel construct evaluation

• run benchmarking from time to time

• for improving performance of our products, and

• knowing how our colleagues are progressing
3

• Focusing on CPU and memory parts of
benchmarks

• let’s ignore graphics (2d, 3d), storage I/O,
etc.
4

Blackbox
• Google image search for “benchmark” and you
will find that many of the results are
Android-related benchmarks

• Like the recent Cross-Strait Trade in
Services Agreement (TiSA), most
benchmarks on the Android platform are
kind of a black box
5

Is Apple A7 good?
• When Apple released the new
iPhone 5s, many technical
blogs published benchmark
results in their reviews

• commonly used ones:

• GeekBench

• JavaScript benchmarks

• Some graphics benchmarks

• Why? Are they the right ones? etc.
e.g., http://www.anandtech.com/show/7335/the-iphone-5s-review
6

No, not improvement in this way
9
http://www.anandtech.com/show/7384/state-of-cheating-in-android-benchmarks

Assuming there is no
cheating, what can we
do?

Outline
• Performance benchmark review
• Some Android benchmarks

• What we did and what still can be done

• Future
11

To quote what Prof. Raj Jain quoted
• Benchmark v. trans.
To subject (a system)
to a series of tests in
order to obtain
prearranged results
not available on
competitive systems
From:“The Devil’s DP
Dictionary” S. Kelly-Bootle
12

Why benchmarking
• We did something good, let's check if we did
it right

• comparing with own previous results to
see if we break anything

• We want to know how good our
colleagues in other places are
13

What to report?
• Usually, what we mean by “benchmarking”
is to measure performance

• What to report?

• intuitive answer: how many things we do
in certain period of time

• yes, time. E.g., MIPS, MFLOPS, MiB/s, bps
14

MIPS and MFLOPS
• MIPS (Million Instructions per Second), MFLOPS (Million Floating-Point Operations per Second)

• All instructions are not created equal

– CISC machine instructions usually accomplish a lot more than those of RISC machines; comparing the instructions of a CISC machine and a RISC machine is similar to comparing Latin and Greek
15

MIPS and what's wrong with them
• MIPS is instruction set dependent, making it difficult to compare the MIPS of computers with different ISAs

• MIPS varies between programs on the same computer; and most importantly,

• MIPS can vary inversely to performance

– with hardware FP, MIPS is generally smaller: one FP instruction replaces many simple software-emulation instructions, so the program runs faster while the instruction rate drops
16

MFLOPS and what's wrong with them
• Applies only to programs with floating-point operations

• Operations instead of instructions, but still

– floating-point instructions differ between machines with different ISAs

– there are fast and slow floating-point operations

• Possible solution: weighted, source-code-level operation counts

– ADD, SUB, COMPARE: 1

– DIVIDE, SQRT: 2

– EXP, SIN: 4
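
The weighting scheme above can be sketched in C. This is only an illustration under stated assumptions: the names `fp_op`, `fp_weight`, and `normalized_mflops` are hypothetical, and the source-level operation counts are assumed to come from inspecting the benchmark source by hand.

```c
#include <stddef.h>

/* Operation classes taken from the slide; the enum itself is hypothetical. */
enum fp_op { FP_ADD, FP_SUB, FP_COMPARE, FP_DIVIDE, FP_SQRT, FP_EXP, FP_SIN };

/* Weight of one source-level operation, as listed on the slide. */
static int fp_weight(enum fp_op op)
{
    switch (op) {
    case FP_ADD: case FP_SUB: case FP_COMPARE: return 1;
    case FP_DIVIDE: case FP_SQRT:              return 2;
    case FP_EXP: case FP_SIN:                  return 4;
    }
    return 0;
}

/* counts[i] is how many times ops[i] is executed, counted at source level.
 * Returns weighted ("normalized") MFLOPS for a run that took `seconds`. */
double normalized_mflops(const enum fp_op *ops, const unsigned long *counts,
                         size_t n, double seconds)
{
    double weighted = 0.0;
    for (size_t i = 0; i < n; i++)
        weighted += (double)fp_weight(ops[i]) * (double)counts[i];
    return weighted / (seconds * 1e6);
}
```

With this scheme a program that executes one million ADDs, one million DIVIDEs, and one million SINs in one second is credited with 7 normalized MFLOPS rather than 3.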
17

• The best choice of benchmarks to measure performance is real applications
18

Problematic benchmarks
• Kernel: small, key pieces of real applications, e.g., linpack

• Toy programs: 100-line programs from beginning programming assignments, e.g., quicksort

• Synthetic benchmarks: fake programs invented to try to match the profile and behavior of real applications, e.g., Dhrystone
19

Why are they discredited?
• Small, fit in cache

• Obsolete instruction mix

• Uncontrolled source code

• Prone to compiler tricks

• Short runtimes on modern machines

• Single-number performance characterization with a single benchmark

• Difficult to reproduce results (short runtime and low-precision UNIX timer)
20

Dhrystone
• Source

– http://homepages.cwi.nl/~steven/dry.c

• < 1000 LoC

– Size of CA15 binary compiled with bionic

• Instructions: ~14 KiB

 text   data    bss    dec
13918    467  10266  24660
21

Whetstone
• Dhrystone is a pun on Whetstone

• Source code: http://www.netlib.org/benchmark/whetstone.c

Test        MFLOPS    MOPS     ms
N1 float    119.78            0.16
N2 float    171.98            0.78
N3 if                154.25   0.67
N4 fixpt             397.48   0.79
N5 cos                19.08   4.36
N6 float     84.22            6.41
N7 equal              86.84   2.13
N8 exp                 5.95   6.26
MWIPS       463.97           21.55
22

More on Synthetic benchmarks
• The best known examples of synthetic benchmarks are Whetstone and Dhrystone

• Problems:

– Compiler and hardware optimizations can artificially inflate performance of these benchmarks but not of real programs

– The other side of the coin is that because these benchmarks are not natural programs, they don't reward optimizations of behaviors that occur in real programs

• Examples:

– Optimizing compilers can discard 25% of the Dhrystone code; examples include loops that are only executed once, making the loop overhead instructions unnecessary

– Most Whetstone floating-point loops execute small numbers of times or include calls inside the loop. These characteristics are different from many real programs

– Some more discussion in the 1st edition of the textbook
23

LINPACK
• LINPACK: a floating-point benchmark from the manual of the LINPACK library

• Source

– http://www.netlib.org/benchmark/linpackc

– http://www.netlib.org/benchmark/linpackc.new

• 883 LoC

– Size of CA15 binary compiled with bionic

• Instructions: ~13 KiB

 text   data    bss    dec
12670    408      0  13086
24

CoreMark (1/2)
• CoreMark is a benchmark that aims to measure the performance of central processing units (CPU) used in embedded systems. It was developed in 2009 by Shay Gal-On at EEMBC and is intended to become an industry standard, replacing the antiquated Dhrystone benchmark

• The code is written in C and contains implementations of the following algorithms:

– Linked list processing,

– Matrix (mathematics) manipulation (common matrix operations),

– state machine (determine if an input stream contains valid numbers), and

– CRC

• from Wikipedia
26

CoreMark (2/2)
name              LoC
core_list_join.c  496
core_matrix.c     308
core_stat.c       277
core_util.c       210

• CoreMark vs. Dhrystone

– Reporting rule

– Use of library calls, e.g., malloc(), is avoided

– CRC to make sure data are correct

• However, CoreMark is a kernel + synthetic benchmark, and its footprint is still quite small

 text   data    bss    dec
18632    456     20  19108
27

So?
• To overcome the danger of placing all eggs in one basket, collections of benchmark applications, called benchmark suites, are a popular measure of the performance of processors across a variety of applications

• Standard Performance Evaluation Corporation (SPEC)
28

Why CPU2000 in 2010s?
• Why does ARM stick with SPEC CPU2000 instead of CPU2006?

– 1999 Q4 results, the earliest available CPU2000 results (http://www.spec.org/cpu2000/results/res1999q4/)

• CINT2000 base: 133 – 424

• CFP2000 base: 126 – 514

– 2005 Opteron 144, 1.8 GHz

• 1,440 (CA15 at 1.9 GHz, as reported by NVIDIA, is 1,168)

– CPU2006 requires much more DRAM; 1 GiB of DRAM is not enough

name         CA9  CA7  CA15  Krait
SPECint2000  356  320  537   326
SPECfp2000   298  236  567   350
All normalized to 1.0 GHz
30

SPEC numbers from Quantitative Approach 5th Edition
31

How long does SPEC CPU2000 take?
• About 1 hr to compile

• Runtime: sum of base runtimes multiplied by 3

– E.g., 1.7 GHz CA15: (2256 + 3229) x 3 = 16,455 s ~= 4.57 hr

– For 1.0 GHz: 4.57 x 1.7 = 7.77 hr

– For CA7, assuming it is 2x slower: 7.77 x 2 = 15.54 hr

Benchmark         Reference Time  Base Runtime  Base Ratio
164.gzip          1400            215           652
175.vpr           1400            198           707
176.gcc           1100            94.8          1161
181.mcf           1800            266           677
186.crafty        1000            118           850
197.parser        1800            291           619
252.eon           1300            87.8          1480
253.perlbmk       1800            172           1045
254.gap           1100            107           1026
255.vortex        1900            211           899
256.bzip2         1500            203           740
300.twolf         3000            399           752
SPECint_base2000                  2256          854

Benchmark         Reference Time  Base Runtime  Base Ratio
168.wupwise       1600            162           991
171.swim          3100            389           797
172.mgrid         1800            339           532
173.applu         2100            241           870
177.mesa          1400            112           1254
178.galgel        2900            201           1444
179.art           2600            195           1332
183.equake        1300            157           828
187.facerec       1900            183           1036
188.ammp          2200            353           623
189.lucas         2000            134           1491
191.fma3d         2100            212           988
200.sixtrack      1100            241           456
301.apsi          2600            310           839
SPECfp_base2000 435 3229 909.6
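
The runtime estimate on this slide is simple arithmetic, sketched below in C. The function names are hypothetical, and scaling to another clock assumes runtime is inversely proportional to frequency, which is the same simplification the slide makes.

```c
/* Estimated wall-clock hours for a full SPEC CPU2000 base run:
 * (sum of integer base runtimes + sum of FP base runtimes) x iterations. */
double spec_run_hours(double int_base_seconds, double fp_base_seconds,
                      int iterations)
{
    return (int_base_seconds + fp_base_seconds) * iterations / 3600.0;
}

/* Rescale a measured duration to a different clock frequency,
 * assuming runtime is inversely proportional to clock speed. */
double scale_to_clock(double hours, double measured_ghz, double target_ghz)
{
    return hours * measured_ghz / target_ghz;
}
```

For the 1.7 GHz CA15 numbers above, `spec_run_hours(2256.0, 3229.0, 3)` gives about 4.57 hours, and `scale_to_clock` of that value from 1.7 GHz to 1.0 GHz gives about 7.77 hours.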
32

Figure 1.16 SPEC2006 programs and the evolution of the SPEC benchmarks over time, with integer programs above the line and floating-point programs below the line. Of the 12 SPEC2006 integer programs, 9 are written in C, and the rest in C++. For the floating-point programs, the split is 6 in Fortran, 4 in C++, 3 in C, and 4 in mixed C and Fortran. The figure shows all 70 of the programs in the 1989, 1992, 1995, 2000, and 2006 releases. The benchmark descriptions on the left are for SPEC2006 only and do not apply to earlier versions. Programs in the same row from different generations of SPEC are generally not related; for example, fpppp is not a CFD code like bwaves. Gcc is the senior citizen of the group. Only 3 integer programs and 3 floating-point programs survived three or more generations. Note that all the floating-point programs are new for SPEC2006. Although a few are carried over from generation to generation, the version of the program changes and either the input or the size of the benchmark is often changed to increase its running time and to avoid perturbation in measurement or domination of the execution time by some factor other than CPU time.

33

EEMBC
• Embedded Microprocessor Benchmark Consortium (EEMBC): 41 kernels used to predict performance of different embedded applications:

– Automotive/industrial

– Consumer

– Networking

– Office automation

– Telecommunication

• The 3rd edition showed some EEMBC results; the 4th edition changed its mind

• Unmodified performance and “full-fury” performance

• Kernel, reporting options

– Not a good predictor of relative performance of different embedded computers
34

Report benchmark results
• Reproducible

– Machine configuration (hardware, software (OS, compiler, etc.))

• Summarizing results

– You should not add different numbers

• Some use weighted average

– Ratio, compare with a reference machine

• Geometric ratio

– The geometric mean of the ratios is the same as the ratio of the geometric means

– The ratio of the geometric means is equal to the geometric mean of the performance ratios, so the choice of reference machine does not matter
35

• Fallacy: Benchmarks remain valid indefinitely

– Ability to resist “benchmark engineering” or “benchmarketing”

– gcc is the only survivor from SPEC89

• Almost 70% of all programs from SPEC2000 or earlier were dropped from the next release
37

Other benchmarks
• Stream

– To test memory bandwidth

– It also tests floating-point performance

– Operations on floating-point (double, 8 bytes) arrays

• copy, scale, add, triad

• lmbench

– Micro benchmark to measure software/hardware overhead from a software perspective

– lmbench paper (1996), http://www.bitmover.com/lmbench/lmbench-usenix.pdf

name   kernel                 bytes/iter  FLOPS/iter
COPY   a(i) = b(i)            16          0
SCALE  a(i) = q*b(i)          16          1
SUM    a(i) = b(i) + c(i)     24          1
TRIAD  a(i) = b(i) + q*c(i)   24          2
38

Stream 5.10
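for (k=0; k<NTIMES; k++)
    {
    /* Copy */
    times[0][k] = mysecond();
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        c[j] = a[j];
    times[0][k] = mysecond() - times[0][k];

    /* Scale, Add, and Triad are timed the same way (times[1..3][k]);
       only their loop bodies are shown here */
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        b[j] = scalar*c[j];             /* Scale */
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        c[j] = a[j]+b[j];               /* Add */
    for (j=0; j<STREAM_ARRAY_SIZE; j++)
        a[j] = b[j]+scalar*c[j];        /* Triad */
    }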
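
To make the Triad measurement concrete, here is a small self-contained C sketch in the spirit of STREAM. It is not STREAM itself: the function name `triad_mb_per_s` is hypothetical, it uses the standard `clock()` (CPU time) instead of STREAM's `mysecond()`, and it reports the best of `ntimes` runs using the 24 bytes/iteration figure from the table on the previous slide.

```c
#include <stdlib.h>
#include <time.h>

/* Returns the best observed Triad bandwidth in MB/s (10^6 bytes/s),
 * or -1.0 if allocation fails or no run could be timed. */
double triad_mb_per_s(size_t n, int ntimes)
{
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    const double scalar = 3.0;
    volatile double sink = 0.0;   /* keeps the stores observable */
    double best = -1.0;

    if (!a || !b || !c || n == 0) {
        free(a); free(b); free(c);
        return -1.0;
    }

    /* Touch every element first so all pages are really allocated. */
    for (size_t j = 0; j < n; j++) {
        a[j] = 1.0; b[j] = 2.0; c[j] = 0.5;
    }

    for (int k = 0; k < ntimes; k++) {
        clock_t t0 = clock();
        for (size_t j = 0; j < n; j++)
            a[j] = b[j] + scalar * c[j];          /* Triad kernel */
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        sink += a[(size_t)k % n];
        if (t > 0.0 && (best < 0.0 || t < best))
            best = t;
    }

    free(a); free(b); free(c);
    /* Two 8-byte loads and one 8-byte store per iteration. */
    return best > 0.0 ? 24.0 * (double)n / best / 1e6 : -1.0;
}
```

The array size should be much larger than the last-level cache, otherwise the number reflects cache bandwidth rather than memory bandwidth.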
39

lmbench
• lmbench is a micro-benchmark suite designed to focus attention on the basic building blocks of many common system applications, such as databases, simulations, software development, and networking
40

Parallel? Let's look at other SPEC benchmarks
• SPECapc for 3ds Max™ 2011, performance evaluation software for systems running Autodesk 3ds Max 2011.

• SPECapc(SM) for Lightwave 3D 9.6, performance evaluation software for systems running NewTek LightWave 3D v9.6 software.

• SPECjbb2005, evaluates the performance of server side Java by emulating a three-tier client/server system (with emphasis on the middle tier).

• SPECjEnterprise2010, a multi-tier benchmark for measuring the performance of Java 2 Enterprise Edition (J2EE) technology-based application servers.

• SPECjms2007, Java Message Service performance

• SPECjvm2008, measuring basic Java performance of a Java Runtime Environment on a wide variety of both client and server systems.

• SPECapc, performance of several 3D-intensive popular applications on a given system

• SPEC MPI2007, for evaluating performance of parallel systems using MPI (Message Passing Interface) applications.

• SPEC OMP2001 V3.2, for evaluating performance of parallel systems using OpenMP (http://www.openmp.org) applications.

• SPECpower_ssj2008, evaluates the energy efficiency of server systems.

• SPECsfs2008, file server throughput and response time supporting both NFS and CIFS protocol access

• SPECsip_Infrastructure2011, SIP server performance

• SPECviewperf 11, performance of an OpenGL 3D graphics system, tested with various rendering tasks from real applications

• SPECvirt_sc2010 ("SPECvirt"), evaluates the performance of datacenter servers used in virtualized server consolidation
41

PARSEC
• The Princeton Application Repository for Shared-Memory Computers (PARSEC) is a benchmark suite composed of multithreaded programs. The suite focuses on emerging workloads and was designed to be representative of next-generation shared-memory programs for chip-multiprocessors

• Haven't really used it yet

• http://parsec.cs.princeton.edu/

               Parallelization Model
Workload       Pthreads  OpenMP  Intel TBB
blackscholes   Yes       Yes     Yes
bodytrack      Yes       Yes     Yes
canneal        Yes       No      No
dedup          Yes       No      No
facesim        Yes       No      No
ferret         Yes       No      No
fluidanimate   Yes       No      Yes
freqmine       No        Yes     No
raytrace       Yes       No      No
streamcluster  Yes       No      Yes
swaptions      Yes       No      Yes
vips           Yes       No      No
x264           Yes       No      No
42

Are benchmarks like Dhrystone useful?
• Yes, if you know their limitations

• Don't market them as if those benchmarks
reflected real user-perceived performance
43

A7 Dhrystone (DMIPS/MHz)
            iPhone 5s  iPhone 5s 32-bit  CA15  CA7   Krait 400
DMIPS/MHz   7.47       5.70              2.71  1.67  2.46
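
For readers unfamiliar with the unit in this chart: DMIPS is conventionally the Dhrystones-per-second score divided by 1757, the score of the VAX 11/780 reference machine, and DMIPS/MHz divides that by the clock. A minimal C sketch, with a hypothetical function name:

```c
/* Convert a raw Dhrystone score to DMIPS/MHz.
 * 1757 Dhrystones/s is the VAX 11/780 reference score (1 DMIPS). */
double dmips_per_mhz(double dhrystones_per_second, double clock_mhz)
{
    return dhrystones_per_second / 1757.0 / clock_mhz;
}
```

A core scoring 1,757,000 Dhrystones per second at 1000 MHz would therefore be rated at 1.0 DMIPS/MHz.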
44

A7 linpack MFLOPS (MFLOPS/GHz)
             iPhone 5s  iPhone 5s 32-bit  CA15  CA7  Krait 400
MFLOPS/GHz   722        723               449   119  299
45

A7 CoreMark (CoreMark/MHz)
               iPhone 5s  iPhone 5s 32-bit  CA15  CA7   Krait 400
CoreMark/MHz   5.72       4.45              3.67  2.46  3.30
46

Different items
• Example, GeekBench 3

• Arithmetic mean with different weight?
How?

• Good properties of geometric mean
47

Source code
• So far, everything we have talked about is
software with source code available, either
publicly/freely, e.g., Dhrystone, or for a small
amount of $, e.g., SPEC CPU
48

• Benchmark scores/results usually depend
on the compiler, compiler flags, processors, and
systems
49

Outline


• Future
50

Back to Android
• What kinds of Benchmarks are available, or used to
compare performance

• Apps with native benchmarks: Antutu, GeekBench

• Java apps, e.g., Quadrant

• Hybrid: with both native and Java, e.g., AndEBench
and CF-Bench

• We also use SPEC CPU2000 and other stuff
internally
51

Ars Technica List
arrayOfPackageInfo[0] = new PackageInfo("com.aurorasoftworks.quadrant.ui.standard", false);
= new PackageInfo("com.aurorasoftworks.quadrant.ui.advanced", false);
= new PackageInfo("com.aurorasoftworks.quadrant.ui.professional", false);
= new PackageInfo("com.redlicense.benchmark.sqlite", false);
= new PackageInfo("com.antutu.ABenchMark", false);
= new PackageInfo("com.greenecomputing.linpack", false);
= new PackageInfo("com.greenecomputing.linpackpro", false);
= new PackageInfo("com.glbenchmark.glbenchmark27", false);
= new … false);
= new … false);
= new PackageInfo("ca.primatelabs.geekbench2", false);
= new PackageInfo("com.eembc.coremark", false);
= new PackageInfo("com.flexycore.caffeinemark", false);
= new PackageInfo("eu.chainfire.cfbench", false);
= new PackageInfo("gr.androiddev.BenchmarkPi", false);
= new PackageInfo("com.smartbench.twelve", false);
= new PackageInfo("com.passmark.pt_mobile", false);
= new PackageInfo("se.nena.nenamark2", false);
= new PackageInfo("com.samsung.benchmarks", false);
= new PackageInfo("com.samsung.benchmarks:db", false);
= new PackageInfo("com.samsung.benchmarks:es1", false);
= new PackageInfo("com.samsung.benchmarks:es2", false);
= new PackageInfo("com.samsung.benchmarks:g2d", false);
= new PackageInfo("com.samsung.benchmarks:fs", false);
= new PackageInfo("com.samsung.benchmarks:ks", false);
= new PackageInfo("com.samsung.benchmarks:cpu

CPU and Memory related: Quadrant, Antutu,
linpack, GeekBench, AndEBench (coremark),
CaffeineMark, Pi, PassMark, Samsung’s benchmark
52

Antutu 3.x
• CPU: integer, floating point

• memory: RAM

• Graphics: 2D, 3D

• I/O: Database, SD read, SD write

• What are you benchmarking?

• What's your workload?

• How to calculate scores
53

What on earth are they
doing?
• Actually, there is no publicly available information

• But, with good enough background
knowledge and proper tools (we'll talk
about these later), we can figure it out

• It turns out most of them are from the
BYTE nbench (http://en.wikipedia.org/wiki/NBench)
54

AnTuTu 3.x CPU and Memory Tests
nbench item       Used by Antutu  Antutu part  Antutu percentage on progress bar  Order  nbench category
NUMERIC SORT      yes             Integer      27%                                4      integer
STRING SORT       yes             RAM          1%                                 1      memory
BITFIELD          yes             RAM          1%                                 2      memory
FP EMULATION      no
FOURIER           yes             floating     47%                                7      floating point
ASSIGNMENT        yes             RAM          8%                                 3      memory
IDEA              yes             Integer      27%                                5      integer
HUFFMAN           yes             Integer      34%                                6      integer
NEURAL NET        no
LU DECOMPOSITION  no
55

A closer look
▪ RAM
– String sort:
• String heap sort: StrHeapSort()
• MoveMemory() -> memmove()
– Bit Field:
• Bit field test: DoBitops()
– Assignment:
• Task Assignment test: DoAssignment()
▪ Integer
– Numeric sort:
• Numeric heap sort: NumHeapSort()
– IDEA:
• IDEA encryption and decryption: cipher_idea()
– Huffman:
• Huffman encoding
▪ Floating point:
– Fourier:
• Fourier transform: pow(), sin(), cos()
56

String Sort in NBench
• Sorts an array of strings
of arbitrary length

• Tests memory movement
performance

• Non-sequential
performance of cache,
with the added burden that
moves are byte-wide and
can occur on odd
address boundaries
for(i=top; i>0; --i)
{
    strsift(optrarray,strarray,numstrings,0,i);

    /* temp = string[0] */
    tlen=*strarray;
    MoveMemory((farvoid *)&temp[0], /* Perform exchange */
        (farvoid *)strarray,
        (unsigned long)(tlen+1));

    /* string[0]=string[i] */
    tlen=*(strarray+*(optrarray+i));
    stradjust(optrarray,strarray,numstrings,0,tlen);
    MoveMemory((farvoid *)strarray,
        (farvoid *)(strarray+*(optrarray+i)),
        (unsigned long)(tlen+1));

    /* string[i]=temp */
    tlen=temp[0];
    stradjust(optrarray,strarray,numstrings,i,tlen);
    MoveMemory((farvoid *)(strarray+*(optrarray+i)),
        (farvoid *)&temp[0],
        (unsigned long)(tlen+1));
}
57

Bit field in NBench
• Executes 3 bit manipulation functions
• Exercises "bit twiddling" performance. Travels through
memory bit-by-bit in a sequential fashion; different from sorts
in that data is merely altered in place
• Operations:
• Set: OR 1
• Clear: AND 0
• Toggle: XOR
• Set, clear: ToggleBitRun()
• Toggle: FlipBitRun()
static void ToggleBitRun(farulong *bitmap, /* Bitmap */
        ulong bit_addr,     /* Address of bits to set */
        ulong nbits,        /* # of bits to set/clr */
        uint val)           /* 1 or 0 */
{
    unsigned long bindex;   /* Index into array */
    unsigned long bitnumb;  /* Bit number */

    while(nbits--)
    {
#ifdef LONG64
        bindex=bit_addr>>6;     /* Index is number /64 */
        bitnumb=bit_addr % 64;  /* Bit number in word */
#else
        bindex=bit_addr>>5;     /* Index is number /32 */
        bitnumb=bit_addr % 32;  /* bit number in word */
#endif
        if(val)
            bitmap[bindex]|=(1L<<bitnumb);
        else
            bitmap[bindex]&=~(1L<<bitnumb);
        bit_addr++;
    }
    return;
}
58

Assignment in NBench
• The test moves through
large integer arrays in both
row-wise and column-wise
fashion. Cache/memory
with good sequential
performance should see a
boost (memory is altered in
place -- no moving as in a
sort operation)

• Yes, basically, sequential
array assignment with some
kind of table look-ups
/*
** Step through rows. For each one that is not currently
** assigned, see if the row has only one zero in it. If so,
** mark that as an assigned row/col. Eliminate other zeros
** in the same column.
*/
for(i=0;i<ASSIGNROWS;i++)
{   numzeros=0;
    for(j=0;j<ASSIGNCOLS;j++)
        if(tableau[i][j]==0L)
            if(assignedtableau[i][j]==0)
            {   numzeros++;
                selected=j;
            }
    if(numzeros==1)
    {   numassigns++;
        totnumassigns++;
        assignedtableau[i][selected]=1;
        for(k=0;k<ASSIGNROWS;k++)
            if((k!=i) &&
               (tableau[k][selected]==0))
                assignedtableau[k][selected]=2;
    }
}
59

Numeric Sort in NBench
• Sorts an array of long
integers with heap sort

• Generic integer
performance. Should
exercise non-sequential
performance of cache
(or memory if cache is
less than 8K). Moves
32-bit longs at a time, so
16-bit processors will be
at a disadvantage
static void NumHeapSort(farlong *array,
        ulong bottom,   /* Lower bound */
        ulong top)      /* Upper bound */
{
    ulong temp;         /* Used to exchange elements */
    ulong i;            /* Loop index */

    /*
    ** First, build a heap in the array
    */
    for(i=(top/2L); i>0; --i)
        NumSift(array,i,top);

    /*
    ** Repeatedly extract maximum from heap and place it at the
    ** end of the array. When we get done, we'll have a sorted
    ** array.
    */
    for(i=top; i>0; --i)
    {   NumSift(array,bottom,i);
        temp=*array;            /* Perform exchange */
        *array=*(array+i);
        *(array+i)=temp;
    }
    return;
}
60

IDEA Encryption in NBench
• IDEA: a new block
cipher when nbench was
in development

• Moves through data
sequentially in 16-bit
chunks
static void cipher_idea(u16 in[4],
        u16 out[4],
        register IDEAkey Z)
{
    register u16 x1, x2, x3, x4, t1, t2;
    /* register u16 t16;
    register u16 t32; */
    int r=ROUNDS;

    x1=*in++;
    x2=*in++;
    x3=*in++;
    x4=*in;

    do {
        MUL(x1,*Z++);
        x2+=*Z++;
        x3+=*Z++;
        MUL(x4,*Z++);

        t2=x1^x3;
        MUL(t2,*Z++);
        t1=t2+(x2^x4);
        MUL(t1,*Z++);
        t2=t1+t2;

        x1^=t1;
        x4^=t2;

        t2^=x2;
        x2=x3^t1;
        x3=t2;
    } while(--r);
    MUL(x1,*Z++);
    *out++=x1;
    *out++=x3+*Z++;
    *out++=x2+*Z++;
    MUL(x4,*Z);
    *out=x4;
    return;
}
61

Huffman in NBench
• Everybody knows Huffman code, right?

• A combination of byte operations, bit twiddling, and overall integer
manipulation
.....
/*
** Huffman tree built...compress the plaintext
*/
bitoffset=0L;                   /* Initialize bit offset */
for(i=0;i<arraysize;i++)
{
    c=(int)plaintext[i];        /* Fetch character */
    /*
    ** Build a bit string for byte c
    */
    bitstringlen=0;
    while(hufftree[c].parent!=-2)
    {   if(hufftree[hufftree[c].parent].left==c)
            bitstring[bitstringlen]='0';
        else
            bitstring[bitstringlen]='1';
        c=hufftree[c].parent;
        bitstringlen++;
    }
.....
62

Fourier in NBench
• No, not FFT

• Good measure of transcendental and trigonometric performance of the FPU. Little array
activity, so this test should not depend on cache or memory architecture
static double thefunction(double x,     /* Independent variable */
        double omegan,                  /* Omega * term */
        int select)                     /* Choose term */
{
    /*
    ** Use select to pick which function we call.
    */
    switch(select)
    {
        case 0: return(pow(x+(double)1.0,x));
        case 1: return(pow(x+(double)1.0,x) * cos(omegan * x));
        case 2: return(pow(x+(double)1.0,x) * sin(omegan * x));
    }
    return(0.0);    /* not reached; keeps compilers quiet */
}
63

Neural Net in NBench
• A small but functional
back-propagation network
simulator

• Small-array floating-point
test heavily dependent
on the exponential
function; less dependent
on overall FPU
performance
64

LU Decomposition in NBench
• LU Decomposition

• Yes, the LU decomposition
you learned in linear
algebra

• A floating-point test that
moves through arrays in
both row-wise and
column-wise fashion.
Exercises only fundamental
math operations (+, -, *, /)
65

GeekBench
• A cross-platform one

• The only publicly available one we could use to compare
Android, iOS, and other platforms

• Quite clearly described test items

• http://support.primatelabs.com/kb/geekbench/geekbench-3-benchmarks

• Explaining how to interpret results

• http://support.primatelabs.com/kb/geekbench/interpreting-geekbench-3-scores

• Source code available if you pay
66

Vellamo
• HTML5

• Metal: Dhrystone, Linpack, Branch-K, Stream
5.9, RamJam, Storage

• some are well-known; some are written
by Quic?

• Anyway, all of them are described at
http://www.quicinc.com/vellamo/test-descriptions/
67

CFBench
• Used by some people, because

• it tests both Java and native versions

• its author is quite active on the XDA Developers forum

• Some problems

• no good description of tests

• some code is wrong, e.g.,

• its Native Memory Read test is not really testing memory
reads, because the malloc()ed array is never initialized
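
Why initialization matters: on Linux, pages of a large fresh allocation that have never been written are usually all backed by one shared zero page, so a read loop over them mostly measures cache hits on that single page. A hedged sketch of a read loop that avoids this is shown below; the function name and the `fill` parameter are hypothetical, and this is not CF-Bench's code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocates `bytes`, writes every byte so each page gets real backing
 * memory, then reads the buffer back word by word.
 * Returns the 32-bit wrap-around sum, or 0 if allocation fails. */
uint32_t read_initialized_buffer(size_t bytes, unsigned char fill)
{
    uint32_t *buf = malloc(bytes);
    size_t nwords = bytes / sizeof *buf;
    uint32_t sum = 0;

    if (!buf)
        return 0;

    memset(buf, fill, bytes);       /* the step the original test lacks */

    for (size_t i = 0; i < nwords; i++)
        sum += buf[i];              /* this loop is what should be timed */

    free(buf);
    return sum;                     /* returning the sum keeps the loads live */
}
```

Only the read loop should sit between the two timer calls; the `memset()` belongs outside the timed region.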
68

Outline


• Future
69

How do we improve
benchmark
performance
70

• In the good old days, we had source code; we compiled and ran
benchmark programs

• In the current Android ecosystem

• Usually we don't have source

• Profiling: oprofile, perf, DS-5

• profiling sometimes doesn't report the real bottleneck
function, e.g., static functions are usually inlined and don't
have symbols in shipped binaries

• binutils: nm, readelf, objdump, gdb

• Improving libraries, e.g., libc and libm, and runtime systems, e.g.,
the JIT of Dalvik, used by those benchmarks
71

Antutu 3.x
• memmove() in bionic --> bcopy() in C

• rewrite with NEON assembly code

• pow(), sin(), cos() in C

• rewrite them with assembly
72

bcopy() in bionic
• MoveMemory() in nbench
-> memmove() in bionic
-> bcopy() in bionic

• memcpy() is written in assembly in
bionic, and there are
processor-specific versions
(CA9, CA15, Krait).
NEON (vector load/store) helps

• not for bcopy()
in bionic/libc/bionic/memmove.c
void *memmove(void *dst, const void *src, size_t n)
{
    const char *p = src;
    char *q = dst;
    /* We can use the optimized memcpy if the source and destination
     * don't overlap.
     */
    if (__builtin_expect(((q < p) && ((size_t)(p - q) >= n))
            || ((p < q) && ((size_t)(q - p) >= n)), 1)) {
        return memcpy(dst, src, n);
    } else {
        bcopy(src, dst, n);
        return dst;
    }
}
in bionic/libc/string/bcopy.c
/*
 * Copy a block of memory, handling overlap.
 * This is the routine that actually implements
 * (the portable versions of) bcopy, memcpy, and memmove.
 */
#ifdef MEMCOPY
void *
memcpy(void *dst0, const void *src0, size_t length)
#else
#ifdef MEMMOVE
void *
memmove(void *dst0, const void *src0, size_t length)
#else
void
bcopy(const void *src0, void *dst0, size_t length)
#endif
#endif
{
.....
73

Antutu 3.x
• For people with source code

• Selection of toolchain and compiler options
may cause a huge difference, e.g., bit field

• Some versions of the x86 binary for Antutu
3.x were compiled with Intel's compiler;
bit-by-bit operations were turned into
word-wide (32-bit) operations, and the
speedup is about 70x
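
To illustrate what "bit-by-bit turned into word-wide" means, here is a hand-written C sketch of a word-wide version of nbench's ToggleBitRun(). It is not the code the compiler generated, only an equivalent transformation written out by hand; the function name and the use of `uint32_t` are assumptions.

```c
#include <stdint.h>

/* Set (val != 0) or clear (val == 0) nbits bits starting at bit_addr,
 * touching each 32-bit word at most once instead of once per bit. */
void toggle_bit_run_wordwise(uint32_t *bitmap, unsigned long bit_addr,
                             unsigned long nbits, unsigned val)
{
    while (nbits > 0) {
        unsigned long bindex = bit_addr >> 5;     /* word index */
        unsigned bitnumb = bit_addr & 31;         /* first bit within word */
        unsigned long chunk = 32 - bitnumb;       /* bits left in this word */
        uint32_t mask;

        if (chunk > nbits)
            chunk = nbits;

        /* chunk == 32 only when bitnumb == 0; avoid shifting by 32. */
        mask = (chunk == 32) ? 0xFFFFFFFFu
                             : ((((uint32_t)1 << chunk) - 1u) << bitnumb);

        if (val)
            bitmap[bindex] |= mask;
        else
            bitmap[bindex] &= ~mask;

        bit_addr += chunk;
        nbits -= chunk;
    }
}
```

A long run of bits now costs roughly one read-modify-write per word rather than one per bit, which is consistent with a speedup of tens of times on this test.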
74

Stream copy usually turned into
memcpy()
75

remote gdb
1. get /system/bin/app_process and /system/bin/linker from the target system, plus the necessary
shared libraries, e.g., /data/data/eu.chainfire.cfbench/lib/libCFBench.so

• adb pull /system/bin/app_process
• adb pull /system/bin/linker lib/armeabi-v7a/
• adb pull /data/data/eu.chainfire.cfbench/lib/libCFBench.so lib/armeabi-v7a/

2. arm-linux-gnueabi-gdb ./app_process

3. on the target device, attach gdbserver to the running process you want to debug

• ./gdbserver --attach :5039 3484

4. set shared library search path

• (gdb) set solib-search-path /Users/freedom/tmp/cfbench/lib/armeabi-v7a

5. ‘adb forward tcp:5039 tcp:5039’ and set remote target

• (gdb) target remote :5039

6. you can set breakpoints, print backtrace, disassemble, etc.
76

• (gdb) b Java_eu_chainfire_cfbench_BenchNative_benchMemReadAligned

• (gdb) disassemble
Dump of assembler code for function Java_eu_chainfire_cfbench_BenchNative_benchMemReadAligned:

0x74b65848 <+0>: stmdb sp!, {r4, r5, r6, r7, r8, r9, r10, lr}

=> 0x74b6584c <+4>: bl 0x74b654ac <loadLib>

0x74b65850 <+8>: mov.w r0, #1048576 ; 0x100000

0x74b65854 <+12>: blx 0x74b65358

0x74b65858 <+16>: movs r6, #0

0x74b6585a <+18>: movw r9, #9999 ; 0x270f

0x74b6585e <+22>: mov r8, r0

0x74b65860 <+24>: bl 0x74b6547c <getTickCount>

0x74b65864 <+28>: add.w r5, r8, #1048576 ; 0x100000

0x74b65868 <+32>: mov r10, r0

0x74b6586a <+34>: mov r3, r8

0x74b6586c <+36>: ldr.w r2, [r3], #4

0x74b65870 <+40>: cmp r3, r5

0x74b65872 <+42>: add r4, r2

0x74b65874 <+44>: bne.n 0x74b6586c <Java_eu_chainfire_cfbench_BenchNative_benchMemReadAligned+36>

0x74b65876 <+46>: bl 0x74b6547c <getTickCount>

0x74b6587a <+50>: adds r6, #1

0x74b6587c <+52>: rsb r7, r10, r0

0x74b65880 <+56>: cmp r7, r9

77

Quadrant
• Written in Java

• CPU: Not really testing CPU

• Memory: profiling shows that memcpy() is
heavily used

• What can we do

• optimize the JIT part of the DVM
78

What other possible
ways?
• binary translation during

• installation time

• run time
79

Wrap-up
• Popular CPU and Memory benchmarks on
Android mostly don't reflect real CPU
performance

• We know CPU performance != System
performance != user-perceived
performance

• There is always room for improvement
80

Recent progress
• EEMBC's AndEBench 2.0 is under development
(http://www.eembc.org/press/pressrelease/130128.html)

• Qualcomm asked BDTI to develop a new benchmark
(http://www.qualcomm.com/media/blog/2013/08/16/mobile-benchmarking-turning-corner-user-experience).

• Samsung and other vendors launched the MobileBench
Consortium last year

• Antutu is still growing
82

Advertisement
• MediaTek joined
linaro.org last month

• linaro.org is an NPO
working on open source
Linux/Android related
stuff for ARM-based
SoCs

• So MTK is getting more
open recently

• And, it’s looking for
open source engineers

• Talk to the guys at the MTK
booth or to me

• There are more non-
open source jobs
84

Some References to Understand
Performance Benchmark
• Raj Jain, “The Art of Computer Systems Performance
Analysis: Techniques for Experimental Design,
Measurement, Simulation, and Modeling”, Wiley, 1991

• Quantitative Approach

• A good SPEC introduction article,
http://mrob.com/pub/comp/benchmarks/spec.html

• Kaivalya M. Dixit, “Overview of the SPEC
Benchmarks,”
http://people.cs.uchicago.edu/~chliu/doc/benchmark/chapter9.pdf
86

Basic system parameters

------------------------------------------------------------------------------

Host OS Description Mhz tlb cache mem scal

pages line par load

bytes

--------- ------------- ----------------------- ---- ----- ----- ------ ----

localhost Linux 3.4.5-g armv7l-linux-gnu 1696 7 64 4.4700 1

Processor, Processes - times in microseconds - smaller is better

------------------------------------------------------------------------------

Host OS Mhz null null open slct sig sig fork exec sh

call I/O stat clos TCP inst hndl proc proc proc

--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----

localhost Linux 3.4.5-g 1696 0.49 0.67 2.54 5.95 8.52 0.67 5.05 876. 1668 4654

Basic integer operations - times in nanoseconds - smaller is better

-------------------------------------------------------------------

Host OS intgr intgr intgr intgr intgr

bit add mul div mod

--------- ------------- ------ ------ ------ ------ ------

localhost Linux 3.4.5-g 1.0700 0.1100 3.4000 90.5 14.8

Basic float operations - times in nanoseconds - smaller is better

-----------------------------------------------------------------

87

Context switching - times in microseconds - smaller is better

-------------------------------------------------------------------------

Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K

ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw

--------- ------------- ------ ------ ------ ------ ------ ------- -------

localhost Linux 3.4.5-g 8.9700 4.9000 6.1400 12.3 7.68000 57.6

*Local* Communication latencies in microseconds - smaller is better

---------------------------------------------------------------------

Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP

ctxsw UNIX UDP TCP conn

--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----

localhost Linux 3.4.5-g 8.970 17.6 23.9 47.5 71.3 357.

!
File & VM system latencies in microseconds - smaller is better

-------------------------------------------------------------------------------

Host OS 0K File 10K File Mmap Prot Page 100fd

Create Delete Create Delete Latency Fault Fault selct

--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----

localhost Linux 3.4.5-g 700.0 1.259 2.55270 3.048

!
*Local* Communication bandwidths in MB/s - bigger is better

-----------------------------------------------------------------------------

Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem

88

PARSEC content
• Blackscholes
This application is an Intel RMS benchmark. It calculates the prices for a portfolio of European options analytically with the Black-Scholes partial differential equation (PDE). There is no closed-form expression for the Black-Scholes equation and as such it must be computed numerically.

• Bodytrack
This computer vision application is an Intel RMS workload which tracks a human body with multiple cameras through an image sequence. This benchmark was included due to the increasing significance of computer vision algorithms in areas such as video surveillance, character animation and computer interfaces.

• Canneal
This kernel was developed by Princeton University. It uses cache-aware simulated annealing (SA) to minimize the routing cost of a chip design. Canneal uses fine-grained parallelism with a lock-free algorithm and a very aggressive synchronization strategy that is based on data race recovery instead of avoidance.

• Dedup
This kernel was developed by Princeton University. It compresses a data stream with a combination of global and local compression that is called 'deduplication'. The kernel uses a pipelined programming model to mimic real-world implementations. The reason for the inclusion of this kernel is that deduplication has become a mainstream method for new-generation backup storage systems.

• Facesim
This Intel RMS application was originally developed by Stanford University. It computes a visually realistic animation of the modeled face by simulating the underlying physics. The workload was included in the benchmark suite because an increasing number of animations employ physical simulation to create more realistic effects.

• Ferret
This application is based on the Ferret toolkit which is used for content-based similarity search. It was developed by Princeton University. The reason for the inclusion in the benchmark suite is that it represents emerging next-generation search engines for non-text document data types. In the benchmark, we have configured the Ferret toolkit for image similarity search. Ferret is parallelized using the pipeline model.
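
To make the Blackscholes item concrete: for a single European call option without dividends, the analytic price is given by the Black-Scholes formula. The sketch below is illustrative only and is not PARSEC's code; `norm_cdf`, `bs_call_price` and the parameter names are hypothetical.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_price(spot, strike, rate, vol, t):
    """Black-Scholes price of a European call without dividends.

    spot:   current price of the underlying
    strike: strike price
    rate:   continuously compounded risk-free rate (per year)
    vol:    volatility (per sqrt(year))
    t:      time to expiry in years (must be > 0)
    """
    sqrt_t = math.sqrt(t)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * t) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * t) * norm_cdf(d2)


if __name__ == "__main__":
    print(f"{bs_call_price(100.0, 100.0, 0.05, 0.2, 1.0):.4f}")
```

A benchmark of this kind evaluates such a formula independently for every option in the portfolio, which is why the work parallelizes easily across options.
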
89

PARSEC content
• Fluidanimate
This Intel RMS application uses an extension of the Smoothed Particle Hydrodynamics (SPH) method to simulate an incompressible fluid for interactive animation purposes. It was included in the PARSEC benchmark suite because of the increasing significance of physics simulations for animations.

• Freqmine
This application employs an array-based version of the FP-growth (Frequent Pattern-growth) method for Frequent Itemset Mining (FIMI). It is an Intel RMS benchmark which was originally developed by Concordia University. Freqmine was included in the PARSEC benchmark suite because of the increasing use of data mining techniques.

• Raytrace
The Intel RMS application uses a version of the raytracing method that would typically be employed for real-time animations such as computer games. It is optimized for speed rather than realism. The computational complexity of the algorithm depends on the resolution of the output image and the scene.

• Streamcluster
This RMS kernel was developed by Princeton University and solves the online clustering problem. Streamcluster was included in the PARSEC benchmark suite because of the importance of data mining algorithms and the prevalence of problems with streaming characteristics.

• Swaptions
The application is an Intel RMS workload which uses the Heath-Jarrow-Morton (HJM) framework to price a portfolio of swaptions. Swaptions employs Monte Carlo (MC) simulation to compute the prices.

• Vips
This application is based on the VASARI Image Processing System (VIPS) which was originally developed through several projects funded by European Union (EU) grants. The benchmark version is derived from a print on demand service that is offered at the National Gallery of London, which is also the current maintainer of the system. The benchmark includes fundamental image operations such as an affine transformation and a convolution.

• X264
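
The Swaptions item prices instruments by Monte Carlo simulation. The HJM model itself is too long to show here, so the sketch below swaps in a much simpler case, a European call under geometric Brownian motion, only to show the shape of a Monte Carlo pricing loop: simulate, average the payoff, discount. All names and parameters are hypothetical and this is not PARSEC's code.

```python
import math
import random


def mc_european_call(spot, strike, rate, vol, t, n_paths=50_000, seed=0):
    """Monte Carlo estimate of a European call price under geometric
    Brownian motion (a stand-in model, NOT the HJM framework).
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    drift = (rate - 0.5 * vol * vol) * t
    diffusion = vol * math.sqrt(t)
    total_payoff = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)                      # one standard normal draw
        terminal = spot * math.exp(drift + diffusion * z)
        total_payoff += max(terminal - strike, 0.0)  # call payoff at expiry
    # Discount the average payoff back to today.
    return math.exp(-rate * t) * total_payoff / n_paths


if __name__ == "__main__":
    print(f"{mc_european_call(100.0, 100.0, 0.05, 0.2, 1.0):.2f}")
```

The paths are independent of each other, so this kind of loop splits naturally across threads; the statistical error shrinks only with the square root of the number of paths, which is what makes such workloads compute-heavy.
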
90
