A brief look at the ARMv7 architecture and its features. A short overview of the main implementation details of ARMv7 cores. A more detailed look at NEON: what it is, why it is useful, and how to apply it in practice.
2. Hardware
• Typical hardware found in modern mobile devices:
– ARMv7 architecture
– Cortex A8, Cortex A9, or custom cores (Krait, Swift)
– 800 to 1500 MHz
– 1-4 cores
– Thumb-2 instruction set
– VFPv3
– NEON, optional for Cortex A9. Nvidia Tegra 2 has no NEON support
3. NEON
• NEON is a general-purpose SIMD engine designed by ARM for the ARM processor architecture
• 16 registers, 128 bits wide each. Supports operations on 8-, 16-, 32- and 64-bit integers and 32-bit float values
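To make the register layout concrete, here is a minimal sketch in C using intrinsics from arm_neon.h. The function names add_f32 and add_sat_u8 and the HAVE_NEON macro are made up for this illustration, and the scalar path is an assumption added only so the snippet also builds on targets without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* dst[i] = a[i] + b[i]; one 128-bit register holds 4 x 32-bit floats. */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if HAVE_NEON
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load 4 floats */
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));  /* 4 additions at once */
    }
#endif
    for (; i < n; ++i)                          /* scalar tail, or everything without NEON */
        dst[i] = a[i] + b[i];
}

/* Saturating byte add; the same register holds 16 x 8-bit integers. */
void add_sat_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#if HAVE_NEON
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vqaddq_u8(va, vb));   /* clamps at 255 instead of wrapping */
    }
#endif
    for (; i < n; ++i) {
        unsigned sum = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

The same 128 bits are treated as 4 lanes in the first function and 16 lanes in the second; only the element type in the intrinsic name changes.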
4. NEON
• NEON can be used for:
– Software geometry instancing;
– Skinning on ES 1.1;
– As a general vertex processor;
– Other typical applications for SIMD.
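As a rough sketch of the "general vertex processor" case, the function below multiplies a stream of 4-component vertices by a 4x4 matrix on the CPU. The name transform_vertices, the column-major layout, and the scalar fallback are assumptions made for this example, not something taken from the slides.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* out[i] = M * in[i] for n vertices of 4 floats each (x, y, z, w).
 * m is a column-major 4x4 matrix (assumed layout, as used by OpenGL ES).
 * out must not overlap in. */
void transform_vertices(float *out, const float *m, const float *in, size_t n)
{
#if HAVE_NEON
    /* Keep the four matrix columns in registers for the whole loop. */
    const float32x4_t c0 = vld1q_f32(m + 0);
    const float32x4_t c1 = vld1q_f32(m + 4);
    const float32x4_t c2 = vld1q_f32(m + 8);
    const float32x4_t c3 = vld1q_f32(m + 12);

    for (size_t i = 0; i < n; ++i) {
        const float *v = in + 4 * i;
        float32x4_t r = vmulq_n_f32(c0, v[0]);  /* r  = c0 * x */
        r = vmlaq_n_f32(r, c1, v[1]);           /* r += c1 * y */
        r = vmlaq_n_f32(r, c2, v[2]);           /* r += c2 * z */
        r = vmlaq_n_f32(r, c3, v[3]);           /* r += c3 * w */
        vst1q_f32(out + 4 * i, r);
    }
#else
    /* Scalar fallback so the sketch also builds without NEON. */
    for (size_t i = 0; i < n; ++i) {
        for (int row = 0; row < 4; ++row) {
            float acc = 0.0f;
            for (int col = 0; col < 4; ++col)
                acc += m[4 * col + row] * in[4 * i + col];
            out[4 * i + row] = acc;
        }
    }
#endif
}
```

Skinning follows the same pattern, with several matrices per vertex whose results are blended by bone weights before the store.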
5. NEON
• Some unified shader architectures, like the popular Imagination Technologies USSE1 (PowerVR SGX 530-545), are scalar, while NEON is vector by nature. Move your vertex processing from the GPU to the CPU to speed up calculations*
• ???????
• PROFIT!!!111
• *NOTE: This doesn't apply to USSE2 hardware
6. NEON
• The weakest side of mobile GPUs is fill rate. Fill rate is quickly killed by blending, and 2D games are heavy on blending. The PowerVR USSE engine doesn't care whether it is doing vertex or fragment processing, so moving your vertex processing to the CPU (NEON) leaves more room for fragment processing.
7. NEON
• There are 3 ways to use NEON vectorization in your code:
1. Intrinsics
2. Handwritten NEON assembly
3. Autovectorization by the compiler:
– LLVM flags: -mllvm -vectorize -mllvm -bb-vectorize-aligned-only
– GCC flags: -ftree-vectorizer-verbose=4 -mfpu=neon -funsafe-math-optimizations -ftree-vectorize
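For the autovectorization route, the loop itself has to be easy for the compiler to analyze. The sketch below shows one such loop; the function name scale_add is hypothetical, and whether a given compiler actually vectorizes it depends on the compiler version and flags.

```c
#include <stddef.h>

/* dst[i] += k * src[i]
 *
 * A plain counted loop with unit stride and no branches.
 * The C99 restrict qualifiers promise that dst and src do not overlap,
 * which removes the aliasing doubt that often blocks vectorization. */
void scale_add(float *restrict dst, const float *restrict src, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
}
```

With an ARM-targeting GCC, the flags from the list above would be passed roughly as gcc -std=c99 -O2 -mfpu=neon -ftree-vectorize -funsafe-math-optimizations -c scale_add.c (the file name is an assumption). GCC asks for -funsafe-math-optimizations here because ARMv7 NEON float arithmetic is not fully IEEE 754 compliant. The -ftree-vectorizer-verbose option reports which loops were vectorized on the GCC versions of that era; later releases use -fopt-info-vec for that purpose.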
11. Measurements
• Summary:
                      Running time, ms    CPU usage, %
Intrinsics            2764                19
Assembly              3664                20
FPU                   6209                25-28
FPU autovectorized    5028                22-24
• Intrinsics got me about 25% shorter running time than assembly (2764 ms vs. 3664 ms).
• Note that the speed of intrinsics code varies from compiler to compiler.
12. NEON
• Intrinsics advantages over assembly:
– Higher-level code;
– No need to manage registers;
– You can vectorize basic blocks and build a solution to every new problem from these blocks. With assembly, in contrast, you have to solve each new problem from scratch;
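The building-block idea can be sketched as a handful of small inline helpers that get composed into larger routines. Everything named here (vec4, v4_load, v4_store, v4_sub, v4_madd, v4_lerp, morph_vertices) is hypothetical and chosen for this illustration; the scalar variant is an assumption added so the snippet builds without NEON.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
/* Basic blocks: thin wrappers over single intrinsics. */
typedef float32x4_t vec4;
static inline vec4 v4_load(const float *p)            { return vld1q_f32(p); }
static inline void v4_store(float *p, vec4 v)         { vst1q_f32(p, v); }
static inline vec4 v4_sub(vec4 a, vec4 b)             { return vsubq_f32(a, b); }
static inline vec4 v4_madd(vec4 acc, vec4 a, float s) { return vmlaq_n_f32(acc, a, s); }
#else
/* Scalar stand-ins with the same interface. */
typedef struct { float f[4]; } vec4;
static inline vec4 v4_load(const float *p)
{
    vec4 v;
    for (int i = 0; i < 4; ++i) v.f[i] = p[i];
    return v;
}
static inline void v4_store(float *p, vec4 v)
{
    for (int i = 0; i < 4; ++i) p[i] = v.f[i];
}
static inline vec4 v4_sub(vec4 a, vec4 b)
{
    for (int i = 0; i < 4; ++i) a.f[i] -= b.f[i];
    return a;
}
static inline vec4 v4_madd(vec4 acc, vec4 a, float s)
{
    for (int i = 0; i < 4; ++i) acc.f[i] += a.f[i] * s;
    return acc;
}
#endif

/* A second-level block built only from the basic ones: a + t * (b - a). */
static inline vec4 v4_lerp(vec4 a, vec4 b, float t)
{
    return v4_madd(a, v4_sub(b, a), t);
}

/* A "new problem" solved by reuse: blend two vertex streams of n vec4s. */
void morph_vertices(float *out, const float *from, const float *to,
                    float t, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        v4_store(out + 4 * i,
                 v4_lerp(v4_load(from + 4 * i), v4_load(to + 4 * i), t));
}
```

The compiler handles register allocation and scheduling across the inlined helpers, which is the part that would have to be redone by hand for each new routine in assembly.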
13. NEON
• Assembly advantages over intrinsics:
– Code generated from intrinsics varies from compiler to compiler, which can make a really big difference in speed. Assembly code will always be the same.