Dmitriy Vovk: Code Vectorization for Mobile Platforms

CODE VECTORIZATION for mobile devices

by Dmitriy Vovk

Hardware

•  Typical hardware found in modern mobile devices:
–  ARMv7 architecture
–  Cortex A8, Cortex A9, custom cores (Krait, Swift)
–  800-1500 MHz
–  1-4 cores
–  Thumb-2 instruction set
–  VFPv3
–  NEON, optional for Cortex A9. Nvidia Tegra 2 has no NEON support

NEON

•  NEON is a general-purpose SIMD engine designed by ARM for the ARM processor architecture
•  16 registers, 128 bits wide each. Supports operations on 8-, 16-, 32- and 64-bit integers and 32-bit float values
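To make the register width concrete: one 128-bit NEON register holds four 32-bit floats, so a single instruction can add four pairs of values at once. The sketch below is an illustration and not code from the talk. The helper name `add4` is hypothetical, and the scalar branch is there only so the snippet also compiles on targets without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for i = 0..3. */
void add4(const float *a, const float *b, float *out)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t va = vld1q_f32(a);      /* load 4 floats into one 128-bit register */
    float32x4_t vb = vld1q_f32(b);
    vst1q_f32(out, vaddq_f32(va, vb));  /* a single add covers all 4 lanes */
#else
    for (int i = 0; i < 4; ++i)         /* scalar fallback for non-NEON builds */
        out[i] = a[i] + b[i];
#endif
}
```

Calling `add4` on `{1, 2, 3, 4}` and `{10, 20, 30, 40}` fills `out` with the four sums either way; only the number of instructions differs.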

NEON

•  NEON can be used for:
–  Software geometry instancing;
–  Skinning on ES 1.1;
–  As a general vertex processor;
–  Other typical applications for SIMD.

NEON

•  Some unified shader architectures, like the popular Imagination Technologies USSE1 (PowerVR SGX 530-545), are scalar, while NEON is vector by nature. Move your vertex processing from the GPU to the CPU to speed up calculations*
•  ???????
•  PROFIT!!!111
•  *NOTE: That doesn't apply to USSE2 hardware

NEON

•  The weakest side of mobile GPUs is fill rate. Fill rate is quickly killed by blending, and 2D games are heavy on this. The PowerVR USSE engine doesn't care what it processes, vertices or fragments. Moving your vertex processing to the CPU (NEON) will leave some room for fragment processing.

NEON

•  There are 3 ways to use NEON vectorization in your code:
1.  Intrinsics
2.  Handwritten NEON assembly
3.  Autovectorization by the compiler. Compiler flags for LLVM: -mllvm -vectorize -mllvm -bb-vectorize-aligned-only. Compiler flags for GCC: -ftree-vectorizer-verbose=4 -mfpu=neon -funsafe-math-optimizations -ftree-vectorize
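For the third option, the compiler needs a loop it can prove is safe to vectorize. A minimal sketch of such a loop, assuming C99 and a hypothetical function name `scale_add`, is shown here. Whether a given compiler actually emits NEON code for it depends on its version, the optimization level and the flags listed above, so it is worth checking the generated assembly.

```c
/* Hypothetical example: y[i] += a * x[i].
   `restrict` promises the compiler that x and y do not overlap, which
   removes one common obstacle to autovectorization. */
void scale_add(float a, const float *restrict x, float *restrict y, int n)
{
    for (int i = 0; i < n; ++i)   /* simple counted loop, no branches inside */
        y[i] += a * x[i];
}
```

The loop body has no cross-iteration dependency, so four iterations can in principle be mapped onto the four float lanes of one NEON register.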

Measurements

•  Intrinsics:

Measurements

•  Assembly:

Measurements

•  Summary:

                      Running time, ms    CPU usage, %
Intrinsics            2764                19
Assembly              3664                20
FPU                   6209                25-28
FPU autovectorized    5028                22-24

•  Intrinsics got me a 25% speedup over assembly (2764 ms vs. 3664 ms).
•  Note that the speed of intrinsics code varies from compiler to compiler.

NEON

•  Intrinsics advantages over assembly:
–  Higher-level code;
–  No need to manage registers;
–  You can vectorize basic blocks and build a solution to every new problem from these blocks. With assembly, in contrast, you have to solve each new problem from scratch.

NEON

•  Assembly advantages over intrinsics:
–  Code generated from intrinsics varies from compiler to compiler and can give you a really big difference in speed. Assembly code will always be the same.

Code

void Update()
{
    // Identity modelview matrix (column-major; m[12] and m[13] hold the X and Y translation).
    GLKMatrix4 modelviewMat = {
        1, 0, 0, 0,
        0, 1, 0, 0,
        0, 0, 1, 0,
        0, 0, 0, 1
    };

    const float Y_DELTA = 420.0f / QUADS_COUNT;

    for (int i = 0; i < QUADS_COUNT * VERTS_PER_QUAD; i += VERTS_PER_QUAD)
    {
        modelviewMat.m[12] = random() % 260;
        modelviewMat.m[13] = Y_DELTA;

#ifdef ASM
        // Handwritten assembly path: transforms all four vertices of the quad in one call.
        CalculateSpriteVertsWorldPos((float32x4x4_t*)proj.m,
                                     (float32x4x4_t*)modelviewMat.m,
                                     (float32x4_t*)&data[i + 0].pos,
                                     (float32x4_t*)&data[i + 1].pos,
                                     (float32x4_t*)&data[i + 2].pos,
                                     (float32x4_t*)&data[i + 3].pos);
#else
        // Intrinsics path: build the combined matrix once, then transform each vertex.
        float32x4x4_t modelviewProj;
        Matrix4ByMatrix4((float32x4x4_t*)proj.m,
                         (float32x4x4_t*)modelviewMat.m,
                         &modelviewProj);

        for (int j = 0; j < 4; ++j)
        {
            Matrix4ByVec4(&modelviewProj,
                          (float32x4_t*)&squareVertices[j],
                          (float32x4_t*)&data[i + j].pos);
        }
#endif
    }

    glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer);
    glBufferData(GL_ARRAY_BUFFER, sizeof(data), data, GL_STREAM_DRAW);
}

Code

__attribute__((always_inline)) void Matrix4ByVec4(const float32x4x4_t* __restrict__ mat,
                                                  const float32x4_t* __restrict__ vec,
                                                  float32x4_t* __restrict__ result)
{
    // result = col0 * v.x + col1 * v.y + col2 * v.z + col3 * v.w
    (*result) = vmulq_n_f32((*mat).val[0], (*vec)[0]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[1], (*vec)[1]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[2], (*vec)[2]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[3], (*vec)[3]);
}
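As a reading aid, the same computation can be written as a plain scalar loop. This is a sketch and not code from the talk: the name `mat4_mul_vec4_ref` is hypothetical, and it assumes column-major storage, where element `m[4 * c + r]` is row `r` of column `c` (the layout GLKMatrix4 uses).

```c
/* Scalar reference for a column-major 4x4 matrix times a 4-component vector.
   Each pass of the outer loop corresponds to one vmul/vmla in the NEON version:
   a whole column is scaled by one vector component and accumulated. */
void mat4_mul_vec4_ref(const float m[16], const float v[4], float out[4])
{
    for (int r = 0; r < 4; ++r)
        out[r] = 0.0f;

    for (int c = 0; c < 4; ++c)          /* one column per step */
        for (int r = 0; r < 4; ++r)      /* the 4 rows map to the 4 NEON lanes */
            out[r] += m[4 * c + r] * v[c];
}
```

A reference like this is also handy for checking the output of the intrinsics or assembly version on test data.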

Code

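__attribute__((always_inline)) void Matrix4ByMatrix4(const float32x4x4_t* __restrict__ m1,
                                                     const float32x4x4_t* __restrict__ m2,
                                                     float32x4x4_t* __restrict__ r)
{
#ifdef INTRINSICS
    // Column j of r = sum over k of (column k of m1) * (element k of column j of m2).
    (*r).val[0] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[0], 0));
    (*r).val[1] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[1], 0));
    (*r).val[2] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[2], 0));
    (*r).val[3] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[3], 0));

    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[1], vgetq_lane_f32((*m2).val[0], 1));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[1], vgetq_lane_f32((*m2).val[1], 1));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[1], vgetq_lane_f32((*m2).val[2], 1));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[1], vgetq_lane_f32((*m2).val[3], 1));

    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[2], vgetq_lane_f32((*m2).val[0], 2));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[2], vgetq_lane_f32((*m2).val[1], 2));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[2], vgetq_lane_f32((*m2).val[2], 2));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[2], vgetq_lane_f32((*m2).val[3], 2));

    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[3], vgetq_lane_f32((*m2).val[0], 3));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[3], vgetq_lane_f32((*m2).val[1], 3));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[3], vgetq_lane_f32((*m2).val[2], 3));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[3], vgetq_lane_f32((*m2).val[3], 3));
#endif
}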

Code

__asm__ volatile
(
    // Load modelView into q0-q3 and proj into q8-q11.
    "vldmia %6, { q0-q3 }\n\t"
    "vldmia %0, { q8-q11 }\n\t"

    // q12-q15 = proj * modelView, one result column per register.
    "vmul.f32 q12, q8, d0[0]\n\t"
    "vmul.f32 q13, q8, d2[0]\n\t"
    "vmul.f32 q14, q8, d4[0]\n\t"
    "vmul.f32 q15, q8, d6[0]\n\t"

    "vmla.f32 q12, q9, d0[1]\n\t"
    "vmla.f32 q13, q9, d2[1]\n\t"
    "vmla.f32 q14, q9, d4[1]\n\t"
    "vmla.f32 q15, q9, d6[1]\n\t"

    "vmla.f32 q12, q10, d1[0]\n\t"
    "vmla.f32 q13, q10, d3[0]\n\t"
    "vmla.f32 q14, q10, d5[0]\n\t"
    "vmla.f32 q15, q10, d7[0]\n\t"

    "vmla.f32 q12, q11, d1[1]\n\t"
    "vmla.f32 q13, q11, d3[1]\n\t"
    "vmla.f32 q14, q11, d5[1]\n\t"
    "vmla.f32 q15, q11, d7[1]\n\t"

    // Load the four quad vertices into q0-q3 and transform them into q8-q11.
    "vldmia %1, { q0-q3 }\n\t"

    "vmul.f32 q8, q12, d0[0]\n\t"
    "vmul.f32 q9, q12, d2[0]\n\t"
    "vmul.f32 q10, q12, d4[0]\n\t"
    "vmul.f32 q11, q12, d6[0]\n\t"

    "vmla.f32 q8, q13, d0[1]\n\t"
    "vmla.f32 q8, q14, d1[0]\n\t"
    "vmla.f32 q8, q15, d1[1]\n\t"

    "vmla.f32 q9, q13, d2[1]\n\t"
    "vmla.f32 q9, q14, d3[0]\n\t"
    "vmla.f32 q9, q15, d3[1]\n\t"

    "vmla.f32 q10, q13, d4[1]\n\t"
    "vmla.f32 q10, q14, d5[0]\n\t"
    "vmla.f32 q10, q15, d5[1]\n\t"

    "vmla.f32 q11, q13, d6[1]\n\t"
    "vmla.f32 q11, q14, d7[0]\n\t"
    "vmla.f32 q11, q15, d7[1]\n\t"

    // Store the transformed vertices to v1-v4.
    "vstmia %2, { q8 }\n\t"
    "vstmia %3, { q9 }\n\t"
    "vstmia %4, { q10 }\n\t"
    "vstmia %5, { q11 }"

    :
    : "r" (proj), "r" (squareVertices), "r" (v1), "r" (v2), "r" (v3), "r" (v4), "r" (modelView)
    : "memory", "q0", "q1", "q2", "q3",
      "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15"
);

Docs

•  For a detailed explanation of intrinsics and assembly, see:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0491e/CIHJBEFE.html

Contact me

http://www.linkedin.com/in/dvovk/

http://nukecode.blogspot.com/
