4. H.265/HEVC FINALIZED JANUARY 25, 2013
NOTABLE CHANGES FROM H.264
• H.264’s 16x16 macroblocks are replaced with CUs of up to 64x64, organized in quadtrees
  ‒ The coding quadtree can be recursively split down to 8x8 blocks
  ‒ At every level, a coding block can choose inter or intra prediction
  ‒ The final coding blocks can be further split
  ‒ The residual is signaled in a second quadtree, which can be deeper than the coding quadtree
• Inter prediction is more accurate
  ‒ The HPEL filter has 8 taps and the QPEL filter has 7 taps (H.264 has a 6-tap HPEL filter and averages for QPEL)
  ‒ Merge candidates replace H.264's direct and skip modes
  ‒ AMVP allows the motion predictor to be selected from a list; in H.264 it was entirely implicit
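The recursive coding quadtree described above can be sketched in a few lines of Python. This is only an illustration: the function name `split_quadtree` and the decision callback `should_split` are hypothetical, and a real encoder would decide each split by comparing rate-distortion costs rather than by block position.

```python
from typing import Callable, List, Tuple

Block = Tuple[int, int, int]  # (x, y, size) in luma samples


def split_quadtree(x: int, y: int, size: int,
                   should_split: Callable[[int, int, int], bool],
                   min_size: int = 8) -> List[Block]:
    """Return the leaf coding blocks of one quadtree in Z-scan order."""
    if size > min_size and should_split(x, y, size):
        half = size // 2
        leaves: List[Block] = []
        # Z-scan order: top-left, top-right, bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                leaves += split_quadtree(x + dx, y + dy, half,
                                         should_split, min_size)
        return leaves
    return [(x, y, size)]


# Hypothetical decision rule: keep splitting only the top-left corner.
def split_top_left(x: int, y: int, size: int) -> bool:
    return x == 0 and y == 0


leaves = split_quadtree(0, 0, 64, split_top_left)
for leaf in leaves:
    print(leaf)
```

With this toy rule the 64x64 block ends up as four 8x8, three 16x16, and three 32x32 leaves, which together still cover the full 64x64 area.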
4 | PRESENTATION TITLE | NOVEMBER 19, 2013 | CONFIDENTIAL
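To make the filter lengths on slide 4 concrete, the sketch below applies the HEVC luma half-sample (8-tap) and quarter-sample (7-tap) coefficients, and the H.264 6-tap half-sample coefficients, to a single row of samples. The helper `interpolate` and the sample row `ramp` are illustrative names. The sketch is one-dimensional and omits the clipping and intermediate bit depth handling of a real codec.

```python
# HEVC luma interpolation coefficients, scaled by 64.
HEVC_HPEL = (-1, 4, -11, 40, 40, -11, 4, -1)   # 8 taps, half-sample position
HEVC_QPEL = (-1, 4, -10, 58, 17, -5, 1)        # 7 taps, quarter-sample position
# H.264 luma half-sample coefficients, scaled by 32.
H264_HPEL = (1, -5, 20, 20, -5, 1)             # 6 taps


def interpolate(samples, taps, first, scale):
    """Apply `taps` to samples[first:first + len(taps)], with rounding."""
    acc = sum(t * samples[first + i] for i, t in enumerate(taps))
    return (acc + scale // 2) // scale


# A simple ramp: the true half-way value between samples 3 and 4 is 35.
ramp = [0, 10, 20, 30, 40, 50, 60, 70]

print(interpolate(ramp, HEVC_HPEL, 0, 64))  # half-sample, 8-tap
print(interpolate(ramp, HEVC_QPEL, 0, 64))  # quarter-sample, 7-tap
print(interpolate(ramp, H264_HPEL, 1, 32))  # half-sample, 6-tap
```

H.264 has no dedicated quarter-sample filter: it averages neighbouring integer and half-sample values, which is what "avg QPEL" on the slide refers to.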
5. H.265/HEVC FINALIZED JANUARY 25, 2013
NOTABLE CHANGES FROM H.264
• More intra predictions
  ‒ DC and planar modes, similar to H.264
  ‒ 33 angular predictions, with emphasis on near-vertical and near-horizontal angles
  ‒ 35 predictions in total (for all block sizes from 32x32 to 4x4), but few special cases
• Sample Adaptive Offset loop filter for reduced compression artifacts
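The 35 intra modes and the uneven spacing of the angular ones can be shown with a small table. The displacement values below (in 1/32-sample units) are reproduced from the HEVC specification as best understood here and should be checked against the standard before being relied on. The names `INTRA_PRED_ANGLE` and `angle_of` are illustrative.

```python
PLANAR, DC = 0, 1
ANGULAR_MODES = range(2, 35)   # 33 angular modes, numbered 2..34

# Displacement per angular mode, in 1/32-sample units.
INTRA_PRED_ANGLE = [
    32, 26, 21, 17, 13, 9, 5, 2, 0, -2, -5, -9, -13, -17, -21, -26,       # modes 2..17
    -32, -26, -21, -17, -13, -9, -5, -2, 0, 2, 5, 9, 13, 17, 21, 26, 32,  # modes 18..34
]


def angle_of(mode: int) -> int:
    """Displacement of an angular mode (2..34)."""
    return INTRA_PRED_ANGLE[mode - 2]


total_modes = 2 + len(ANGULAR_MODES)   # planar + DC + angular

# Gap between neighbouring angles: small near pure horizontal (mode 10)
# and pure vertical (mode 26), larger towards the diagonals.
steps = [abs(a - b) for a, b in zip(INTRA_PRED_ANGLE, INTRA_PRED_ANGLE[1:])]

print(total_modes, angle_of(10), angle_of(26))
print(min(steps), max(steps))
```

The smallest gaps sit next to the zero-displacement modes, which is the "emphasis on near-vertical and near-horizontal angles" the slide mentions.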
6. H.265/HEVC PARALLELIZATION CONSIDERATIONS
NOTABLE CHANGES FROM H.264
• Wavefront Parallel Processing (WPP)
  ‒ Each row of largest CU blocks can be encoded in parallel, with a two-block lag behind the row above
  ‒ The CABAC state after the second block of a row is passed to the first block of the row below
  ‒ <1% loss of compression efficiency, much more efficient than slices or tiles
• Tiles: split each frame into regular rectangular parts and encode each in parallel
• Deblocking only on 8x8 boundaries, and better ordering of operations
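The two-block lag of WPP can be illustrated with a small scheduling model. It assumes every block takes exactly one time step and that enough threads are available, neither of which holds in a real encoder. The function name `wpp_start_times` is hypothetical.

```python
from collections import Counter


def wpp_start_times(rows: int, cols: int, lag: int = 2):
    """Earliest start step of each block, assuming one step per block."""
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            ready = 0
            if c > 0:
                # The left neighbour in the same row must be finished.
                ready = start[r][c - 1] + 1
            if r > 0:
                # The row above must be `lag` blocks ahead (or finished).
                dep = min(c + lag - 1, cols - 1)
                ready = max(ready, start[r - 1][dep] + 1)
            start[r][c] = ready
    return start


# 1920x1080 with 64x64 blocks: 30 columns by 17 rows.
start = wpp_start_times(rows=17, cols=30)
total_steps = start[-1][-1] + 1
peak = max(Counter(t for row in start for t in row).values())
print(total_steps, peak)
```

Under these assumptions block (r, c) starts at step c + 2r, so the 510 blocks of a 1080p frame finish in 62 steps, with at most 15 blocks in flight at once.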
7. H.265/HEVC PARALLELIZATION CONSIDERATIONS
THE FINE PRINT
• Larger block sizes reduce the effectiveness of frame parallelism
  ‒ Only a quarter as many block rows as H.264 for the same video resolution
  ‒ After accounting for deblocking and SAO, there is a three-row (192-line) lag between references
  ‒ Wavefront analysis or tiles must be used in conjunction with frame parallelism to make up for this
  ‒ A high ratio of B frames to P frames alleviates this bottleneck
• Large blocks increase serial operations and add longer data dependencies
  ‒ Each CU in the quadtree must be analyzed in Z-scan order
  ‒ Since each CU can choose intra, all prior blocks must generate reconstructed pixels; there are no shortcuts
  ‒ Variations in CU encode times reduce the effectiveness of wavefront analysis by causing stalls
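The block-row arithmetic behind the frame parallelism bullets can be checked directly. The sketch assumes 16x16 macroblocks for H.264 and 64x64 blocks for HEVC, as stated in the slides. The helper names are illustrative.

```python
import math


def block_rows(height: int, block_size: int) -> int:
    """Number of block rows needed to cover a frame of the given height."""
    return math.ceil(height / block_size)


def lag_fraction(height: int, lag_rows: int = 3, block_size: int = 64) -> float:
    """Share of the frame height covered by the reference lag."""
    return (lag_rows * block_size) / height


h264_rows = block_rows(1080, 16)
hevc_rows = block_rows(1080, 64)
lag_lines = 3 * 64

print(h264_rows, hevc_rows, lag_lines)
for height in (720, 1080, 2160):
    print(height, round(lag_fraction(height), 3))
```

For 1080p this gives 68 macroblock rows against 17 HEVC block rows, and the 192-line lag covers roughly 18% of the frame height.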
9. X265 – A SHORT HISTORY
• x265 Consortium founded in April of 2013
  ‒ Dual commercial and GPLv2+ license
  ‒ Development primarily centered in Chennai, India, with contributions from China and the US
  ‒ Started from the HEVC reference encoder (HM); less than half of the HM source remains today
  ‒ Achieved 1080p at 15 fps in June
  ‒ Public announcement and first open source release in July
• Optimizations
  ‒ WPP wavefront CTU analysis and frame parallelism
  ‒ Compiler-intrinsic SIMD performance primitives
  ‒ Hand-written assembly performance primitives
  ‒ Data flow improvements, early outs, RDO reductions
• Today
  ‒ 1080p@30fps or 720p@200fps on a 16-core SandyBridge Xeon
10. X265 – A SHORT HISTORY
• Ecosystem
  ‒ Licensed to reuse x264 source code and algorithms
  ‒ Open development on mailing list and IRC
  ‒ Public repositories on Bitbucket and VideoLan.org
  ‒ Integration into VLC, libav, ffmpeg, and Handbrake in various stages of completion
• x264 feature adoption
  ‒ Lookahead / slicetype decision and scene cut detection
  ‒ Motion estimation and bitcost functions
  ‒ CLI interface and public C interface
  ‒ Assembly primitives for SAD, SATD, SSD, etc.
  ‒ ABR and CRF rate control; VBV adoption in progress by an open source contributor
• It took eight years for x264 to dominate the H.264 encoding market
  ‒ We would like to achieve dominance in the HEVC market sooner
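For readers unfamiliar with the SAD, SATD, and SSD primitives listed above, here is a plain Python reference sketch on 4x4 blocks. It is not the x264 or x265 code, which implements these in SIMD assembly. The SATD here is left unnormalized, whereas production encoders typically scale the result.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def ssd(a, b):
    """Sum of squared differences between two equally sized blocks."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def hadamard4(v):
    """4-point Hadamard transform (unnormalized, output order not sorted)."""
    a0, a1, a2, a3 = v
    s0, s1 = a0 + a1, a2 + a3
    d0, d1 = a0 - a1, a2 - a3
    return [s0 + s1, s0 - s1, d0 + d1, d0 - d1]


def satd4x4(a, b):
    """Sum of absolute Hadamard-transformed differences of two 4x4 blocks."""
    diff = [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    rows = [hadamard4(r) for r in diff]          # transform each row
    cols = [hadamard4(c) for c in zip(*rows)]    # then each column
    return sum(abs(v) for c in cols for v in c)


flat_a = [[10] * 4 for _ in range(4)]
flat_b = [[8] * 4 for _ in range(4)]
spike = [row[:] for row in flat_a]
spike[0][0] += 3                                 # a single differing sample

print(sad(flat_a, flat_b), ssd(flat_a, flat_b), satd4x4(flat_a, flat_b))
print(sad(flat_a, spike), ssd(flat_a, spike), satd4x4(flat_a, spike))
```

The second case shows why SATD is useful: a single-sample difference spreads across all transform coefficients, so SATD penalizes it more heavily than SAD does.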
12. GPU CONSIDERATIONS
A SAD HISTORY
• Historically, GPUs have been poor for video encoding
  ‒ Intra prediction requires blocks above and to the left to be fully encoded and decoded
  ‒ Inter prediction requires blocks above and to the left to be fully analyzed
  ‒ Rate distortion optimizations require all blocks to be encoded in scan order
  ‒ Together, these dependencies severely limit the amount of parallelism that can be exposed to the GPU
• Encoder data dependencies are complex
  ‒ The cost of copying data to and from GPU device memory generally outweighs any performance improvements
  ‒ Even zero-copy memory is insufficient; the CPU and GPU must share structures at full speed
• Previous attempts at GPU encoding take shortcuts
  ‒ One can ignore some of these dependencies at the cost of compression efficiency and quality
  ‒ In x264, we only used the GPU for lookahead analysis, which has no intra or RDO dependencies
13. APU CONSIDERATIONS
A WELL BALANCED COMPUTE PROCESSOR
• Heterogeneous architecture
  ‒ GPU compute units can perform high-bandwidth and highly parallel operations
  ‒ CPU performs necessary serial and logistical operations
  ‒ CPU and GPU can see each other’s memory
• x265 opportunity
  ‒ Via WPP and frame parallelism we can expose two dozen parallel CU blocks to be encoded
  ‒ Each parallel CU block requires recursive analysis
  ‒ Control must transfer between the CPU and GPU many times to complete analysis
  ‒ GPU performs all cost estimates for intra and inter compression, loop filters, and pixel weighting
  ‒ CPU makes QT split and encode decisions, and handles entropy encoding and dependency tracking
  ‒ Many CUs can be busy on the GPU at once; only four may use the CPU cores at a time
  ‒ Making use of the GPU compute units with minimal CPU overhead is the key
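The division of labour described above (many CUs in flight, with only four in a CPU decision step at once) can be mimicked with a toy threading model. Everything here is a stand-in: `gpu_cost_estimates` just returns a made-up number, no GPU is involved, and none of the names reflect actual x265 code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

CPU_SLOTS = 4        # from the slide: only four CUs may use the CPU cores at a time
PARALLEL_CUS = 24    # "two dozen" parallel CU blocks

cpu_slots = threading.Semaphore(CPU_SLOTS)
lock = threading.Lock()
active_cpu = 0
peak_cpu = 0


def gpu_cost_estimates(cu: int, depth: int) -> int:
    """Hypothetical stand-in for GPU cost estimation; returns a made-up cost."""
    return (cu * 31 + depth * 7) % 100


def analyze_cu(cu: int, max_depth: int = 3) -> int:
    """Alternate between the 'GPU' estimate and the limited 'CPU' decision step."""
    global active_cpu, peak_cpu
    best = None
    for depth in range(max_depth + 1):
        cost = gpu_cost_estimates(cu, depth)   # many CUs may be here at once
        with cpu_slots:                        # at most CPU_SLOTS CUs in here
            with lock:
                active_cpu += 1
                peak_cpu = max(peak_cpu, active_cpu)
            best = cost if best is None else min(best, cost)
            with lock:
                active_cpu -= 1
    return best


with ThreadPoolExecutor(max_workers=PARALLEL_CUS) as pool:
    results = list(pool.map(analyze_cu, range(PARALLEL_CUS)))

print(len(results), peak_cpu)
```

Each loop iteration is one hand-off between the two sides, which is why the slide stresses that control must transfer many times per CU and that the CPU share of each hand-off has to stay small.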