Trends and Recent Developments in Video Coding Standardization

ICME 2018 Tutorial, San Diego, 23.07.2018
Jens-Rainer Ohm, Institute of Communication Engineering, RWTH Aachen University, Germany, ohm@ient.rwth-aachen.de
Mathias Wien, Institute of Imaging and Computer Vision, RWTH Aachen University, Germany, wien@lfb.rwth-aachen.de

Trends and Recent Developments in Video Coding Standardization | Tutorial at ICME 2018 | San Diego, CA, USA |
Jens-Rainer Ohm and Mathias Wien | RWTH Aachen University | Institut für Nachrichtentechnik | Lehrstuhl für Bildverarbeitung | 23.07.2018
1. Introduction and history of video coding standardization (Jens)
2. Source formats and resolutions (Mathias)
3. State of the art in video compression (Mathias)
4. Versatile Video Coding (Jens)
5. Exploratory trends and perspectives (Jens)
6. Coding tools for multi-camera captures (Jens)
7. Summary and outlook
Outline

Part I: Introduction and history of video coding standardization

Video coding standardization organisations
• ISO/IEC MPEG = “Moving Picture Experts Group”
(ISO/IEC JTC 1/SC 29/WG 11 = International Organization for Standardization and International Electrotechnical Commission,
Joint Technical Committee 1, Subcommittee 29, Working Group 11)
• ITU-T VCEG = “Video Coding Experts Group”
(ITU-T SG16/Q6 = International Telecommunication Union – Telecommunication Standardization Sector (ITU-T,
a United Nations organization, formerly CCITT),
Study Group 16, Working Party 3, Question 6)
• JVT = “Joint Video Team”, collaborative team of MPEG & VCEG, responsible for developing AVC
(discontinued in 2009)
• JCT-VC = “Joint Collaborative Team on Video Coding”, team of MPEG & VCEG, responsible for
developing HEVC (established January 2010)
• JVET = “Joint Video Experts Team”, exploring potential for new technology beyond HEVC (established Oct.
2015 as Joint Video Exploration Team, renamed Apr. 2018)

History of international video coding standardization (1985 → 2020)
• ITU-T: H.120 (1984-1988), H.261 (1990+), H.263/+/++ (1995-2000+)
• ISO/IEC: MPEG-1 (1993), MPEG-4 Visual (1998-2001+)
• Joint ITU-T and ISO/IEC:
 H.262 / 13818-2 (MPEG-2) (1994/95-1998+)
 H.264 / 14496-10 AVC (Advanced Video Coding, developed by JVT) (2003-2018+)
 H.265 / 23008-2 HEVC (High Efficiency Video Coding, developed by JCT-VC) (2013-2018+)
 H.26x / 23090-3 VVC (Versatile Video Coding, to be developed by JVET) (2020-...)
• Application range over time: videotelephony, computer, SD, HD, 4K UHD, 8K, 360, ...

The scope of video standardization
• Only Specifications of the Bitstream, Syntax, and Decoder are standardized:
• Permits optimization beyond the obvious
• Permits complexity reduction for implementability
• Provides no guarantees of quality
[Figure: processing chain Source → Pre-Processing → Encoding → Decoding → Post-Processing & Error Recovery → Destination, with the "Scope of Standard" marked at the decoding stage]

Hybrid Coding Concept
Basis of every standard since H.261

Hybrid video coding concept
Used since early days of video compression standards, e.g. H.261, MPEG-1/-2/-4, H.263, AVS, H.264/AVC, HEVC and also in most proprietary codecs (VC1, VP8 etc.)
[Figure sequence: step-by-step walk-through of the hybrid coding loop]
• Input signal → DCT → quantization → bitstream (010011101001…) → inverse DCT → reconstruction
• Next input signal compared with the reconstruction
• Input signal − MC prediction = residual (shown against the residual without MC)
• Residual → DCT → quantization → bitstream (010011101001…) → inverse DCT
• Residual + MC prediction = reconstruction, etc.
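The hybrid coding loop walked through on these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm of any standard: one-dimensional 8-sample "frames", a floating-point DCT-II, a uniform quantizer with a hypothetical step size `STEP`, and zero-motion prediction from the previous reconstruction in place of real motion compensation. All names are made up for this sketch.

```python
import math

N = 8        # block length (assumption for this sketch)
STEP = 4.0   # quantizer step size (assumption for this sketch)

def _scale(k):
    return math.sqrt((1.0 if k == 0 else 2.0) / N)

def dct(x):
    """Orthonormal DCT-II of a length-N list."""
    return [_scale(k) * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                            for n in range(N)) for k in range(N)]

def idct(c):
    """Inverse of dct()."""
    return [sum(_scale(k) * c[k] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for k in range(N)) for n in range(N)]

def code_block(block, prediction):
    """Code one block; return (quantized levels, reconstruction)."""
    residual = [s - p for s, p in zip(block, prediction)]
    levels = [round(c / STEP) for c in dct(residual)]    # transform + quantization
    rec_residual = idct([l * STEP for l in levels])      # what the decoder recovers
    reconstruction = [p + r for p, r in zip(prediction, rec_residual)]
    return levels, reconstruction

def code_sequence(frames):
    """First frame coded without prediction, later frames predicted from the last reconstruction."""
    prediction = [0.0] * N
    all_levels, all_recs = [], []
    for frame in frames:
        levels, rec = code_block(frame, prediction)
        all_levels.append(levels)
        all_recs.append(rec)
        prediction = rec   # the encoder predicts from what the decoder will have
    return all_levels, all_recs

frames = [[100, 102, 104, 106, 108, 110, 112, 114],
          [101, 103, 105, 107, 109, 111, 113, 115]]
levels, recs = code_sequence(frames)
```

The point of the design is that the encoder contains the decoder: prediction always uses the reconstructed signal, so encoder and decoder stay in sync despite quantization.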

Performance history of standard generations
[Figure: PSNR (dB, axis 28 to 40) versus bit rate (kbit/s, axis 0 to 300) for Foreman, QCIF, 10 Hz, 100 frames; curves for JPEG, H.261, H.262/MPEG-2, H.263+, MPEG-4 Visual, AVC and HEVC; annotation at 35 dB: "Bit-rate Reduction: 50%"]

• Improvements of motion compensation
 Variable partitions & merged partitions
 Flexible frame referencing & combined prediction
 Sub-sample precision and high performance sub-sample interpolation
 More efficient vector prediction & coding, supporting large vector ranges
• Improvements of 2D coding
 Efficient intra prediction and intra mode coding
 Design of transform bases and variable transform block sizes
• Loop filtering for artifact reduction
 Deblocking, sample-adaptive offset
• Improvements of entropy coding
 Flexible binarization of syntax elements
 Arithmetic coding
 Adaptation and usage of context information
• These are coupled with encoder optimization
 Rate distortion optimization – spend bits where they give best benefit in terms of distortion reduction
 Adaptive rate control and perceptually tuned quantization
What made this happen over the years?

• Group of Picture (GoP) structures allowing random access (used since MPEG-1)
• Bi-(directional) prediction for better compression performance (used since MPEG-1)
Reference picture structures
[Figure: (a) uni-directional prediction with previous and pre-previous picture references; (b) bi-directional prediction; pictures numbered 1 to 7 with picture types I|P, P and B]

• Hierarchical prediction structures for frame rate scalability and further improved compression performance
(used in AVC and HEVC)
Reference picture structures
[Figure: hierarchical prediction structures over temporal levels L0 to L3, (a) with P pictures and (b) with B pictures; key pictures I0/P0 at level L0]

 Coder control is a non-normative part of video codecs
 Choose coding parameters at encoder side
“What part of the video signal should be coded using what method and parameter settings?”
 Constrained problem: $\mathbf{p}_{\mathrm{opt}} = \arg\min_{\mathbf{p}} D(\mathbf{p}) \quad \text{s.t.} \quad R(\mathbf{p}) \le R_{\mathrm{Target}}$
 Unconstrained Lagrangian formulation: $\mathbf{p}_{\mathrm{opt}} = \arg\min_{\mathbf{p}} \left[ D(\mathbf{p}) + \lambda \cdot R(\mathbf{p}) \right]$
 λ depends on slope of rate-distortion function:
 Small value: High rate, low distortion
 High value: Low rate, high distortion
 Can be applied in motion parameter estimation, mode decision, transform coefficient
quantization, …; typically a fixed relationship is set between λ and the QP value
D - Distortion
R - Rate
p - Parameter Vector
Coder control
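A minimal sketch of Lagrangian mode decision, assuming the encoder already knows distortion and rate for each candidate. The candidate names and numbers below are hypothetical and only chosen to show how the Lagrange multiplier shifts the decision.

```python
def choose_mode(candidates, lam):
    """Return the candidate minimizing the Lagrangian cost J = D + lam * R.

    candidates: iterable of (name, distortion, rate) tuples.
    """
    return min(candidates, key=lambda c: c[1] + lam * c[2])

# Hypothetical (mode, distortion, rate in bits) triples for one block
candidates = [
    ("skip",        900.0,   1),
    ("inter_16x16", 400.0,  30),
    ("inter_8x8",   250.0,  70),
    ("intra",       200.0, 120),
]

low_lambda = choose_mode(candidates, lam=0.5)[0]    # small multiplier: high rate, low distortion
high_lambda = choose_mode(candidates, lam=50.0)[0]  # large multiplier: low rate, high distortion
```

With these numbers a small multiplier selects the most expensive, least distorted candidate, and a large multiplier selects the cheapest one.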

• Video resolution is continually increasing
 HD existing, UHD (4Kx2K, 8Kx4K) appearing
 Mobile services going towards HD/UHD
 Stereo, multi-view, 360° video
• Devices available to record and display ultra-high resolutions
 Becoming affordable for home and mobile consumers
• Video has multiple dimensions to grow the data rate
 Frame resolution, Temporal resolution
 Color resolution, bit depth
 Multi-view
 Visible distortion still an issue with existing networks
• Necessary video data rate grows faster than feasible network transport capacities
 Better video compression (than current HEVC) needed in next decade, even after availability of 5G
Motivation for improved video compression

Part II: Source formats and resolutions

• Sequence of pictures successively captured or rendered
• Progressive and interlaced formats
• Picture rate measured in pictures per second, unit Hertz (Hz)
• Minimum picture rate at 24Hz for impression of fluent motion [Po12]
 Standard Definition TV at 50/60Hz interlaced
 High Definition (HD) video at 50/60Hz progressive
 Ultra HD (UHD) video up to 120Hz
 Up to 300Hz considered
Structure of a Video Sequence
[Po12] Charles Poynton. Digital Video and HD: Algorithms and Interfaces. Waltham, MA, USA: Morgan Kaufmann Publishers, 2012.

• Picture
 Set of arrays or a single array of samples with intensity values
 Monochrome picture: single intensity array
 Color video: usually three intensity arrays
⇒ three color components representing the color
 Color sample (all three components) also referred to as a pixel
(derived from picture element, sometimes also denoted as pel)
 Optional alpha channel to indicate opaqueness (transparency) for mixing applications
Pictures, Frames, and Fields

• Picture
 Set of pixel lines, defined number of pixels per line
 Shape of pixels not necessarily square, depends on picture format
 Examples:
Pixel Shape

• Human visual system less sensitive to color than to structure and texture
⇒ full resolution luma, lower resolution chroma
• Chroma sub-sampling types commonly specified by relation between
number of luma and chroma samples
YCbCr Y : X1 : X2
• With Y: number of luma samples
• Sub-sampling format of chroma components specified by X1 and X2
• X1 : horizontal sub-sampling
• X2 = 0: vertical sub-sampling identical to horizontal sub-sampling
• X2 = X1 : no vertical sub-sampling
Chroma Sub-Sampling
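The Y : X1 : X2 rules listed on this slide can be turned into a small helper that returns the chroma plane size. This is a sketch under the assumption that the picture dimensions are divisible by the sub-sampling factors; the function name is made up for illustration.

```python
def chroma_plane_size(luma_width, luma_height, fmt):
    """Chroma plane size for a 'Y:X1:X2' format string, following the rules on the slide."""
    y, x1, x2 = (int(v) for v in fmt.split(":"))
    horizontal_factor = y // x1                 # e.g. 4:2:x keeps every 2nd column
    if x2 == 0:
        vertical_factor = horizontal_factor     # vertical identical to horizontal
    elif x2 == x1:
        vertical_factor = 1                     # no vertical sub-sampling
    else:
        raise ValueError("format not covered by the two rules on the slide")
    return luma_width // horizontal_factor, luma_height // vertical_factor
```

For an HD picture, 4:2:0 gives a quarter of the luma sample count per chroma component, 4:2:2 gives half, and 4:4:4 keeps full resolution.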

• Color Impression
 Visible spectrum ranges from
380 nm to 780 nm
 Impression of color: intensity density
distribution over the visible spectral range
 Colors corresponding to single wavelength:
 spectral colors or primary colors
 Human visual system has three color receptors (cone cells)
 Maximum sensitivity in the wavelength areas of red, green and blue
 Additional ’gray-scale’ receptors (rod cells): responsive in low lighting conditions
Representation of Color
Picture source: Wikipedia, artwork by Holly Fischer

• Visual perception split into perception of brightness (light and dark) and
chromaticity (color impression)
 Brightness is driven by summarized intensity of observed spectrum
 Color impression is driven by shape of intensity distribution
• Functional expression to represent perceived color by a mathematical
description first standardized in the CIE 1931 Standard Observer
• Color as a point in a three-dimensional XYZ space
• X,Y,Z values derived from the observed spectrum
• Three color matching functions
The CIE Standard Observer
CIE: Commission internationale de l’éclairage, http://www.cie.co.at
Standard Observer specified in ISO 11664-1


• Normalization to express the chromaticity independently of the observed brightness:
$x = \frac{X}{X+Y+Z}, \quad y = \frac{Y}{X+Y+Z}, \quad z = \frac{Z}{X+Y+Z}$
• Since $x + y + z = 1$, therefore $z = 1 - x - y$
• Chromaticity specified by (x,y)-pair
• Definition of a standardized white point, e.g. ’white C’, ’white D65’
[Po12] Charles Poynton. Digital Video and HD: Algorithms and Interfaces. Waltham, MA, USA: Morgan Kaufmann Publishers, 2012.
[Hu04] Robert G.W. Hunt. The Reproduction of Colour. 6th ed. Chichester, West Sussex, England: Wiley-VCH, 2004.
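As a small worked example, the chromaticity coordinates follow from the tristimulus values by dividing X and Y by the sum X + Y + Z. The sketch below assumes commonly quoted XYZ values for the D65 white point (normalized to Y = 1); the function name is hypothetical.

```python
def chromaticity(X, Y, Z):
    """Return the (x, y) chromaticity coordinates of a CIE XYZ colour."""
    total = X + Y + Z
    if total == 0:
        raise ValueError("chromaticity is undefined for X = Y = Z = 0")
    return X / total, Y / total

# D65 white point, commonly quoted tristimulus values normalized to Y = 1 (assumption)
x, y = chromaticity(0.95047, 1.0, 1.08883)
```

Scaling all three tristimulus values by the same factor leaves (x, y) unchanged, which is exactly the brightness independence the normalization is meant to provide.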

• Colour space
 Standard Dynamic Range (SDR) video
 Contrast approx. 1000 : 1
 ITU-R BT.709 colour space
 High Dynamic Range (HDR) video
 Contrast approx. 1000000 : 1
 ITU-R BT.2100 colour space
Color Spaces: Standard and High Dynamic Range / Wide Color Gamut
Figure from N15084: Ajay Luthra, Edouard Francois, and Walt Husak (Eds.). Requirements and Use Cases for HDR and
WCG Content Coding. Doc. N15084. Geneva, CH, 111th meeting: MPEG, Feb. 2015.
ITU-R BT.709: Parameter values for the HDTV standards for production and international programme exchange. ITU-R,
Apr. 2004. URL: http://www.itu.int/rec/R-REC-BT.709/en .
ITU-R BT.2020: Parameter values for ultra-high definition television systems for production and international programme
exchange. ITU-R, Oct. 2015. URL : http://www.itu.int/rec/R-REC-BT.2020/en
ITU-R BT.2100: Image parameter values for high dynamic range television for use in production and international
programme exchange. ITU-R, Jun. 2017. URL: http://www.itu.int/rec/R-REC-BT.2100-1-201706-I/en

Color Spaces: Standard and High Dynamic Range / Wide Color Gamut
Figure from: Ajay Luthra, Edouard Francois, and Walt Husak (Eds.). Requirements and Use Cases for HDR and WCG Content Coding. Doc. N15084. Geneva, CH, 111th meeting: MPEG, Feb. 2015.

HDR/WCG Conversion Practices: Scope
ITU-T H Suppl. 15 | ISO/IEC TR 23008-14, Conversion and Coding Practices for HDR/WCG Y′CbCr 4:2:0 Video with PQ Transfer Characteristics.
ITU-T H Suppl. 18 | ISO/IEC TR 23008-15, Signalling, backward compatibility and display adaptation for HDR/WCG video coding.
Figure from: Jonatan Samuelsson et al.: Conversion and Coding Practices for HDR/WCG Y′CbCr 4:2:0 Video with PQ Transfer Characteristics (Draft 4). Doc. JCTVC-Z1017. 26th meeting,
Geneva, CH: Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, Jan 2017.

Part III: State of the Art in Video Compression

Comparison of HEVC and the Joint Exploration Test Model (JEM) of JVET
• A glimpse on high-level syntax (HEVC)
• Coding structures
• Walk-through of the coding loop
 Intra coding
 Inter coding
 Transform coding
 Loop filters
 Entropy coding
Outline and Concept for Part III

• Coded Video Sequence (CVS)
 Starts with a random access point (intra-coded picture)
 One or more CVSs in a bitstream
→ Coded Video Sequence Group (CVSG)
• Network Abstraction Layer (NAL)
 Encapsulation of coded video sequence for transport and storage
 Video coding layer (VCL) NAL units
 Information directly for reconstruction of samples and pictures
 Non-VCL NAL units
 Parameter sets
 Supplemental enhancement information
 ...
Network Abstraction Layer and Video Coding Layer

• RBSP: Raw byte sequence payload
 Sequence of bytes comprising the coded NAL unit payload
 RBSP stop bit (=’1’) plus zero bits for byte alignment
• SODB: String of data bits
 Concatenation of bits in the RBSP bytes from MSB to LSB
 All bits needed for the decoding process
 Only the bits needed for the decoding process
NAL Unit Structure
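The relation between RBSP and SODB can be illustrated with two small helpers. This is a simplified sketch: it works on the payload only, and it assumes that emulation prevention bytes have already been removed and that no trailing zero words follow the RBSP trailing bits. Function names are made up for illustration.

```python
def rbsp_to_sodb(rbsp: bytes) -> str:
    """Return the SODB as a string of '0'/'1' characters, MSB first."""
    bits = "".join(f"{byte:08b}" for byte in rbsp)
    stop = bits.rfind("1")          # the last '1' is the RBSP stop bit
    if stop < 0:
        raise ValueError("no RBSP stop bit found")
    return bits[:stop]              # drop the stop bit and the alignment zeros

def sodb_to_rbsp(sodb: str) -> bytes:
    """Append the stop bit and zero bits up to the next byte boundary."""
    bits = sodb + "1"
    bits += "0" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

Because the stop bit is always present, a decoder can find the exact end of the data bits even when the SODB length is not a multiple of eight.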

• Blocks and Units
 Block: Square or rectangular area in a color component array
 Unit: Collocated blocks of the (three) color components, associated syntax elements and
prediction data (e.g. motion vectors)
• Picture partitioning
 Coding Tree Blocks / Coding Tree Units (CTBs / CTUs)
 Each CTU in exactly one slice segment
 Independent slice segment: full header, independently decodable
 Dependent slice segment: very short header, relies on corresponding independent slice,
inherits CABAC state
• Slice types
 I-slice: Intra prediction only
 P-slice: Intra prediction and motion compensation with one reference picture list
 B-slice: Intra prediction and motion compensation with one or two reference picture lists
HEVC Spatial Coding Structures
CABAC: Context-based Adaptive Binary Arithmetic Coding

Tiles in HEVC
• Change scanning order of CTBs in picture
• Slices in tiles, or tiles in slices
• Reset of prediction and entropy coding → parallel processing

• Maximum CTU size: 64×64 pixels
• Quadtree partitioning of CTB into CBs
• If picture size not integer multiple of CTB size:
 Implicit CTB partitioning to meet picture size (must be multiple of 8×8 pixels)
HEVC: Coding Tree Blocks and Coding Blocks (CBs)
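A quick way to see when implicit partitioning is needed is to compute the CTU grid for a given picture size. The helper below is a sketch with hypothetical names; it only counts CTUs and reports how many luma rows of the last CTU row lie inside the picture.

```python
def ctu_grid(width, height, ctu_size=64):
    """Number of CTU columns and rows needed to cover the picture."""
    cols = -(-width // ctu_size)    # integer ceiling division
    rows = -(-height // ctu_size)
    return cols, rows

def last_row_height(height, ctu_size=64):
    """Luma rows of the bottom CTU row that lie inside the picture."""
    remainder = height % ctu_size
    return remainder if remainder else ctu_size
```

For 1920×1080 with 64×64 CTUs, the bottom CTU row only covers 56 luma rows, so those CTUs are split implicitly at the picture boundary.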

• Prediction block partitioning of a 2N×2N CB
• Transform block partitioning of a CB
 Quadtree partitioning of CB → Residual Quad Tree (RQT)
 Transform size 4×4 to 32×32
 TB size 4×4 to 64×64
 PB boundaries inside TBs allowed
HEVC: Prediction Blocks (PBs) and Transform Blocks (TBs)

• QTBT structure removes concept of multiple partition types (TU = PU = CU)
• Maximum CTU size: 256×256 pixels (128×128 used in common testing conditions)
• Binary trees starting from leaves of quad-tree (with horizontal / vertical split indication)
→ CU can have either square or rectangular shape
• Configuration
 MinQTSize, MaxBTSize : minimum quadtree leaf node size / maximum binary tree root node size
 MaxBTDepth, MinBTSize : maximum binary tree depth / minimum binary tree leaf node size
JEM: Quad-Tree plus Binary Tree Partitioning (QTBT)
Figure from: Jianle Chen et al. Algorithm Description of Joint Exploration Test Model 7. Doc. JVET-G1001. Torino, IT, 7th meeting: Joint Video
Exploration Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, Jul. 2017.

Intra Prediction

Intra prediction modes
• Planar prediction: mode 0
• DC intra prediction: mode 1
• Numbering from diagonal-up to diagonal-down
 Modes 2 – 18: horizontal
• Modes 19 – 34: vertical
• Horizontal: mode 10
Vertical: mode 26
Intra prediction block size
• Intra prediction mode coded per CU
• Prediction block size derived from residual quadtree
• Boundary samples of neighboring block used for prediction
• Efficient representation
• Local update of prediction source
HEVC Intra Prediction Modes

• Concept of HEVC as basis
 Higher number of prediction modes
 Larger maximum block size
• Chroma
 Prediction modes from neighbors
 Derived modes from collocated luma
JEM Intra Prediction Modes
Figure from: Jianle Chen et al. Algorithm Description of Joint Exploration Test Model 7. Doc. JVET-G1001. Torino, IT, 7th meeting: Joint Video
Exploration Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, Jul. 2017.

• HEVC
 2-tap filters
 Weight derived from prediction direction
• JEM
 4-tap filters
 Cubic interpolation for blocks with ≤ 64 samples
 Gaussian interpolation filters elsewhere
 Parameters fixed according to block size
 Same filter for all predicted samples, all modes
Interpolation Filters for Directional Intra Prediction Modes

• HEVC
 Boundary sample filtering for intra prediction modes 10, 26
(horizontal / vertical)
 Local, 1-sample update at boundary perpendicular to prediction direction
• JEM
 Extended to directional modes
 Boundary samples up to four columns or rows
 2-tap filter for intra modes 2 & 34
 3-tap filter for intra modes 3–6 & 30–33
Intra Prediction Boundary Filtering
Figure from: JVET-G1001: Algorithm Description of Joint Exploration Test Model 7.

• Chroma samples predicted using corresponding reconstructed luma samples
$\mathrm{pred}_C(i,j) = \alpha \cdot \mathrm{rec}_L'(i,j) + \beta$
• Parameters α and β: minimize regression error between
neighbouring reconstructed luma and chroma samples
around current block
• Further prediction between chroma components with updated
parameters
$\mathrm{pred}^*_{Cr}(i,j) = \mathrm{pred}_{Cr}(i,j) + \alpha \cdot \mathrm{resi}_{Cb}'(i,j)$
Multiple model CCLM mode (MMLM)
• Neighbouring luma samples and neighbouring chroma samples classified
into two groups
• Linear model for each group
JEM: Cross-Component Linear Model Prediction (CCLM)
Figures from: JVET-G1001: Algorithm Description of Joint Exploration Test Model 7.
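The regression behind the linear model can be sketched as an ordinary least-squares fit over the neighbouring sample pairs. This is a floating-point illustration only; the luma down-sampling and the integer arithmetic of the actual JEM software are not modelled, and the function names are hypothetical.

```python
def cclm_parameters(luma_neighbours, chroma_neighbours):
    """Least-squares fit chroma ≈ alpha * luma + beta over neighbouring samples."""
    n = len(luma_neighbours)
    sum_l = sum(luma_neighbours)
    sum_c = sum(chroma_neighbours)
    sum_ll = sum(l * l for l in luma_neighbours)
    sum_lc = sum(l * c for l, c in zip(luma_neighbours, chroma_neighbours))
    denom = n * sum_ll - sum_l * sum_l
    if denom == 0:                   # flat luma neighbourhood: offset only
        return 0.0, sum_c / n
    alpha = (n * sum_lc - sum_l * sum_c) / denom
    beta = (sum_c - alpha * sum_l) / n
    return alpha, beta

def cclm_predict(rec_luma_block, alpha, beta):
    """Apply pred_C(i, j) = alpha * rec_L'(i, j) + beta to a block given as rows."""
    return [[alpha * s + beta for s in row] for row in rec_luma_block]

alpha, beta = cclm_parameters([100, 120, 140, 160], [60, 70, 80, 90])
prediction = cclm_predict([[200, 180]], alpha, beta)
```

Since the decoder can run the same fit on already reconstructed neighbours, the parameters need not be transmitted.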

• Combination of the unfiltered boundary
reference samples and HEVC-style intra
prediction with filtered boundary reference
samples
 Position-dependent weighting of filtered
and unfiltered reference, configurable by
four weighting parameters (hor/ver + corner)
 Filtered reference: linear combination of
unfiltered reference and lowpass, configurable
weight
 Three predefined lowpass filters selectable
(3-tap, 5-tap, 7-tap)
 Prediction parameters stored per block size
JEM: Position Dependent Intra Prediction Combination for Planar Mode (PDPC)
Figure from: JVET-G1001: Algorithm Description of Joint Exploration Test Model 7.

• HEVC
 Bi-linear smoothing
 Depending on prediction block size
Mode-dependent Intra Reference Sample Smoothing (MDIS)
• Temporarily adopted in JEM (removed in JEM7)
 Adaptive reference sample smoothing (ARSS)
 3-tap LPF with the coefficients of [1, 2, 1] / 4
 5-tap LPF with the coefficients of [2, 3, 6, 3, 2] / 16
Figure from: JVET-F1001: Algorithm Description of Joint Exploration Test Model 6.

Inter Prediction

Prediction from reference picture lists
• Uni-prediction
 P-slices only with List0, B-slices with List0 or List1
 HEVC: Minimum PB size 8×4 or 4×8
• Bi-prediction, only in B-slices
 One predictor from List0, one predictor from List1
 HEVC: Minimum prediction block size 8×8
Motion Compensated Prediction

• Merge mode
 Motion vector (MV) derived from candidate set
(spatial and temporal neighborhood)
 Merge mode candidate index coded
 No motion vector difference encoded
• Advanced motion vector prediction
 Predictor derived from candidate set
(spatial and temporal neighborhood)
 Predictor index coded
 Motion vector difference encoded
• Skip mode
 Only merge candidate signaled, no residual
HEVC: Motion Vector Representation

• CU: at most one set of motion parameters for each prediction direction
• Option to split large CU into sub-CUs
 Alternative temporal motion vector prediction (ATMVP)
 Fetch multiple sets of motion information from multiple blocks in collocated reference picture
 Spatial-temporal motion vector prediction (STMVP)
 Derive recursively by temporal motion vector predictor and spatial
neighbouring motion vector
• ATMVP and STMVP: additional merge candidates (list extended to max 7)
JEM: Sub-CU based motion vector prediction
Figures from: JVET-F1001: Algorithm Description of Joint Exploration Test Model 6.

• Locally adaptive motion vector resolution (LAMVR)
motion vector difference (MVD) coded in units of
 quarter luma samples,
 integer luma samples, or
 four luma samples
• Higher motion vector storage accuracy
 Internal motion vector storage and merge candidate at 1/16 pel (skip and merge modes only)
 SHVC upsampling interpolation filters for the additional fractional pel positions
JEM Motion Vector Representation
SHVC: Scalable High Efficiency Video Coding, HEVC Annex G

• Overlapped Block Motion Compensation (OBMC) previously used in ITU-T H.263
• Switchable on CU level
 Motion compensation block boundaries except the right and bottom boundaries of CU
 Applied for both the luma and chroma components
 Performed at sub-block level for all MC block boundaries
JEM: Overlapped Block Motion Compensation

• Linear model for illumination changes, using a scaling factor a and an offset b (concept taken from 3D-HEVC)
• Enabled or disabled adaptively for each inter-mode coded coding unit (CU)
• Least square error method employed to derive the parameters a and b
• CU in 2N×2N merge mode
 LIC flag copied from neighbouring blocks (like merge)
 Otherwise, LIC flag at CU level
JEM: Local Illumination Compensation (LIC)

• Motion vector field (MVF) for CU, applicable MV derived for each
4×4 block at 1/16 pel resolution
 Control point motion vector (CPMV)
• AF_INTER mode
 Signaling CPMV difference from predictor
 Block width and height ≥ 8 required
• AF_MERGE mode
 Derivation of CPMV from neighborhood
JEM: Affine Motion Vector Derivation for MC
$$
\begin{cases}
v_x = \dfrac{(v_{1x} - v_{0x})}{w}\, x - \dfrac{(v_{1y} - v_{0y})}{w}\, y + v_{0x} \\[4pt]
v_y = \dfrac{(v_{1y} - v_{0y})}{w}\, x + \dfrac{(v_{1x} - v_{0x})}{w}\, y + v_{0y}
\end{cases}
$$
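The derivation of a motion vector field from two control point motion vectors can be sketched as follows. The sketch assumes the four-parameter affine model with `v0` at the top-left and `v1` at the top-right corner of a block of width `w`, evaluates the model at the centre of each 4×4 sub-block, and rounds to 1/16 sample. Names and the floating-point arithmetic are illustrative, not taken from the JEM software.

```python
def affine_mv(x, y, v0, v1, w):
    """MV at position (x, y) from top-left (v0) and top-right (v1) control point MVs."""
    a = (v1[0] - v0[0]) / w
    b = (v1[1] - v0[1]) / w
    vx = a * x - b * y + v0[0]
    vy = b * x + a * y + v0[1]
    return vx, vy

def affine_mv_field(width, height, v0, v1, sub=4):
    """One MV per sub x sub block, evaluated at the block centre, rounded to 1/16 sample."""
    field = []
    for by in range(0, height, sub):
        row = []
        for bx in range(0, width, sub):
            vx, vy = affine_mv(bx + sub / 2, by + sub / 2, v0, v1, width)
            row.append((round(vx * 16) / 16, round(vy * 16) / 16))
        field.append(row)
    return field
```

When both control point vectors are equal, the model degenerates to pure translation and every sub-block receives the same vector.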

• Special merge mode based on Frame-Rate Up Conversion (FRUC) techniques
Options for
 Bilateral matching
 Template matching (applicable also for AMVP mode, CU level only)
• Motion vector derivation process
 Initial motion vector for CU of size 𝑊 × 𝐻
 Sub-CU motion refinement for blocks of size 𝑀 × 𝑀
$M = \max\left\{4, \min\left\{\frac{W}{2^D}, \frac{H}{2^D}\right\}\right\}$
JEM: Pattern Matched Motion Vector Derivation (PMMVD)
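The sub-CU size rule M = max{4, min{W / 2^D, H / 2^D}} is easy to check numerically. In this sketch the splitting depth D is passed as a parameter; the default of 3 is an assumption made for illustration, since the slide does not state its value.

```python
def fruc_sub_cu_size(width, height, depth=3):
    """Sub-CU size M = max(4, min(W / 2^D, H / 2^D)) for power-of-two CU sizes."""
    return max(4, min(width >> depth, height >> depth))
```

For example, a 64×64 CU is refined on 8×8 sub-CUs, while smaller or narrow CUs fall back to the 4×4 minimum.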

• Sample-wise motion refinement on top of block-wise motion compensation for bi-prediction
• No extra signaling, applied on 4×4 block basis
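• MVF determined by minimizing difference Δ between points A and B on trajectory
by Taylor expansion
$\Delta = I^{(0)} - I^{(1)} + v_x \left( \tau_1 \frac{\partial I^{(1)}}{\partial x} + \tau_0 \frac{\partial I^{(0)}}{\partial x} \right) + v_y \left( \tau_1 \frac{\partial I^{(1)}}{\partial y} + \tau_0 \frac{\partial I^{(0)}}{\partial y} \right)$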
• Limited search window
• Optimized search
 First vertical, then horizontal search
 Memory usage: only access samples
inside block
JEM: Bi-directional optical flow (BIO)

• MVs of bi-prediction refined by bilateral template matching process
• Search between bilateral template and reference pictures
⇒ refined MV without further signaling
• Applied only with reference pictures with pocRef𝑖 < poccurr < pocRef𝑗
• Not applied if enabled in CU:
 LIC,
 Affine motion,
 FRUC, or
 sub-CU merge candidate
JEM: Decoder-side Motion Vector Refinement (DMVR)

Residual Coding

• Transform block sizes 4×4, 8×8, 16×16, and 32×32
 Integer approximations of the DCT-II transform matrix
• Additionally, integer approximation of 4×4 DST-VII transform matrix
• ’Single-norm’ design per transform block size → simple quantizer implementation
• Not all perfectly orthogonal, leakage below normalization threshold
HEVC Core Transforms

• Quantizer step size Δq derived from quantization parameter QP
• Exponential relation of quantizer step sizes
• Double step size every 6 QP
$\Delta_q(\mathrm{QP}+1) = \sqrt[6]{2} \cdot \Delta_q(\mathrm{QP})$
• Definition: Δq = 1 for QP = 4, thereby
$\Delta_{q,0} = \left\{ 2^{-4/6},\ 2^{-3/6},\ 2^{-2/6},\ 2^{-1/6},\ 1,\ 2^{1/6} \right\}$
• Quantizer step sizes for given QP
$\Delta_q(\mathrm{QP}) = \Delta_{q,0}(\mathrm{QP} \bmod 6) \cdot 2^{\lfloor \mathrm{QP}/6 \rfloor}$
Quantizer Implementation
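The step-size rule on this slide (six base values for QP mod 6, doubled for every full multiple of 6, with a step size of 1 at QP = 4) can be sketched in floating point as follows. Real implementations use integer scaling tables instead; the function name is made up for illustration.

```python
def quantizer_step(qp):
    """Quantizer step size: six base values for QP mod 6, doubled every 6 QP."""
    base = [2 ** (k / 6) for k in range(-4, 2)]   # table for QP mod 6 = 0..5
    return base[qp % 6] * 2 ** (qp // 6)
```

Only six values need to be stored; all other step sizes follow from a shift by QP div 6, which is what keeps the quantizer implementation simple.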

• Large block-size transforms with high-frequency zeroing
 Maximum transform size up to 128 × 128
 Coefficients with column / row index > 32 set to 0
if
 Block width > 64
 Block height > 64, respectively
• Adaptive multiple core transform (AMT)
 Transform matrices quantized more accurately
 Applicable for block sizes ≤ 64 × 64
 Indicated by CU flag
 Mode-dependent transform-set selection
for intra prediction modes
JEM Transforms
Tables from: JVET-F1001: Algorithm Description of Joint Exploration Test Model 6.

• Motivation
 Remaining correlation between coefficients after primary transform!
 Dependency on intra prediction mode!
• Approach: mode dependent transforms (have been studied as tool for HEVC)
• MDNSST Structure:
 35×3 non-separable secondary transforms for both 4×4 and 8×8 block size
 3 NSST candidates for each intra prediction mode
 Application of transposed transform blocks for modes > 34
JEM: Mode-Dependent Non-separable Secondary Transforms (MDNSST)

• Only applied to the low frequency coefficients after the primary transform
 For blocks ≥ 8 × 8, application of 8 × 8 transform to lowest frequency coefficients of primary transform
 For blocks < 8 × 8, application of 4 × 4 transform to lowest frequency coefficients of primary transform
• Implementation by Hypercube-Givens Transform (HyGT)
• Two rounds for 4 × 4, four rounds for 8 × 8 secondary transforms
JEM: Mode-Dependent Non-separable Secondary Transforms (MDNSST)

• Searching 𝑁 similar patches in reconstructed region of picture, based on template
• Scheme of KLT matrix derivation:
 Collection of N prediction residuals: $\mathbf{U} = (\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_N)$
 Covariance matrix $\Sigma = \mathbf{U}\mathbf{U}^T$
 Eigenvectors are KLT bases
• Application of proposed KLT on 4×4, 8×8, 16×16 and 32×32 coding blocks
• Note: Tool not activated in JVET Common Testing Conditions [JVET-G1010]
JEM: Signal dependent transform

Loop Filtering

• HEVC deblocking filter also used in JEM
 Filtering at prediction and transform
block edges on an 8 × 8 grid
 Independent operation on 8 × 8 blocks
possible → parallel processing enabled
• Deblocking filtering
 Boundary processed in 4-sample sections (edges)
 Filter strength determined based on analysis of top
and bottom rows of edge
 Normal: Filtering of maximum two samples into block
 Strong: Up to four samples into block
Deblocking Filter

• HEVC SAO filtering also used in JEM
• Local processing of samples
 Depending on local neighborhood (edge offset)
 Direction signaled, smoothing only
 Depending on sample value (band offset)
 Configurable correction of sample intensity
values for four transition bands
• Operation independent of processed samples
→ parallel processing
• Local filter parameter adaptation
• Four different offset values available (plus SAO off)
• Dedicated SAO parameters for Y, Cb, Cr
 Common SAO mode for chroma components
Sample Adaptive Offset Filter (SAO)

• First loop filter in the decoding process chain of JEM
• Each luma sample in reconstructed TU is replaced by weighted average of itself and its neighbours within TU
 sample located at (𝑖, 𝑗), neighbouring sample at (𝑘, 𝑙)
 𝐼(𝑖, 𝑗 ) and 𝐼(𝑘, 𝑙): reconstructed intensity value
 𝜎 𝑑: spatial parameter (transform size, pred.mode)
 𝜎𝑟: range parameter (QP)
$\omega(i,j,k,l) = \exp\left( -\frac{(i-k)^2 + (j-l)^2}{2\sigma_d^2} - \frac{\left| I(i,j) - I(k,l) \right|^2}{2\sigma_r^2} \right)$
$I_F(i,j) = \frac{\sum_{k,l} I(k,l) \cdot \omega(i,j,k,l)}{\sum_{k,l} \omega(i,j,k,l)}$
 Integer implementation with look-up table
for division
JEM: Bilateral filter
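A floating-point sketch of the bilateral weighting for one sample is given below. It assumes a plus-shaped neighbourhood (the sample itself and its four direct neighbours inside the block) and free parameters `sigma_d` and `sigma_r`; the integer look-up-table implementation mentioned on the slide is not modelled, and all names are illustrative.

```python
import math

def bilateral_weight(i, j, k, l, I, sigma_d, sigma_r):
    """Weight of neighbour (k, l) for sample (i, j): spatial and intensity distance."""
    spatial = ((i - k) ** 2 + (j - l) ** 2) / (2 * sigma_d ** 2)
    intensity = (I[i][j] - I[k][l]) ** 2 / (2 * sigma_r ** 2)
    return math.exp(-spatial - intensity)

def bilateral_filter_sample(I, i, j, sigma_d, sigma_r):
    """Weighted average of sample (i, j) and its direct neighbours inside the block."""
    num = den = 0.0
    for k, l in [(i, j), (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]:
        if 0 <= k < len(I) and 0 <= l < len(I[0]):
            w = bilateral_weight(i, j, k, l, I, sigma_d, sigma_r)
            num += I[k][l] * w
            den += w
    return num / den

block = [[10, 10, 10],
         [10, 50, 10],
         [10, 10, 10]]
filtered_centre = bilateral_filter_sample(block, 1, 1, sigma_d=1.0, sigma_r=20.0)
```

The intensity term makes the filter edge-preserving: neighbours whose values differ strongly from the centre sample contribute very little to the average.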

• Luma component
 25 filters available for each 2×2 block, based on direction and activity of local gradients
 Diamond filter shapes (3 × 3, 5 × 5, 7 × 7)
 Classification into 25 classes, based on
 Activity index
 Directionality index
• Chroma components
 Diamond filter shape 5 × 5
 No classification
 Single set of filter coefficients
• Geometric transformations based on data from classification
 Transpose, vertical flip, rotation
• Filter coefficients signaled with 1st CTU, FIFO buffering for temporal prediction in inter pictures, 16 candidate
sets for intra pictures
JEM: Adaptive loop filter (ALF)

Entropy Coding

• Fixed length and variable length codes (FLC, VLC)
 High-level syntax
 Parameter sets, slice segment header
 SEI messages
 Fixed-length codes, Exp-Golomb codes
• Arithmetic coding
 Slice level, CTUs
 Context-based adaptive coding
 Bypass coding (complexity, throughput)
Entropy Coding
CTU = Coding Tree Unit
SEI = Supplemental Enhancement Information
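As an example of the variable length codes used in the high-level syntax, the unsigned Exp-Golomb code can be written in a few lines. The sketch works on bit strings for readability; the function names are made up for illustration.

```python
def ue_encode(value):
    """Unsigned Exp-Golomb code of a non-negative integer as a bit string."""
    code = bin(value + 1)[2:]                  # binary representation of value + 1
    return "0" * (len(code) - 1) + code        # prefix of leading zeros, then the code

def ue_decode(bits, pos=0):
    """Decode one value starting at bit position pos; return (value, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2) - 1, end
```

Small values get short codewords (0 is coded as a single bit), which suits syntax elements whose values are usually small.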

• VCL NAL Unit
 FLC, VLC for header information
 CABAC for CTUs
 Byte alignment in case of multiple tiles, or with wavefront parallel
processing (not present otherwise)
Fixed and Variable Length Coding
NAL = Network Abstraction Layer
VCL = Video Coding Layer
CABAC = Context-based Adaptive Binary Arithmetic Coding
ba = byte alignment

• Arithmetic coding engine
 Binarization
 Context model selection
 Binary arithmetic coding
 Optimized binarization design
 Reduced number of non-bypass
bins compared to H.264 | AVC
• JEM
 Modified context modeling for transform coefficients
 Multi-hypothesis probability estimation with context-dependent updating speed
 Adaptive initialization for context models
Context-Based Adaptive Binary Arithmetic Coding (CABAC)

Part IV: Versatile Video Coding

• Experimental software “Joint Exploration Model” (JEM) developed by JVET
 Intended to investigate potential for better compression beyond HEVC
 Initially started by extending the HEVC software with additional compression tools or replacing existing tools
(see previous section)
• Substantial benefit was shown over HEVC, both in subjective quality and objective metrics
 Proven in "Call for Evidence" (July 2017)
 JEM was however not designed to become a standard (regarding all design tradeoffs)
 Call for Proposals was issued by MPEG and VCEG (October 2017)
• Call for Proposals very successful (responses received by April 2018)
 32 companies in 21 proponent groups responded
 46 category-specific submissions: 22 in SDR, 12 each in HDR and 360° video
 All responses clearly better than HEVC, some evidently better than JEM
 This marked the starting point for VVC development
Steps towards next generation standard – Versatile Video Coding (VVC)

• Document JVET-H1002
• Test categories
 Standard dynamic range (SDR): 5 UHD and 5 HD sequences
 High dynamic range (HDR): 3 HLG and 5 PQ sequences
 360° video (360): 5 sequences in ERP format
• Constraint sets
 Constraint set 1 (C1): Random access configuration
 Max 1.1s random access intervals, structural delay max 16 pictures
 Constraint set 2 (C2): Low delay configuration only evaluated for SDR HD sequences
 No picture reordering between input and output
• Encoding constraints
 No pre-processing, post-processing only within the coding loop
 Static quantizer setting with one-time change to meet target bitrate
 Relevant optimization methods to be reported
Joint Call for Proposals (CfP) on Video Compression with Capability beyond HEVC
UHD = Ultra High Definition, HD = High Definition, HLG = Hybrid Log Gamma, PQ = Perceptual Quantization (ITU-R BT.2100), ERP = Equirectangular Projection

• SDR-A: 3840×2160
 FoodMarket4 60p, CatRobot1 60p, DaylightRoad2 60p, ParkRunning3 50p, Campfire 30p
• SDR-B: 1920×1080
 BasketballDrive 50p, Cactus 50p, BQTerrace 60p, RitualDance 60p, MarketPlace 60p
• HDR (PQ HD, HLG 4K)
 PQ: Market3 HD50p, Hurdles HD50p, Starting HD50p, ShowGirls2 HD25p, Cosmos1 HD24p
 HLG: DayStreet 60p, PeopleInShop..., SunsetBeach 60p
• 360 Video (8K, 6K)
 ChairliftRide 30p, KiteFlite 30p, Harbor 30p, Trolley 30p, Balboa 60p
VVC CfP Test Sequences

• Category-specific submissions (total 46):
 SDR: 22 submissions (8 of which are registered only in this category)
 HDR: 12 submissions
 360°: 12 submissions (2 of which are registered only in this category)
For all categories: HEVC anchors (HM) and JEM anchors
• Proposals
 Described in JVET input documents JVET-J0011...JVET-J0033
 Participation of 32 institutions
VVC CfP Responses
JVET documents available at http://phenix.it-sudparis.eu/jvet

93
• Submissions had to provide coded/decoded sequences
 4 rate points each, two constraint conditions "low delay" (LD) and "random access" (RA)
 SDR: 5x HD (both LD and RA), 5x UHD-4K (only RA)
 HDR: 5x HD (PQ grading), 3x UHD-4K (HLG grading)
 360°: 5 sequences 6K/8K for the full panorama
• Double stimulus test with two hidden anchors HEVC-HM & JEM
 Rate points were defined such that the lowest rate typically gave less than "fair" quality for HEVC, but was still possible to code
 Quality was judged to be distinguishable when confidence intervals were non-overlapping
• Evaluation: Three ways of judging benefit:
 Mean MOS over all test cases (28x4 test points: 23x4 C1, 5x4 C2 )
 Count cases where a proposal was visually better/worse than JEM
 Count cases where a proposal was visually better than HEVC (HEVC at higher rate point)
• Reports: Input subjective test [JVET-J0080], output CfP results [JVET-J1003]
Performance

94
• Measured by objective performance (PSNR), best performers report >40% bit rate reduction compared
to HEVC, >10% compared to JEM (for SDR case)
 Similar ranges for HDR and 360°
 Obviously, proposals with more elements show better performance
 Some proposals showed similar performance as JEM with significant complexity/run time reduction
 2 proposals used some degree of subjective optimization, not measurable by PSNR
• Results of subjective tests generally show similar (or even better) tendency
 Benefit over HEVC very clear
 Benefit over JEM visible at various points
 Proposals with subjective optimization also showing benefit in some cases
Performance

95
• JVET-J1003: Report of subjective evaluation contains 28 plots as shown, one per sequence
• Count significant cases of positive/negative benefit with non-overlapping confidence interval against JEM
Performance
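[Figure: example plot for one sequence, proposals ranked by MOS (per rate point) with HM and JEM anchors marked; a significantly better case gives +1 credit, a significantly worse case -1 credit]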
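The significance count described above can be sketched as follows. This is a minimal illustration, assuming each test point is summarised by a mean MOS and the half-width of its confidence interval; the function names and the sample numbers are hypothetical and not taken from JVET-J1003.

```python
def significance_credit(proposal, anchor):
    """Return +1, -1 or 0 for one test point.

    Each argument is a pair (mean MOS, half-width of the confidence interval).
    A credit is only given when the two intervals do not overlap.
    """
    p_mean, p_ci = proposal
    a_mean, a_ci = anchor
    if p_mean - p_ci > a_mean + a_ci:
        return 1   # proposal significantly better than the anchor
    if p_mean + p_ci < a_mean - a_ci:
        return -1  # proposal significantly worse than the anchor
    return 0       # overlapping intervals: not distinguishable


def significance_count(proposal_points, anchor_points):
    """Sum the credits over all test points (sequences x rate points)."""
    return sum(significance_credit(p, a)
               for p, a in zip(proposal_points, anchor_points))


# Hypothetical MOS values for four rate points of one sequence.
proposal = [(6.5, 0.4), (7.2, 0.3), (5.0, 0.5), (4.0, 0.3)]
jem = [(5.8, 0.2), (7.0, 0.4), (5.2, 0.4), (4.8, 0.4)]
print(significance_count(proposal, jem))
```

With 15 SDR test cases at 4 rate points each, such a count can range from -60 to +60.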

96
• "Mean" and "significance-count" method suggested at least 7 proposals that were obviously better than JEM
Performance SDR
Mean MOS      Significance vs. JEM (range -60 ... +60)
Pxx 6.53      Pxx 10
Pxx 6.46      Pxx 8
Pxx 6.41      Pxx 8
Pxx 6.37      Pxx 6
Pxx 6.33      Pxx 6
Pxx 6.33      Pxx 6
Pxx 6.26      Pxx 6
Pnn 6.23      Pnn 3
Pnn 6.17      Pnn 3
Pnn 6.15      Pnn 2
Pnn 6.13      Pnn 2
Pnn 6.11      Pnn 1
Pnn 6.04      Pnn 1
Pnn 6.04      JEM 0
Pnn 6.03      Pnn 0
Pnn 6.03      Pnn -1
Pnn 6.01      Pnn -1
JEM 6.01      Pnn -1
Pnn 6.00      Pnn -2
Pnn 5.96      Pnn -2
Pnn 5.94      Pnn -2
Pnn 5.88      Pnn -3
Pnn 5.86      Pnn -4
HM  4.57      HM  -36

97
• Similar tendency in HDR and 360° categories
• Mostly same coding tools as in SDR provide good benefit
Performance HDR / 360°
HDR
Mean MOS      Signif. vs. JEM (range -32 ... +32)
Pxx 6.04      Pxx 7
Pxx 6.00      Pxx 3
Pxx 5.94      Pxx 2
Pxx 5.93      Pxx 2
Pxx 5.86      Pxx 2
Pnn 5.85      Pnn 1
Pnn 5.80      Pnn 1
Pnn 5.67      JEM 0
JEM 5.62      Pnn 0
Pnn 5.60      Pnn 0
Pnn 5.59      Pnn -1
Pnn 5.45      Pnn -1
Pnn 5.11      Pnn -6
HM  4.14      HM  -20

360°
Mean MOS      Signif. vs. JEM (range -20 ... +20)
Pxx 6.20      Pxx 9
Pxx 6.19      Pxx 9
Pxx 6.06      Pxx 8
Pxx 6.03      Pnn 7
Pxx 5.99      Pxx 7
Pxx 5.96      Pxx 6
Pxx 5.86      Pxx 5
Pnn 5.69      Pxx 4
Pnn 5.67      Pnn 2
Pnn 5.51      Pnn 1
Pnn 5.45      Pnn 1
JEM 5.11      JEM 0
HM  3.79      HM  -9
Pnn 3.45      Pnn -12

98
• How often are best performing proposals better than HEVC at higher rate?
• Note: R11 Mbit/s; R2 1.6 Mbit/s; R3 2.8 Mbit/s; R4 4.6 Mbit/s
Performance compared to HEVC
Pbest vs HM   R1 vs R2  R1 vs R3  R1 vs R4  R2 vs R3  R2 vs R4  R3 vs R4
SDR UHD       60%       40%       0%        80%       0%        20%
SDR HD/RA     40%       0%        0%        20%       0%        20%
SDR HD/LD     40%       0%        0%        0%        0%        0%
HLG           67%       0%        0%        67%       0%        33%
PQ            40%       0%        0%        40%       0%        20%
360°          40%       20%       0%        20%       0%        60%
Rate saving   ≈37.5%    ≈65%      ≈78%      ≈43%      ≈65%      ≈39%
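The "Rate saving" row follows directly from the rate points given in the note. A small sketch of that arithmetic, assuming the rate points are the approximate values 1, 1.6, 2.8 and 4.6 Mbit/s (the dictionary and function names are illustrative only):

```python
# Approximate rate points in Mbit/s, as given in the note on the slide.
RATES = {"R1": 1.0, "R2": 1.6, "R3": 2.8, "R4": 4.6}


def rate_saving(lower, higher):
    """Percentage of bit rate saved when coding at `lower` instead of `higher`."""
    return 100.0 * (1.0 - RATES[lower] / RATES[higher])


pairs = [("R1", "R2"), ("R1", "R3"), ("R1", "R4"),
         ("R2", "R3"), ("R2", "R4"), ("R3", "R4")]
for lower, higher in pairs:
    print(f"{lower} vs {higher}: {rate_saving(lower, higher):.1f}%")
```

This yields about 37.5%, 64%, 78%, 43%, 65% and 39% for the six column pairs.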

99
• How often is HEVC better than best performing proposals at lower rate?
- Note: (100 - xx)% means that best performing proposal is equal or better
• Note: R1 ≈ 1 Mbit/s; R2 ≈ 1.6 Mbit/s; R3 ≈ 2.8 Mbit/s; R4 ≈ 4.6 Mbit/s
HM vs Pbest   R1 vs R2  R1 vs R3  R1 vs R4  R2 vs R3  R2 vs R4  R3 vs R4
SDR UHD       0%        0%        60%       0%        0%        0%
SDR HD/RA     0%        60%       100%      0%        80%       0%
SDR HD/LD     0%        60%       80%       0%        80%       0%
HLG           0%        0%        100%      0%        67%       0%
PQ            0%        60%       100%      0%        60%       0%
360°          0%        40%       80%       0%        40%       0%
Rate saving   ≈37.5%    ≈65%      ≈78%      ≈43%      ≈65%      ≈39%

100
• The subjective quality of best performing proposals is always equal or sometimes better (~1/3 of cases) than
HEVC at next higher rate point, over all categories (with approx. 40% less rate)
• The subjective quality of best performing proposals is always equal or sometimes better (~1/5 of cases) than
HEVC at 2nd higher rate point, in SDR-UHD category (with approx. 65% less rate)
• Though it is not always the same proposal that performs best at a given rate point, it can be anticipated that
merits of different proposals can be combined
• 50% (or more) bit rate reduction with same quality will probably be achievable by the new standard

101
• New elements (some come with high complexity):
 Decoder side estimation for mode/MV derivation and sample prediction both in intra and inter coding (JEM)
 Finer partitioning: Asymmetric, geometric
 Neural networks for prediction, loop filtering, upsampling, (encoder control)
 Additional elements using template matching
 Intra block copy / current picture referencing
 Additional non-linear, de-noising and statistics-based loop filters
 Additional linear and non-linear elements in prediction
• HDR specific:
 New adaptive reshaping and quantization, also in-loop
 HDR-specific modifications of existing tools, e.g. deblocking
• 360-video specific:
 Variants of projection formats, geometry-corrected face boundary padding
 Modification and disabling of existing tools at face boundaries
CfP analysis: What was proposed?

102
• VVC Working Draft 1 / Test Model 1 (VTM1): basic approach
built on "reduced HEVC" starting point
• VTM Block structure
 Unified tree (coding block unites prediction and transform)
 CTU size 128x128, rectangular blocks (dyadic sizes),
smallest luma size 4x4
 Maximum transform size 64x64
• VTM: Some removed elements of HEVC:
 Mode dependent transform (DST-VII), mode dependent scan
 Strong intra smoothing
 Sign data hiding in transform coding
 Unnecessary high-level syntax (e.g. VPS)
 Tiles and wavefront
 Quantization weighting
VVC Working Draft and Test Model 1

103
• Report of Results from the Call for Proposals on Video Compression with Capability beyond HEVC
[JVET-J1003]
 Documentation of results per sequence, marking HM and JEM anchors, not identifying individual proponents
 Assessment of qualitative (and as far as possible quantitative) benefit of submitted technology compared to
anchors
• Working Draft 1 of Versatile Video Coding [JVET-J1001]
 "Reduced" HEVC plus quad/binary/ternary tree structure
• Test Model 1 of Versatile Video Coding (VTM 1) [JVET-J1002]
 Corresponding encoder and algorithm description
Documents issued after CfP Results

104
• Benchmark Set (BMS) was defined in addition to VTM, including the following well-known JEM tools:
• 65 intra prediction modes
• Coefficient coding
• AMT + 4x4 NSST
• Affine motion
• Geometry based adaptive loop filter
• Subblock merge candidate (ATMVP)
• Adaptive motion vector precision
• Decoder motion vector refinement
• LM Chroma mode
• Purpose: Testing benefit of technology against better performing set
 Holding extra potential features we aren’t so sure about yet
 Superset of VTM; should have significant gain over the VTM
 Unveils in CEs whether gains are independent, or how much gain remains when a tool is combined with a
set of more performant tools
 Can be a common basis for further CE tests of modified versions of features
 Not necessarily ultra-low complexity, but encoder needs to be runnable in reasonable amount of time
Benchmark Set and its role

105
• The only fundamental new element of version 1
• Simple multi-type tree split, can be alternated
Quad/binary/ternary partitioning
Example:
Figures from: JVET-J1001
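The quad/binary/ternary splits can be illustrated with a few lines of code. This is a sketch only: the mode names are hypothetical labels, blocks are plain (x, y, width, height) tuples, and the ternary split is assumed to use the 1:2:1 ratio; none of this is the syntax of JVET-J1001.

```python
def split_block(x, y, w, h, mode):
    """Split a block (x, y, w, h) into sub-blocks for one split mode.

    Hypothetical mode labels: "QT" (quad), "BT_H"/"BT_V" (binary with a
    horizontal/vertical split line), "TT_H"/"TT_V" (ternary, ratio 1:2:1).
    """
    if mode == "QT":
        hw, hh = w // 2, h // 2
        return [(x, y, hw, hh), (x + hw, y, hw, hh),
                (x, y + hh, hw, hh), (x + hw, y + hh, hw, hh)]
    if mode == "BT_H":
        return [(x, y, w, h // 2), (x, y + h // 2, w, h // 2)]
    if mode == "BT_V":
        return [(x, y, w // 2, h), (x + w // 2, y, w // 2, h)]
    if mode == "TT_H":
        q = h // 4
        return [(x, y, w, q), (x, y + q, w, 2 * q), (x, y + 3 * q, w, q)]
    if mode == "TT_V":
        q = w // 4
        return [(x, y, q, h), (x + q, y, 2 * q, h), (x + 3 * q, y, q, h)]
    raise ValueError(f"unknown split mode: {mode}")


# A 128x128 CTU: quad split first, then a vertical ternary split of one quadrant.
quadrants = split_block(0, 0, 128, 128, "QT")
print(split_block(*quadrants[0], "TT_V"))
```

Because every split returns ordinary blocks, the split types can be alternated recursively, which is what the multi-type tree does.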

106
• PSNR-based Common Test Conditions (CTC) BD-Rate savings relative to HEVC reference software (10 bit)
• Note that gain over HEVC with CTC is lower than with CfP test set (other sequences, higher rates, lower resolutions)
Performance of VTM1 and initial BMS compared to HEVC
vs HM16.18     VTM     BMS
4k UHD         10%     28%
1080p          8%      22%
WVGA           6%      19%
Average        8%      23%
Decode time    0.8×    2×
Encode time    2×      9×
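For readers unfamiliar with BD-rate, the following sketch shows the idea behind the numbers in the table: the average difference in log bit rate between two rate-distortion curves over their common PSNR range. It is a simplification that uses piecewise-linear interpolation, whereas the commonly used Bjøntegaard computation fits a polynomial; the function names and the sample points are hypothetical.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation of ys over strictly ascending xs."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x outside the measured PSNR range")


def _mean_log_rate(points, lo, hi, steps=1000):
    """Average of log(rate) over PSNR in [lo, hi], trapezoidal rule."""
    pts = sorted(points, key=lambda p: p[1])
    psnr = [p[1] for p in pts]
    log_rate = [math.log(p[0]) for p in pts]
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = min(hi, lo + k * h)  # guard against rounding past the last point
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * _interp(x, psnr, log_rate)
    return total / steps


def bd_rate(anchor, test):
    """Average rate difference of `test` vs `anchor` in percent.

    Both arguments are lists of (rate, PSNR) pairs. A negative result means
    that `test` needs less rate for the same PSNR.
    """
    lo = max(min(p[1] for p in anchor), min(p[1] for p in test))
    hi = min(max(p[1] for p in anchor), max(p[1] for p in test))
    diff = _mean_log_rate(test, lo, hi) - _mean_log_rate(anchor, lo, hi)
    return (math.exp(diff) - 1.0) * 100.0


# Hypothetical curves: the test codec needs 20% less rate at every PSNR.
anchor_curve = [(1000, 32.0), (1600, 35.0), (2800, 38.0), (4600, 40.5)]
test_curve = [(0.8 * rate, quality) for rate, quality in anchor_curve]
print(f"{bd_rate(anchor_curve, test_curve):.1f}%")
```

The table reports savings as positive percentages, so a value of -20% from this function corresponds to a 20% entry there.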

107
• Working Draft 2 of Versatile Video Coding [JVET-K1001]
 Normative text specification
 No descriptive text of building blocks "borrowed" from HEVC: These would anyway be placeholders which
are likely to be replaced later
 Starting from this meeting, precise specification of more substantial newly adopted building blocks is being
added (see subsequent slides)
• Test Model 2 of Versatile Video Coding (VTM 2) [JVET-K1002]
 Encoder and algorithm description
 Has corresponding software implementation
Latest status (from last week)

108
• QT/BT/TT no longer “placeholder”
• Remove unnecessary partitioning restrictions
• Implicit splitting at picture boundaries
• Separate trees for intra slices
• Position Dependent Prediction Combination
• Cross Component Linear Model
• 87 intra modes (wide angles included), 3 MPM, TU binarization
• Affine MC (4x4 fixed subblock size, 4/6 parameter model switching at CU level)
• Affine MV coding
 list construction contains inheritance and derivation spatial/temporal
 improved difference coding
• Adaptive motion vector resolution (AMVR)
• Subblock MC (4x4) from ATMVP merge, 8x8 granularity motion vector storage [High precision]
Latest status (from last week): New elements of WD2 / VTM2
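The affine MC item above (4x4 subblocks, 4-parameter model) can be illustrated as follows. The sketch assumes the commonly used 4-parameter model with control-point motion vectors at the top-left and top-right block corners, evaluated at subblock centres in floating point; the function name is hypothetical and the fixed-point details of WD2 are not modelled.

```python
def affine_subblock_mvs(width, height, mv0, mv1, sub=4):
    """Derive one motion vector per sub x sub subblock from two control points.

    mv0 and mv1 are the (x, y) motion vectors at the top-left and top-right
    corners of the block. Returns a dict keyed by the subblock's top-left
    position.
    """
    a = (mv1[0] - mv0[0]) / width  # scaling component
    b = (mv1[1] - mv0[1]) / width  # rotation component
    mvs = {}
    for y0 in range(0, height, sub):
        for x0 in range(0, width, sub):
            cx, cy = x0 + sub / 2, y0 + sub / 2  # subblock centre
            mvs[(x0, y0)] = (mv0[0] + a * cx - b * cy,
                             mv0[1] + b * cx + a * cy)
    return mvs


# A 16x16 block whose top-right corner moves 4 samples further right (zoom).
field = affine_subblock_mvs(16, 16, (0.0, 0.0), (4.0, 0.0))
print(field[(0, 0)], field[(12, 12)])
```

When both control points carry the same vector, every subblock receives that vector, so ordinary translational motion is a special case.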

109
• Multiple transform selection (all are DCT/DST types) for intra and inter
• Increase max QP from 51 to 63
• Modified entropy coding supporting dependent quantization
• Sign data hiding reinvoked from HEVC
• Adaptive loop filter
 4x4 classification based (gradient strength & orientation) for luma
 7x7 luma, 5x5 chroma filters
 enabling flag at CTU level
• Basic high-level syntax (SPS, PPS, slice)
• Update of BMS contains
 generalized Bi prediction (kind of local weighted prediction)
 Decoder-side estimation: BIO, simplified bilateral matching
 Current picture referencing (aka intra block copy)
Latest status (from last week): New elements of WD2 / VTM2
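To give a feeling for the extended QP range mentioned above, the sketch below evaluates the quantizer step size under the assumption that the HEVC relation, where the step size doubles every 6 QP values and equals 1 at QP 4, carries over unchanged. The slides do not state this relation, so it is an assumption here.

```python
def quantizer_step(qp):
    """Quantizer step size, assuming the HEVC relation Qstep = 2^((QP - 4) / 6)."""
    return 2.0 ** ((qp - 4) / 6.0)


for qp in (4, 51, 63):
    print(qp, round(quantizer_step(qp), 1))

# Twelve additional QP values correspond to two doublings of the step size.
print(quantizer_step(63) / quantizer_step(51))
```

Under that assumption, raising the maximum QP from 51 to 63 makes the coarsest quantizer four times coarser, which extends the operating range towards lower bit rates.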

110
• For rectangular blocks, prediction directions with angles beyond 45/135 degrees are reasonable
• This can be implemented by adding modes at both ends
• VTM2 uses a total of 85 directional intra modes now
(plus DC and planar)
Wide angular modes
Figures from JVET-K0500

111
• Alternating between two quantizers based on a state transition rule allows selecting an optimum
sequence of reconstruction values (e.g. by trellis-like search)
• Decoder needs to implement the sequential state transition rule
• CABAC contexts need to be modified as well for this case
(greater than 0/1/2/... would have different meaning depending on Q0/Q1)
Dependent quantization
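[Figure: state machine with four states 0 to 3 and a marked start state; states 0 and 1 use quantizer Q0, states 2 and 3 use quantizer Q1; transitions depend on the parity (k & 1) of the current level k]
current state    next state for (k & 1) == 0    next state for (k & 1) == 1
0                0                              2
1                2                              0
2                1                              3
3                3                              1
[Figure: reconstruction levels on an axis from -9Δ to 9Δ; Q0 places the indices -4 ... 4 at the even multiples of Δ (subsets A and B), Q1 places the indices -5 ... 5 at the odd multiples of Δ and at zero (subsets C and D)]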
Figures from JVET-K0071
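The decoder side of dependent quantization can be sketched from the state transition table above. The sketch assumes that decoding starts in state 0, that states 0 and 1 select Q0 and states 2 and 3 select Q1, and that the reconstruction levels are 2kΔ for Q0 and (2k - sgn(k))Δ for Q1, as read from the figure; the function names are hypothetical.

```python
# next state = NEXT_STATE[current state][parity of the level k]
NEXT_STATE = {0: (0, 2), 1: (2, 0), 2: (1, 3), 3: (3, 1)}


def sign(k):
    """Return -1, 0 or +1 according to the sign of k."""
    return (k > 0) - (k < 0)


def dequantize(levels, delta, state=0):
    """Reconstruct values from quantization indices in coding order."""
    values = []
    for k in levels:
        if state < 2:
            values.append(2 * k * delta)              # Q0: even multiples of delta
        else:
            values.append((2 * k - sign(k)) * delta)  # Q1: odd multiples and zero
        state = NEXT_STATE[state][k & 1]              # sequential state update
    return values


print(dequantize([1, 1, 2, 0, -1], 1.0))
```

Because the quantizer used for a coefficient depends on the parities of all previous levels, the encoder cannot decide each level in isolation, which is why a trellis-like search is mentioned.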

112
• Ongoing investigations on
 Improved merge, intra prediction, etc.
 Decoder-side estimation with low complexity
 Multi-hypothesis prediction and OBMC
 Diagonal and other geometric partitioning
 Secondary transforms
 New approaches of loop filtering, reconstruction and prediction filtering
(denoising, non-local, diffusion based, bilateral, etc.)
 Current picture referencing, template matching, palette mode
 Neural networks for loop filtering and prediction
• Core experiments (CE) process
 coordinated effort to investigate performance, complexity impact of proposed elements
 typically based on a specific technology proposed, or combination of several technologies
 allows detailed study / cross-checks by other interested parties
 allows identifying which elements of a proposal are useful, if it is not useful at all, or if further improvements
are needed
Further promising fields

113
• Motivation: Towards object-oriented coding
 Follow object boundaries more closely
 Less coding artifacts where it matters
• Prediction, transform and coding driven by actual object
shape under RD-constraint
 Inter- and intra-predicted segments for handling of
disocclusions
 Overlapped wedge based filtering at partition boundary
 Shape-adaptive DCT for spatially localized transform
coding
Geometric Partitioning (GEO)
Source: M. Bläser, J. Sauer, and M. Wien, “Description of SDR and 360° video coding technology proposal
by RWTH Aachen University,” Doc. JVET-J0023, Joint Video Experts Team of ITU-T VCEG and ISO/IEC MPEG, San Diego, USA, 10th meeting, Apr. 2018

114
• GEO available for all block sizes ≥ 8×8 luma samples
• Partitioning is represented by two coordinate points 𝑃0 and 𝑃1 on the block boundary
• Prediction of two coordinate points 𝑃0 and 𝑃1 from 16 pre-defined templates (scaled for non-square blocks)
 Alternative: Spatial or temporal prediction
 Refinement: block size dependent offset
• Integration with AMVP, MERGE, FRUC
(no AFFINE (yet))
GEO: Partitioning Coding and Prediction
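How two boundary points define a partition can be shown with a short sketch. It assumes a hard binary mask obtained from the side of the line P0-P1 on which each sample centre lies; the overlapped filtering at the partition boundary, the template prediction and the offset refinement of the actual proposal are not modelled, and the function name is hypothetical.

```python
def geo_mask(width, height, p0, p1):
    """Binary partition mask for a block split by the line through p0 and p1.

    p0 and p1 are (x, y) points on the block boundary. A sample gets 1 when its
    centre lies on the positive side of the directed line p0 -> p1, else 0.
    """
    (x0, y0), (x1, y1) = p0, p1
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # Sign of the 2-D cross product tells the side of the line.
            cross = (x1 - x0) * (y + 0.5 - y0) - (y1 - y0) * (x + 0.5 - x0)
            row.append(1 if cross > 0 else 0)
        mask.append(row)
    return mask


# 8x8 block split along the main diagonal.
for row in geo_mask(8, 8, (0, 0), (8, 8)):
    print("".join(str(v) for v in row))
```

Each of the two segments can then be predicted separately, for example one inter-predicted and one intra-predicted as described for disocclusions.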

115
Results for GEO
JEM 7.0 JEM 7.0 + GEO
• Visual improvements at object boundaries
 Sharper contours
 Less staircase-effect
 More background details
• Objective gains (BD-rate savings)
 Against HEVC: ~33% on C1, ~25% on C2
 Against JEM: ~0.8% for both, C1 and C2
JEM 7.0

117
• CE1: Partitioning
• CE2: Adaptive loop filter
• CE3: Intra prediction and mode coding
• CE4: Inter prediction and MV coding
• CE5: Arithmetic coding engine
• CE6: Transforms and transform signalling
• CE7: Quantization and coefficient coding
• CE8: Current picture referencing
• CE9: Decoder side MV derivation
• CE10: Combined and multi-hypothesis prediction
• CE11: Deblocking
• CE12: Mapping for HDR content
• CE13: Coding tools for omnidirectional video
• CE14: Post-reconstruction filtering
• CE15: Palette mode
Current Core Experiments

118
• Technically similar elements to HEVC/JEM/VVC or JVET study
 Partitioning: 128x128 "superblock" with equivalent to quad/binary sub-splits (no 1:2:1 ternary)
 Directional intra prediction, 56 directional modes, DC and "true motion" mode
 Chroma from luma prediction
 Intra block copy
 Up to 7 reference frames (allows similar structure to hierarchical B)
 Spatial/temporal motion vector referencing
 Affine motion compensation (pixel based)
 OBMC
 DCT/DST based transforms, and skip
 Adaptive arithmetic coder
 Context-based transform coefficient coding
 Film grain synthesis
 Adaptive loop filter (Wiener like)
 Deblocking
AOM's AV1

119
• Other elements
 Recursive-filtering intra predictor
 Prediction based on color palette
 Wedge-based prediction, 16 diagonal/asymmetric modes for square/rectangular blocks, similar to GEO
 Difference-modulated prediction (based on difference between two references)
 Contrast enhancement/deringing loop filter
 Self-guided filter (somewhat similar to bilateral & diffusion filters)
 Super-resolution coding mode (with coding at lower res.)
• Performance
 Owners report 20% average bit rate reduction (PSNR based)
compared to X.265-style HEVC encoder, set of full HD sequences
 Other reports indicate much less gain, or even losses compared
to HM encoder (using sequences from JVET's CTC)
 According to the same reports, JEM performs significantly better than AV1
 Some of those may not have used the newest JEM version, though
AOM's AV1

Part V: Exploratory trends and perspectives

121
• PSNR mostly used for video quality assessment
 targeting Pixel fidelity which does not necessarily reflect subjective quality
• Specific artifacts produced by video codecs:
 blockiness, blur and banding
 motion jerkiness
 time-varying edge noise ("mosquito effect")
• Alternative metrics may be clustered into
 full reference quality metrics
 reduced reference quality metrics
 no-reference quality metrics
• Note that also subjective testing methods require some reference (e.g. impairment compared to original or
another anchor)
 full reference metrics are most reliable and are also typically used for encoder decisions
• Note: Subsequent slide gives an example (SSIM) – not claimed that this is the best!
Quality metrics
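As a reference point for the metrics discussed here, PSNR is computed from the mean squared error between reference and decoded samples. A minimal sketch, assuming flat lists of sample values and the peak value 2^bit_depth - 1 (the function name and the sample data are illustrative):

```python
import math


def psnr(reference, decoded, bit_depth=8):
    """Peak signal-to-noise ratio in dB between two equally long sample lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    peak = (1 << bit_depth) - 1
    return 10.0 * math.log10(peak * peak / mse)


print(round(psnr([0, 0, 0, 0], [1, 1, 1, 1]), 2))
```

Since only the sample-wise error enters, two decoded pictures with the same PSNR can look very different, which motivates the alternative metrics listed above.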

122
• Example of another full-reference metric which better matches subjective quality at least for images
• Structural SIMilarity Index (SSIM) [Wang et al. 2004] measures the structural distortion by exploring three
components: Luminance, Contrast and Structural changes.
 Luminance: $l(x,y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}$
 Contrast: $c(x,y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}$
 Structure comparison: $s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3}$
 Combination: $\mathrm{SSIM}(x,y) = [l(x,y)]^{\alpha} \cdot [c(x,y)]^{\beta} \cdot [s(x,y)]^{\gamma}$
• Numerous variants:
 Computation separately for regions
 Weighting by amount of motion and frame averaging for video
 Computation in complex wavelet domain for frequency weighting (MS-SSIM, multi-scale)
Perceptually adapted quality metrics example: SSIM
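The three SSIM components can be evaluated with a few lines of code. The sketch below computes them once over a whole signal instead of over local windows, sets all three exponents to 1, and uses the constants K1 = 0.01, K2 = 0.03 and C3 = C2/2 commonly associated with [Wang et al. 2004]; these choices and the function name are assumptions for illustration.

```python
import math


def ssim_global(x, y, bit_depth=8, k1=0.01, k2=0.03):
    """SSIM of two equally long sample lists, computed over the whole signal."""
    n = len(x)
    peak = (1 << bit_depth) - 1
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2
    c3 = c2 / 2
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    std_x, std_y = math.sqrt(var_x), math.sqrt(var_y)
    luminance = (2 * mean_x * mean_y + c1) / (mean_x ** 2 + mean_y ** 2 + c1)
    contrast = (2 * std_x * std_y + c2) / (var_x + var_y + c2)
    structure = (cov + c3) / (std_x * std_y + c3)
    return luminance * contrast * structure


ramp = [10, 20, 30, 40]
print(ssim_global(ramp, ramp), ssim_global(ramp, ramp[::-1]))
```

An identical pair gives a value of 1, while the reversed ramp has the same mean and contrast but an inverted structure, so only the structure term pulls the result down.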


123
• Textures with large amount of detail and/or motion are often extremely challenging for video codecs
• On the other hand, the exact pixel-wise appearance is largely irrelevant for human observers, whereas
degradation of visual quality is critical
• Textures in videos can be static or dynamically changing over time
 Static textures basically rigid (but may be moving globally)
 Dynamic textures have high amount of irregular local motion
 Examples: water, smoke, head-and-shoulder sequences
• Both categories should have some stationarity properties in space and/or time, for allowing modelling as
random process expressed by parametric description – examples:
 Spectral properties
 Moments (marginal statistics and covariance statistics)
 Random field models
• In case of dynamic texture, modelling the motion properties is relevant as well, can also be understood as a
random field with certain amount of variation
Perceptual coding: Texture analysis and synthesis

124
• Example below is based on a parametric statistical description in complex wavelet domain (steerable
pyramid), with lowpass baseband and four directional orientations in bandpass layers
[Portilla, Simoncelli 2000]
• Efficient coding of parameters needed for synthesis by [Thakur, Ray 2016]
• Marginal statistics expressed as scalar values
• Auto and cross correlation statistics compressed via DCT
Static texture synthesis
Reference HEVC Intra Coding 0.223bpp Thakur et al. 0.213bpp

125
[Figure: block diagram of the method. Analysis stages: dense optical flow (OF) between adjacent frames of the T original frames, motion vector field MVF, analyse motion distribution, discard non-probable MV combinations, MCM M, compressed MCM Mc, discard intermediate frames. Synthesis stages: derive motion vectors, invert MVF, synthesized MVF, frame warping and blending, T-2 synthesized frames]
Source: Chubach et al. 2017
Dynamic texture synthesis method

126
HEVC 6 of 8 frames synthesized
Dynamic texture synthesis vs. HEVC at same rate

127
• Recently, many signal processing tasks are solved by employing machine learning, deep learning and
convolutional neural networks (CNN)
• Advantages for video compression could be as follows:
• Systematic approach of optimizing with big data sets (rather than hand-crafted design)
• Detection and exploitation of nonlinear dependencies in images and video
• Inclusion of perceptual criteria by mimicking human observer behaviour
• On the downside, both training and running CNN algorithms, e.g. for encoder decisions or at the decoder,
may be overly complex
• Types of NN that have been proposed for image/video compression
• Autoencoders
• Adversarial networks
• Recurrent networks, particularly based on LSTM (long short-term memory) elements
Learning based approaches: Overview

128
• An autoencoder is a deep (convolutional) neural network with a sparse hidden layer that represents the code
• The encoder typically performs subsequent filtering and downsampling steps on input x per layer (note
conceptual similarity with transform coding!)
• The decoder performs complementary upsampling steps and generates output y
• Encoder and decoder are trained jointly such that
• Difference between x and y is minimized w.r.t. some distortion
• Code z is as sparse (minimum amount of information) as possible
• Use Bayes formula P(z|x) ∝ P(x|z)P(z) and minimize Kullback Leibler divergence of conditional probabilities to achieve the latter [Kingma, Welling 2014]
Convolutional Neural Networks: Autoencoders (AE)
Source: Wikipedia
[Figure: autoencoder with input x, code z=F(x) and output y=G(z)]

129
• Generator net G generates samples y from random variables z (G would be the decoder, z the code)
• Discriminator net D decides whether the samples could match with real-world images x which stem from an
unknown distribution P(x)
• Generator and discriminator nets are trained iteratively, optimizing following function
• Minimax optimization:
• Train D such that V is maximized
• Train G such that V is minimized
• Problem: There is no corresponding
mapping from x to z (no encoder)
• Solution (e.g. [Santurkar et al. 2017]):
Combination AE and GAN, i.e. train
F(x) from AE joint with G(z) and D(⋅)
Convolutional Neural Networks: Generative Adversarial Networks (GAN)
Source: Slideshare.net – K. McGuinness
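[Figure: generator G(z) produces samples y from random variables z; the discriminator evaluates D(x) for real images x or D(y) for generated samples]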
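The function V that is optimized was shown as an image and is not reproduced in the text. Assuming the slide refers to the standard GAN objective of Goodfellow et al. (2014), which matches the description "train D such that V is maximized, train G such that V is minimized", it reads as follows, with P(z) denoting the distribution of the random code (a symbol introduced here for illustration):

```latex
\min_{G} \max_{D} V(D, G)
  = \mathbb{E}_{x \sim P(x)} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim P(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]
```

The first term rewards the discriminator for accepting real images x, the second for rejecting generated samples y = G(z); the generator only influences the second term.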

130
• Variable-rate and variable-size coding not straightforward
• Option to operate over small patches / blocks
• Train separately for different content complexity
• Code residual differences
• Cost functions for rate distortion optimization not straightforward to implement
• Option to re-formulate rate constraint as energy minimization problem
• Hybrid solutions where conventional entropy coding is operated after network output at encoder
• None of these solutions may lead to a consistent optimum, and they may need to be driven by some external
decision mechanism
Convolutional Neural Networks: General problems and possible solutions

131
• Autoencoder could be interpreted as a monolithic non-linear transform (though operating with local kernels)
– see previously used notation in light green below
• A similar approach is proposed in [Ballé et al. 2017], with additional criteria for rate distortion optimization
and quantization / entropy coding on the sparse representation (called y here)
• Perceptual optimization based on nonlinear "generalized divisive normalization" and L2 norm minimization in
nonlinear space
• Authors report significant improvement on detail structures, also improved MS-SSIM compared to
conventional codecs – transform optimized based on cost criterion below:
Trained non-linear transforms
[Figure: block diagram annotated with the previously used notation x, y, z, z', F(x) and G(z')]
Source: Ballé et al. 2017

132
• All methods discussed so far were developed for still image coding, and could be used in intra coding for video
• Main problem: Motion compensation is a very effective tool, and can hardly be trained into a network (or would
be tremendously more complex than conventional motion estimation)
• Some work on using CNN for
 Sub-pel interpolation
 Resolution up-conversion
 Post-processing
 Texture synthesis and inpainting
• It is also not as simple to train for perceptual criteria in video
NN for video

133
• NN-based approaches were so far more successful in still image coding than in video coding
 Perceptual criteria also better understood for images
• In video coding, motion compensation is a most effective key component
 Requires motion estimation for which "conventional" algorithms appear to be less complex
 Analogy: Eye tracking – the brain processes a motion compensated input
• CNN have been demonstrated to provide benefit in context of video coding for
 Resolution up-conversion
 Post-processing and loop filtering
 Intra coding
 Encoder optimization, in particular partitioning which is basically a segmentation problem
NN for video

134
• Switching to lower resolution is common (and necessary) when data rate is low
• Video is locally varying by detail, and may not require encoding at full resolution everywhere
• Lower resolution may also be useful with high motion, motion blur, etc.
• Coding less information in such less relevant areas can save data rate
• Tools "Reduced Resolution Update" or "Dynamic Resolution Conversion" were included in MPEG-4 part 2 and
H.263+, but not well understood at that time
• Requires tools for
 downsampling when generating prediction from reference
 signalling the coding with variable resolution
 upsampling for generating full-resolution picture
• Three examples shown subsequently:
 Down/Up-sampling using neural networks / conventional filters
 Coding B pictures of dynamic texture with low resolution
 Dictionary-based super-resolution upsampling
Variable-resolution coding

135
• Basic idea of dynamic resolution coding:
 Downsample and code by lower resolution (less bitrate cost)
 Upsample at decoder side to full resolution
 Encoder decides between coding at full resolution and conventional or CNN-based down- and upsampling
 CNN-based could generate super-resolution upsampling, sharper edges, etc.
• Can be implemented in combination with intra and inter prediction coding
• Operated on block by block basis
CNN for resolution up-conversion
Figure from JVET-J0032

136
• Loop filtering is common in video coding
 removes compression artifacts from reconstruction
 improves prediction from reconstructed frames
• Generally, signal-adaptive and non-linear filters
 e.g., de-blocking, de-ringing, de-banding
 edge-adaptive & Wiener optimized
 bi-lateral filters
 ...
• CNN reconstruction provides additional gain (3-5% rate reduction) and might replace some conventional filters
• Can be operated on block basis, parallel processing possible
CNN for loop filtering
Figures from JVET-I0022
[Figure: block-wise processing of Block1 ... Block10, where each process unit is a block with margins of padding_size around it]
[Figure: network structure. The inputs "Normalized Y/U/V" and "Normalized QP Map" are concatenated and passed through Conv1 (5, 5, 45), Conv2 (3, 3, 54), Conv3 (3, 3, 58), Conv4 (3, 3, 48), Conv5 (3, 3, 51), Conv6 (3, 3, 40), Conv7 (3, 3, 31) and Convolution8 (3, 3, 1), followed by a summation. ConvL (M, N, KL) stands for ConvolutionL (M, N, KL) plus ReLU, with M: kernel width, N: kernel height, KL: kernel number]

137
• Neural networks were demonstrated to provide improved intra prediction, compared to conventional
directional and planar modes
• Mostly fully connected networks have been used for this purpose (no convolutional layers)
• Average rate reductions of 4-5% (for intra coding) have been reported
• Examples of prediction demonstrate the benefit of non-linear processing
Neural networks for intra prediction
Figure from JVET-J0037
Figures from Li et al. IEEE-TCSVT, July 2018

138
• Key pictures coded with full resolution
• Non-key pictures coded with reduced resolution
• Upsampling based on motion-compensated steerable pyramid
Variable-resolution coding for dynamic texture (Thakur et al. 2017)
[Figure: original pictures, reconstructed key pictures (Ref pic L0, Ref pic L1) and their lowpass versions, used for predicting non-key pictures]

139
• Motion vectors initially estimated from downsampled lowpass key pictures, refined and applied in bandpass
and highpass components of non-key pictures
• Authors report significant bit rate saving (20-30% average) for dynamic texture content, whereas subjective
quality is preserved compared to full-resolution coding
Variable-resolution coding for dynamic texture (Thakur et al. 2017)
[Figure: motion estimation between the reference lowpass (key picture) and the current lowpass (non-key picture), followed by motion compensation in the bandpass and highpass components]

140
• Low and high-resolution dictionaries trained jointly with sparsity constraint (large data base)
• Up-converter searches low number of matching dictionary bases in low res, and applies the corresponding
bases from the high res dictionary
Low-resolution coding with dictionary-based up-conversion (Schneider et al. 2017)

141
• Scheme run with overlapping blocks
• Provides sharp reconstruction of structures and edges
• Authors report 2-3% rate gain when used in upsampling for HEVC scalable coding
Low-resolution coding with dictionary-based up-conversion (Schneider et al. 2017)
