Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform
1. Paralleling Variable Block Size Motion Estimation of HEVC on CPU plus GPU Platform
Xiangwen Wang1, Li Song2, Yanan Zhao2, Min Chen1
1Shanghai University of Electric Power
2 Shanghai Jiao Tong University
3. Introduction
HEVC is the newest video coding standard, introduced by ITU-T VCEG and
ISO/IEC MPEG.
Compared with H.264/AVC, HEVC decreases the bitrate by 50% on average
while maintaining the same visual quality.
Figure: BQMall_832x480. Left: HEVC at 1.5 Mbps; right: x264 at 3.0 Mbps
4. Introduction
However, HEVC encoding is several times more complex than H.264.
• RDO: iterates over all mode and partition combinations to decide the best
coding information
• RDOQ: iterates over many QP candidates for each block
• Intra: prediction modes increased to 35 for luma
• SAO: works pixel by pixel
• Quadtree structure: bigger block sizes and numerous partition modes
• Some other highly computational modules ...
As a result, the traditional approach of performing the encoding
sequentially can no longer meet real-time demands, especially for
HD (1920x1080) and UHD (3840x2160) videos.
Parallelism in the encoding procedure must be extensively exploited.
5. Overview of VBSME in HEVC
• Three independent block concepts
• CU - Coding Unit
• PU - Prediction Unit
• TU - Transform Unit
• The total number of allowed PU sizes is 12 (from 64x64 down to
4x8/8x4), leading to up to 425 ME operations for one 64x64 CTU
(5 + 4x5 + 16x5 + 64x5 = 425)
Figure: CU size and PU partition structure (CU depths 1-4: 64x64, 32x32, 16x16, 8x8)
6. Two stages:
• ME to select best MV for candidate PUs
• CU depth and PU partition mode decision
MV selection criterion for each PU:
J_pred,SAD = SA(T)D + λ_pred · R_pred
CU sizes and PU partition mode decision:
J_mode = SSD + λ_mode · R_mode
To calculate J_mode for each PU, reconstruction and entropy coding of
all syntax elements are necessary; this complexity is beyond the
computational capability of common computers for real applications.
Figure: Mode decision with VBSME in HM
7. The proposed parallel encoding framework
Figure: the proposed parallel encoding framework. The CPU reads the
image, copies fEnc to GPU memory, and launches the interpolation &
border-padding kernel, which fills the half/quarter-pixel image buffers.
The ME kernel then produces the 64x64~8x8 PU MVs for one LCU line. In an
LCU-line loop inside the frame loop, the CPU launches ME for the next
CTU line, syncs to the previous LCU line's ME, and performs MC, mode
decision, and entropy coding; after all LCU lines are processed, the
reconstructed frame fRec is copied back for use with the next frame.
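The frame/LCU-line loop structure can be sketched as follows. All names here are illustrative stand-ins, not the encoder's real API; GPU stages are represented by trace entries where the real code would launch CUDA kernels asynchronously.

```cpp
#include <string>
#include <vector>

// Sketch of the CPU/GPU pipeline: the GPU runs ME one LCU line ahead
// while the CPU encodes the current line.
std::vector<std::string> encode_frame(int num_lcu_lines) {
    std::vector<std::string> trace;
    trace.push_back("copy fEnc to GPU");
    trace.push_back("launch interpolate & border pad");
    trace.push_back("launch ME for LCU line 0");
    for (int line = 0; line < num_lcu_lines; ++line) {
        // Overlap CPU and GPU: start ME of the next LCU line on the GPU
        // before the CPU encodes the current one.
        if (line + 1 < num_lcu_lines)
            trace.push_back("launch ME for LCU line " + std::to_string(line + 1));
        trace.push_back("sync to ME of LCU line " + std::to_string(line));
        trace.push_back("MC + mode decision, line " + std::to_string(line));
        trace.push_back("entropy coding, line " + std::to_string(line));
    }
    trace.push_back("reconstruct fRec and reuse for the next frame");
    return trace;
}
```

The key design point visible in the trace is that the "launch" for line n+1 precedes the "sync" for line n, so the GPU is never idle while the CPU encodes.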
8. Fast PU partition mode decision scheme
Figure: fast PU partition mode decision flow. After syncing to the LCU
line's ME results and MC, each CU is first tested for SKIP; if not
skipped, the CBF_fast early termination is checked, then the fast CU
partition and the PART_2Nx2N test; the RD cost is calculated and the
process recurses into the CU partition or moves on to the next CU until
CU depth == 4 and CU_idx == 4.
The MV and residual information are employed for PU
partition decision
Two edge feature parameters:
V = |S_00 + S_01 − S_10 − S_11| / (8 · QPstep · N)
H = |S_00 + S_10 − S_01 − S_11| / (8 · QPstep · N)
where S_ij is the residual sum of sub-block (i, j), QPstep is the
quantization step, and N is the number of pixels.
If (H == V && H != 0)
    PART_2Nx2N
Else if (H == V && H == 0)
    PART_NxN
Else if (H > V)
    PART_Nx2N
Else
    PART_2NxN
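The rule above transcribes directly into code. This is a minimal sketch: the enum and function name are illustrative, and H and V are assumed to be the two edge feature parameters already computed as shown on this slide.

```cpp
// PU partition modes named as on the slide.
enum PartMode { PART_2Nx2N, PART_2NxN, PART_Nx2N, PART_NxN };

// Direct transcription of the edge-feature decision rule.
PartMode decide_pu_partition(int H, int V) {
    if (H == V && H != 0) return PART_2Nx2N;
    if (H == V)           return PART_NxN;   // here H == V == 0
    if (H > V)            return PART_Nx2N;
    return PART_2NxN;                        // H < V
}
```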
9. Parallel realization of VBSME on CUDA
Figure: variable block size ME on CUDA, processing four 16x16 lines per
CTU line:
• 8x8 block-size SAD calculation
• 16x16 block-size J_pred calculation
• integer-pixel J_pred comparison (16x16)
• fractional-pixel MV refinement
• variable block size J_pred generation and calculation
• integer-pixel J_pred comparison (variable block sizes)
The MV selection criterion is as follows:
J_pred = SAD + λ_pred · D_MV = SAD + λ_pred · (MV_C − PMV)
where MV_C is the MV of the current search point and PMV is the MV
prediction (next slide).
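A minimal sketch of this cost, assuming D_MV is measured as the sum of absolute component differences between the candidate MV and the prediction (a real encoder would use the coding rate of the MV difference); the struct and function names are illustrative.

```cpp
#include <cstdlib>

struct Mv { int x, y; };

// J_pred = SAD + lambda_pred * D_MV, with D_MV approximated by the
// absolute component differences between MV_C and PMV.
int j_pred(int sad, int lambda_pred, Mv mv_c, Mv pmv) {
    int d_mv = std::abs(mv_c.x - pmv.x) + std::abs(mv_c.y - pmv.y);
    return sad + lambda_pred * d_mv;
}
```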
10. PMV for MV cost calculation
Figure: the five neighboring MVs (MV0-MV4) used for prediction.
PMV = median(MV0, MV1, MV2, MV3, MV4)
One CTU (64x64) line is divided into four 16x16 block lines;
the ME process of each 16x16 line is done by the GPU sequentially;
the MVs of the 16x16 block size are used as the MV predictions for
all other block sizes.
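The five-way median can be sketched as below. Taking the median component-wise over the neighboring MVs is an assumption for illustration, and the names are illustrative.

```cpp
#include <algorithm>

struct Mv { int x, y; };

// PMV = median(MV0..MV4), taken component-wise over the five
// neighboring MVs.
Mv median_pmv(const Mv (&m)[5]) {
    int xs[5], ys[5];
    for (int i = 0; i < 5; ++i) { xs[i] = m[i].x; ys[i] = m[i].y; }
    std::nth_element(xs, xs + 2, xs + 5); // xs[2] becomes the median
    std::nth_element(ys, ys + 2, ys + 5);
    return {xs[2], ys[2]};
}
```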
11. Variable block size SAD Generation
Figure: the 64 8x8 SADs of a CTU (indices 0-63) are aggregated into the
16x16, 32x32, and 64x64 SADs.
Variable block size SAD Generation on CUDA
12. Experimental Results
Platform: Z620 = NVIDIA Tesla C2050 + i7 @ 2.6 GHz, with Win7 OS.
The CUDA driver version of the GPU is 5.0 and the CUDA Capability
version number is 2.0.
The search range is 64x64 with the full search strategy for the integer
MV (IMV), plus 24 fractional-pixel positions around the IMV.
sequence                   CPU (fps)   GPU (fps)   Speedup ratio
Traffic_2560x1600_crop     0.21        23.77       113.2
ParkScreen_1920x1080_24    0.69        77.76       112.7
The speed-up ratio is about 113 times
13. Experimental Results: RD comparison
Note 1: the proposed algorithm is implemented on the x265 encoder, a
first open-source encoder implementation of HEVC ("x265 project,
http://code.google.com/p/x265/").
Note 2: Cactus_Proposed denotes the RD curve generated by the x265
encoder with the proposed algorithm.
14. Conclusion
We present a parallel-friendly VBSME (variable block size motion
estimation) scheme which makes full use of the available computation
resources of the CPU and GPU respectively.
Preliminary results are reported, with a speedup ratio of over 100x
compared to a single-threaded CPU-only solution.
We will continue to exploit parallelism, targeting a 4K@30fps real-time
HEVC encoder on a multicore CPU and GPGPU platform.