Jiyang Yu, Ravi Ramamoorthi; Learning Video Stabilization Using Optical Flow, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8159-8167
https://openaccess.thecvf.com/content_CVPR_2020/html/Yu_Learning_Video_Stabilization_Using_Optical_Flow_CVPR_2020_paper.html
8. Network and Training
■ Network structure
• The structure proposed by [Zhou+, ACM Trans. Graph., 2018]
■ Stabilizing many frames jointly handles low-frequency shake better
• Training becomes difficult when the number of frames is large
• The optical flow fields of 20 frames are used as the input
• A sliding window is used (see the sketch below)
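The sliding-window inference can be sketched as follows. This is a minimal illustration, not the authors' code: net stands for a hypothetical trained model that maps a stack of 20 optical-flow fields to a per-pixel warp field, and all names and shapes are assumptions.

    import torch

    def stabilize_windows(flows, net, win=20):
        """Slide a window of optical-flow fields over the video.
        flows: (N-1, 2, H, W) tensor of frame-to-frame optical flow
        net:   hypothetical model, (1, 2*win, H, W) -> (1, 2, H, W) warp field
        Returns one warp field per window position k."""
        warps = []
        for k in range(flows.shape[0] - win + 1):          # window index k
            window = flows[k : k + win]                    # the win flow fields
            x = window.reshape(1, 2 * win, *flows.shape[-2:])
            warps.append(net(x))                           # warp field for window k
        return torch.cat(warps, dim=0)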
by an alpha composition of the transformed images into a single image in back-to-front order. Both the planar transformation and alpha compositing are differentiable, and can be easily incorporated into the rest of the learning pipeline.
Planar transformation. Here we describe the planar transformation that inverse warps each MPI RGBA plane onto a target viewpoint. Let the geometry of the MPI plane to be transformed (i.e. the source) be n · x + a = 0, where n denotes the plane normal, x = [u_s, v_s, 1]^T the source pixel homogeneous coordinates, and a the plane offset. Since the source MPI plane is fronto-parallel to the reference source camera, we have n = [0, 0, 1] and a = −d_s, where d_s is the depth of the source MPI plane. The rigid 3D transformation matrix mapping from source to target camera is defined by a 3D rotation R and translation t, and the source and target camera intrinsics are denoted k_s and k_t, respectively. Then for each pixel (u_t, v_t) in the target MPI plane, we use the standard inverse homography [Hartley and Zisserman 2003] to obtain
\[
\begin{bmatrix} u_s \\ v_s \\ 1 \end{bmatrix}
\sim
k_s \left( R^{T} + \frac{R^{T}\, t\, n\, R^{T}}{a - n\, R^{T} t} \right) k_t^{-1}
\begin{bmatrix} u_t \\ v_t \\ 1 \end{bmatrix}
\tag{2}
\]
Therefore, we can obtain the color and alpha values for each target pixel [u_t, v_t] by looking up its correspondence [u_s, v_s] in the source image. Since [u_s, v_s] may not be an exact pixel coordinate, we use bilinear interpolation among the 4-grid neighbors to obtain the resampled values (following [Jaderberg et al. 2015; Zhou et al. 2016]).
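Equation (2) collapses to a single 3×3 homography per MPI plane. The following is a minimal numpy sketch of that matrix, our own illustration of the formula above (treating n as a row vector), not code from the paper:

    import numpy as np

    def inverse_homography(ks, kt, R, t, ds):
        """Return H with [u_s, v_s, 1]^T ~ H [u_t, v_t, 1]^T as in Eq. (2).
        R, t: rigid rotation/translation from source to target camera.
        ds:   depth of the fronto-parallel source MPI plane."""
        n = np.array([[0.0, 0.0, 1.0]])   # plane normal, as a row vector
        a = -ds                           # plane offset in n . x + a = 0
        t = t.reshape(3, 1)
        mid = R.T + (R.T @ t @ n @ R.T) / (a - n @ R.T @ t)
        return ks @ mid @ np.linalg.inv(kt)

Each target pixel [u_t, v_t] is mapped through H, and the source RGBA values are then bilinearly resampled at the (generally non-integer) result.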
Alpha compositing. After applying the planar transformation
to each MPI plane, we then obtain the predicted target view by
alpha compositing the color images in back-to-front order using the
standard over operation [Porter and Duff 1984].
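A minimal sketch of the over operation on the transformed planes, assuming a (D, H, W, 4) RGBA array ordered back to front (the array layout is our assumption):

    import numpy as np

    def composite_over(planes):
        """planes: (D, H, W, 4) transformed RGBA planes, index 0 = farthest.
        Returns the (H, W, 3) image composited with the over operation."""
        out = np.zeros(planes.shape[1:3] + (3,))
        for rgba in planes:                          # back-to-front order
            rgb, alpha = rgba[..., :3], rgba[..., 3:4]
            out = rgb * alpha + out * (1.0 - alpha)  # standard over
        return out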
Table 1. Our network architecture, where k is the kernel size, s the stride, d the kernel dilation, chns the number of input and output channels for each layer, in and out the accumulated strides for the input and output of each layer, and input denotes the input source of each layer, with + meaning concatenation. See Section 3.5 for more details.
Layer k s d chns in out input
conv1_1 3 1 1 99/64 1 1 I1 + Î2
conv1_2 3 2 1 64/128 1 2 conv1_1
conv2_1 3 1 1 128/128 2 2 conv1_2
conv2_2 3 2 1 128/256 2 4 conv2_1
conv3_1 3 1 1 256/256 4 4 conv2_2
conv3_2 3 1 1 256/256 4 4 conv3_1
conv3_3 3 2 1 256/512 4 8 conv3_2
conv4_1 3 1 2 512/512 8 8 conv3_3
conv4_2 3 1 2 512/512 8 8 conv4_1
conv4_3 3 1 2 512/512 8 8 conv4_2
conv5_1 4 .5 1 1024/256 8 4 conv4_3 + conv3_3
conv5_2 3 1 1 256/256 4 4 conv5_1
conv5_3 3 1 1 256/256 4 4 conv5_2
conv6_1 4 .5 1 512/128 4 2 conv5_3 + conv2_2
conv6_2 3 1 1 128/128 2 2 conv6_1
conv7_1 4 .5 1 256/64 2 1 conv6_2 + conv1_2
conv7_2 3 1 1 64/64 1 1 conv7_1
conv7_3 1 1 1 64/67 1 1 conv7_2
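Table 1 describes a dilated encoder-decoder with skip connections; the s = .5 rows are 2× transposed convolutions. Below is a minimal PyTorch sketch of the same layer graph. The padding scheme and the ReLU activations are our assumptions; the excerpt does not specify them.

    import torch
    import torch.nn as nn

    def conv(cin, cout, k=3, s=1, d=1):
        p = d * (k - 1) // 2                       # 'same'-style padding
        return nn.Sequential(nn.Conv2d(cin, cout, k, s, p, dilation=d),
                             nn.ReLU(inplace=True))

    def upconv(cin, cout):                         # the k=4, s=.5 rows
        return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, 2, 1),
                             nn.ReLU(inplace=True))

    class Net(nn.Module):
        def __init__(self, cin=99, cout=67):
            super().__init__()
            self.c1_1, self.c1_2 = conv(cin, 64), conv(64, 128, s=2)
            self.c2_1, self.c2_2 = conv(128, 128), conv(128, 256, s=2)
            self.c3_1, self.c3_2 = conv(256, 256), conv(256, 256)
            self.c3_3 = conv(256, 512, s=2)
            self.c4_1, self.c4_2, self.c4_3 = (conv(512, 512, d=2),
                                               conv(512, 512, d=2),
                                               conv(512, 512, d=2))
            self.c5_1 = upconv(1024, 256)          # input: conv4_3 + conv3_3
            self.c5_2, self.c5_3 = conv(256, 256), conv(256, 256)
            self.c6_1 = upconv(512, 128)           # input: conv5_3 + conv2_2
            self.c6_2 = conv(128, 128)
            self.c7_1 = upconv(256, 64)            # input: conv6_2 + conv1_2
            self.c7_2 = conv(64, 64)
            self.c7_3 = nn.Conv2d(64, cout, 1)     # 1x1 output layer

        def forward(self, x):
            x1 = self.c1_2(self.c1_1(x))                      # stride 2
            x2 = self.c2_2(self.c2_1(x1))                     # stride 4
            x3 = self.c3_3(self.c3_2(self.c3_1(x2)))          # stride 8
            x4 = self.c4_3(self.c4_2(self.c4_1(x3)))          # stride 8
            x5 = self.c5_3(self.c5_2(self.c5_1(torch.cat([x4, x3], 1))))
            x6 = self.c6_2(self.c6_1(torch.cat([x5, x2], 1)))
            x7 = self.c7_2(self.c7_1(torch.cat([x6, x1], 1)))
            return self.c7_3(x7)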
3.5 Implementation details
Unless specified otherwise, we use D = 32 planes set at equidistant disparity (inverse depth).
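For concreteness, equidistant-disparity plane placement can be computed as below. This is our own illustration; the near/far values are placeholders, not taken from the excerpt.

    import numpy as np

    def mpi_plane_depths(num_planes=32, near=1.0, far=100.0):
        # Sample disparities (inverse depths) uniformly, then invert:
        # the resulting depths are densest close to the camera.
        disparity = np.linspace(1.0 / near, 1.0 / far, num_planes)
        return 1.0 / disparity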
12. Training Stages
■ Stage 1 (without Lf): L = Lm
■ Stage 2 (with Lf): L = Lm + 0.1 · Lf
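A minimal sketch of this two-stage schedule. The loss terms are stand-ins: motion_loss plays the role of Lm, and freq_loss is only an illustrative frequency-domain penalty on the warp field (the paper defines its own Lm and Lf):

    import torch

    def freq_loss(warp):
        # Illustrative stand-in for Lf: mean spectral magnitude outside a
        # central low-frequency band of the warp field's 2D spectrum.
        spec = torch.fft.fftshift(torch.fft.fft2(warp), dim=(-2, -1)).abs()
        h, w = spec.shape[-2:]
        lo = spec[..., h // 4 : 3 * h // 4, w // 4 : 3 * w // 4]
        return (spec.sum() - lo.sum()) / spec.numel()

    def train(net, loader, motion_loss, steps_stage1, steps_stage2, lam=0.1):
        opt = torch.optim.Adam(net.parameters(), lr=1e-4)
        for step, (flows, target) in enumerate(loader):
            warp = net(flows)
            loss = motion_loss(warp, target)        # stage 1: L = Lm
            if step >= steps_stage1:                # stage 2: L = Lm + 0.1*Lf
                loss = loss + lam * freq_loss(warp)
            opt.zero_grad(); loss.backward(); opt.step()
            if step + 1 >= steps_stage1 + steps_stage2:
                break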
13. Test and Implementation Details
■ Raw warp field
• May contain artifacts at the valid/invalid region boundaries
• Caused by the inpainting of the optical flow
• The PCA Flow smoothed field is used as the warp field (see the sketch below)
Figure 10. The visual comparison of (a) the warped frames using the raw outputs of the networks trained with and (b) without Lf. The red and green boxes indicate the noisy regions. The frequency domain loss helps to improve the quality of the warp field.

Figure 11. The visual comparison of (a) the frames warped with the raw warp field and (b) the PCA Flow smoothed warp field. Due to the inpainting of the optical flow, the raw warp field may contain artifacts at the valid/invalid region boundaries.

Figure 12. Sliding-window processing of a long video, where n is the frame index within the window, k is the window index, and W1,k−1 is the warp field for the first frame of window k−1 …

… training progress. The Case-I (red curve) represents the training with Lm only. The Case-II (green curve) represents the …
■ Sliding window
• Used to smooth the entire video
• Wn,k: warp field for frame n in window k
• n: frame index within the window
• k: window index