Jiyang Yu, Ravi Ramamoorthi; Learning Video Stabilization Using Optical Flow, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8159-8167
https://openaccess.thecvf.com/content_CVPR_2020/html/Yu_Learning_Video_Stabilization_Using_Optical_Flow_CVPR_2020_paper.html
8. Network and Training
■ Network structure
• The structure proposed by [Zhou+, ACM Trans. Graph., 2018]
■ Stabilizing many frames jointly handles low-frequency shake better
• Training becomes difficult when the number of frames is large
• The optical flow fields of 20 frames are used as the input
• A sliding window is used (see the sketch below)
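The sliding-window inference can be sketched as follows. This is a minimal illustration, not the authors' code: net stands for a hypothetical trained model that maps a stack of 20 optical-flow fields to a per-pixel warp field, and all names and shapes are assumptions.

    import torch

    def stabilize_windows(flows, net, win=20):
        """Slide a window of optical-flow fields over the video.
        flows: (N-1, 2, H, W) tensor of frame-to-frame optical flow
        net:   hypothetical model, (1, 2*win, H, W) -> (1, 2, H, W) warp field
        Returns one warp field per window position k."""
        warps = []
        for k in range(flows.shape[0] - win + 1):          # window index k
            window = flows[k : k + win]                    # the win flow fields
            x = window.reshape(1, 2 * win, *flows.shape[-2:])
            warps.append(net(x))                           # warp field for window k
        return torch.cat(warps, dim=0)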
by an alpha composition of the transformed images into a single image in back-to-front order. Both the planar transformation and alpha compositing are differentiable, and can be easily incorporated into the rest of the learning pipeline.
Planar transformation. Here we describe the planar transformation that inverse warps each MPI RGBA plane onto a target viewpoint. Let the geometry of the MPI plane to be transformed (i.e. the source) be n · x + a = 0, where n denotes the plane normal, x = [u_s, v_s, 1]^T the source pixel homogeneous coordinates, and a the plane offset. Since the source MPI plane is fronto-parallel to the reference source camera, we have n = [0, 0, 1] and a = −d_s, where d_s is the depth of the source MPI plane. The rigid 3D transformation matrix mapping from source to target camera is defined by a 3D rotation R and translation t, and the source and target camera intrinsics are denoted k_s and k_t, respectively. Then for each pixel (u_t, v_t) in the target MPI plane, we use the standard inverse homography [Hartley and Zisserman 2003] to obtain
\[
\begin{bmatrix} u_s \\ v_s \\ 1 \end{bmatrix}
\sim
k_s \left( R^{T} + \frac{R^{T}\, t\, n\, R^{T}}{a - n\, R^{T} t} \right) k_t^{-1}
\begin{bmatrix} u_t \\ v_t \\ 1 \end{bmatrix}
\tag{2}
\]
Therefore, we can obtain the color and alpha values for each target pixel [u_t, v_t] by looking up its correspondence [u_s, v_s] in the source image. Since [u_s, v_s] may not be an exact pixel coordinate, we use bilinear interpolation among the 4-grid neighbors to obtain the resampled values (following [Jaderberg et al. 2015; Zhou et al. 2016]).
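Equation (2) collapses to a single 3×3 homography per MPI plane. The following is a minimal numpy sketch of that matrix, our own illustration of the formula above (treating n as a row vector), not code from the paper:

    import numpy as np

    def inverse_homography(ks, kt, R, t, ds):
        """Return H with [u_s, v_s, 1]^T ~ H [u_t, v_t, 1]^T as in Eq. (2).
        R, t: rigid rotation/translation from source to target camera.
        ds:   depth of the fronto-parallel source MPI plane."""
        n = np.array([[0.0, 0.0, 1.0]])   # plane normal, as a row vector
        a = -ds                           # plane offset in n . x + a = 0
        t = t.reshape(3, 1)
        mid = R.T + (R.T @ t @ n @ R.T) / (a - n @ R.T @ t)
        return ks @ mid @ np.linalg.inv(kt)

Each target pixel [u_t, v_t] is mapped through H, and the source RGBA values are then bilinearly resampled at the (generally non-integer) result.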
Alpha compositing. After applying the planar transformation
to each MPI plane, we then obtain the predicted target view by
alpha compositing the color images in back-to-front order using the
standard over operation [Porter and Duff 1984].
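A minimal sketch of the over operation on the transformed planes, assuming a (D, H, W, 4) RGBA array ordered back to front (the array layout is our assumption):

    import numpy as np

    def composite_over(planes):
        """planes: (D, H, W, 4) transformed RGBA planes, index 0 = farthest.
        Returns the (H, W, 3) image composited with the over operation."""
        out = np.zeros(planes.shape[1:3] + (3,))
        for rgba in planes:                          # back-to-front order
            rgb, alpha = rgba[..., :3], rgba[..., 3:4]
            out = rgb * alpha + out * (1.0 - alpha)  # standard over
        return out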
Table 1. Our network architecture, where k is the kernel size, s the stride, d the kernel dilation, chns the number of input and output channels for each layer, in and out the accumulated strides for the input and output of each layer, and input denotes the input source of each layer, with + meaning concatenation. See Section 3.5 for more details.
Layer k s d chns in out input
conv1_1 3 1 1 99/64 1 1 I1 + Î2
conv1_2 3 2 1 64/128 1 2 conv1_1
conv2_1 3 1 1 128/128 2 2 conv1_2
conv2_2 3 2 1 128/256 2 4 conv2_1
conv3_1 3 1 1 256/256 4 4 conv2_2
conv3_2 3 1 1 256/256 4 4 conv3_1
conv3_3 3 2 1 256/512 4 8 conv3_2
conv4_1 3 1 2 512/512 8 8 conv3_3
conv4_2 3 1 2 512/512 8 8 conv4_1
conv4_3 3 1 2 512/512 8 8 conv4_2
conv5_1 4 .5 1 1024/256 8 4 conv4_3 + conv3_3
conv5_2 3 1 1 256/256 4 4 conv5_1
conv5_3 3 1 1 256/256 4 4 conv5_2
conv6_1 4 .5 1 512/128 4 2 conv5_3 + conv2_2
conv6_2 3 1 1 128/128 2 2 conv6_1
conv7_1 4 .5 1 256/64 2 1 conv6_2 + conv1_2
conv7_2 3 1 1 64/64 1 1 conv7_1
conv7_3 1 1 1 64/67 1 1 conv7_2
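Table 1 describes a dilated encoder-decoder with skip connections; the s = .5 rows are 2× transposed convolutions. Below is a minimal PyTorch sketch of the same layer graph. The padding scheme and the ReLU activations are our assumptions; the excerpt does not specify them.

    import torch
    import torch.nn as nn

    def conv(cin, cout, k=3, s=1, d=1):
        p = d * (k - 1) // 2                       # 'same'-style padding
        return nn.Sequential(nn.Conv2d(cin, cout, k, s, p, dilation=d),
                             nn.ReLU(inplace=True))

    def upconv(cin, cout):                         # the k=4, s=.5 rows
        return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, 2, 1),
                             nn.ReLU(inplace=True))

    class Net(nn.Module):
        def __init__(self, cin=99, cout=67):
            super().__init__()
            self.c1_1, self.c1_2 = conv(cin, 64), conv(64, 128, s=2)
            self.c2_1, self.c2_2 = conv(128, 128), conv(128, 256, s=2)
            self.c3_1, self.c3_2 = conv(256, 256), conv(256, 256)
            self.c3_3 = conv(256, 512, s=2)
            self.c4_1, self.c4_2, self.c4_3 = (conv(512, 512, d=2),
                                               conv(512, 512, d=2),
                                               conv(512, 512, d=2))
            self.c5_1 = upconv(1024, 256)          # input: conv4_3 + conv3_3
            self.c5_2, self.c5_3 = conv(256, 256), conv(256, 256)
            self.c6_1 = upconv(512, 128)           # input: conv5_3 + conv2_2
            self.c6_2 = conv(128, 128)
            self.c7_1 = upconv(256, 64)            # input: conv6_2 + conv1_2
            self.c7_2 = conv(64, 64)
            self.c7_3 = nn.Conv2d(64, cout, 1)     # 1x1 output layer

        def forward(self, x):
            x1 = self.c1_2(self.c1_1(x))                      # stride 2
            x2 = self.c2_2(self.c2_1(x1))                     # stride 4
            x3 = self.c3_3(self.c3_2(self.c3_1(x2)))          # stride 8
            x4 = self.c4_3(self.c4_2(self.c4_1(x3)))          # stride 8
            x5 = self.c5_3(self.c5_2(self.c5_1(torch.cat([x4, x3], 1))))
            x6 = self.c6_2(self.c6_1(torch.cat([x5, x2], 1)))
            x7 = self.c7_2(self.c7_1(torch.cat([x6, x1], 1)))
            return self.c7_3(x7)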
3.5 Implementation details
Unless specified otherwise, we use D = 32 planes set at equidistant disparity (inverse depth).
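For concreteness, equidistant-disparity plane placement can be computed as below. This is our own illustration; the near/far values are placeholders, not taken from the excerpt.

    import numpy as np

    def mpi_plane_depths(num_planes=32, near=1.0, far=100.0):
        # Sample disparities (inverse depths) uniformly, then invert:
        # the resulting depths are densest close to the camera.
        disparity = np.linspace(1.0 / near, 1.0 / far, num_planes)
        return 1.0 / disparity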
12. Training Stages
■ Stage 1 (without Lf): L = Lm
■ Stage 2 (with Lf): L = Lm + 0.1 · Lf
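A minimal sketch of this two-stage schedule. The loss terms are stand-ins: motion_loss plays the role of Lm, and freq_loss is only an illustrative frequency-domain penalty on the warp field (the paper defines its own Lm and Lf):

    import torch

    def freq_loss(warp):
        # Illustrative stand-in for Lf: mean spectral magnitude outside a
        # central low-frequency band of the warp field's 2D spectrum.
        spec = torch.fft.fftshift(torch.fft.fft2(warp), dim=(-2, -1)).abs()
        h, w = spec.shape[-2:]
        lo = spec[..., h // 4 : 3 * h // 4, w // 4 : 3 * w // 4]
        return (spec.sum() - lo.sum()) / spec.numel()

    def train(net, loader, motion_loss, steps_stage1, steps_stage2, lam=0.1):
        opt = torch.optim.Adam(net.parameters(), lr=1e-4)
        for step, (flows, target) in enumerate(loader):
            warp = net(flows)
            loss = motion_loss(warp, target)        # stage 1: L = Lm
            if step >= steps_stage1:                # stage 2: L = Lm + 0.1*Lf
                loss = loss + lam * freq_loss(warp)
            opt.zero_grad(); loss.backward(); opt.step()
            if step + 1 >= steps_stage1 + steps_stage2:
                break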
13. Test and Implementation Details
■ Raw warp field
• May contain artifacts at the valid/invalid region boundaries
• Caused by the inpainting of the optical flow
• The PCA Flow smoothed field is used as the warp field (see the sketch below)
Figure 10. The visual comparison of (a) the warped frames using the raw outputs of the networks trained with and (b) without Lf. The red and green boxes indicate the noisy regions. The frequency domain loss helps to improve the quality of the warp field.

Figure 11. The visual comparison of (a) the frames warped with the raw warp field and (b) the PCA Flow smoothed warp field. Due to the inpainting of the optical flow, the raw warp field may contain artifacts at the valid/invalid region boundaries.

Figure 12. Sliding-window processing of a long video, where n is the frame index within the window, k is the window index, and W1,k−1 is the warp field for the first frame of window k−1 …

… training progress. The Case-I (red curve) represents the training with Lm only. The Case-II (green curve) represents the …
■ Sliding window
• Used to smooth the entire video
• Wn,k: warp field for frame n in window k
• n: frame index within the window
• k: window index