Pr083 Non-local Neural Networks

PR 083 (29th April, 2018)
Taeoh Kim

• CVPR18 Poster
• Inspired by Non-local Means

Slides from BIL717 Image Processing, 2012

• Average Similar Pixels
• Do not Average non-Similar Pixels
Problem: Not Enough Similar Pixels in Local Regions

→ Get More Samples in Non-local Regions

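Non-local means filter:
$$ NLMF[I]_p = \frac{1}{W} \sum_q G_{\sigma}\left(\lVert V_p - V_q \rVert_2\right) I_q $$
Box average:
$$ BA[I]_p = \frac{1}{W} \sum_q I_q $$
Gaussian filter:
$$ G[I]_p = \frac{1}{W} \sum_q G_{\sigma}\left(\lVert p - q \rVert_2\right) I_q $$
Bilateral filter:
$$ BF[I]_p = \frac{1}{W} \sum_q G_{\sigma}\left(\lVert p - q \rVert_2\right) G_{\sigma_r}\left(\lVert I_p - I_q \rVert_1\right) I_q $$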

$$ NLMF[I]_p = \frac{1}{W} \sum_q G_{\sigma}\left(\lVert V_p - V_q \rVert_2\right) I_q $$
Output Value Representation (Probability Distribution)
Target Value (Pixel) vs All Values (Pixel)
Inputs
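To make the non-local means formula concrete, here is a minimal sketch on a 1-D signal. The function name `nlm_1d`, the patch radius, the border clamping, and the value of sigma are all illustrative assumptions, not taken from the slides; G is assumed to be an unnormalised Gaussian of the patch distance.

```python
import math

def nlm_1d(signal, patch_radius=1, sigma=0.5):
    """Non-local means on a 1-D signal: every sample q contributes to p,
    weighted by how similar the patches V_p and V_q are."""
    n = len(signal)

    def patch(i):
        # Patch V_i around index i; indices are clamped at the borders (assumption).
        return [signal[min(max(i + d, 0), n - 1)]
                for d in range(-patch_radius, patch_radius + 1)]

    out = []
    for p in range(n):
        vp = patch(p)
        weights = []
        for q in range(n):  # all positions, not just a local window
            vq = patch(q)
            dist2 = sum((a - b) ** 2 for a, b in zip(vp, vq))
            weights.append(math.exp(-dist2 / (2.0 * sigma ** 2)))
        w_sum = sum(weights)  # the normaliser W
        out.append(sum(w * signal[q] for q, w in enumerate(weights)) / w_sum)
    return out

noisy = [0.0, 0.1, 0.0, 1.0, 0.9, 1.0, 0.0, 0.1, 0.0]
denoised = nlm_1d(noisy)
```

Because each output is a normalised weighted average, it always stays within the range of the input values; the two flat regions at either end of `noisy` reinforce each other even though they are far apart.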

PR-049, Attention Is All You Need

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V $$
Output Value Representation
Target Value (Query) vs All Values (Keys)
Inputs

$$ \mathrm{MultiHead}(Q, K, V) = W \cdot \mathrm{Concat}(head_1, \ldots, head_h) $$
$$ head_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) $$
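A small sketch of single-head scaled dot-product attention may help connect it to the filters above. It uses plain Python lists for matrices; the helper names `softmax` and `attention` are chosen for illustration only.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention.
    Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v (lists of rows)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Compare the target (query) against all keys.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # a probability distribution over keys
        # Weighted sum of the values.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

example = attention([[1.0, 0.0]],
                    [[0.0, 0.0], [0.0, 0.0]],
                    [[1.0, 2.0], [3.0, 4.0]])
```

With identical keys the weights are uniform, so `example` is simply the mean of the two value rows.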

$$ y_i = \frac{1}{C(x)} \sum_j f(x_i, x_j) \, g(x_j) $$
Output Value Representation
Target Value (Pixel) vs All Values (Pixel)
Inputs

Another Representation of Input Local Pixels
= Weighted Sum of Local Pixels with Learned Filter
$$ y_i = \frac{1}{C(x)} \sum_{j \in 3 \times 3} w_j \, g(x_j) $$

Another Representation of Non-Local Pixels
= Weighted Sum of All Pixels with Similarity
$$ y_i = \frac{1}{C(x)} \sum_j f(x_i, x_j) \, g(x_j) $$

+ Learning…
$$ y_i = \frac{1}{C(x)} \sum_j f(x_i, x_j) \, g(x_j) $$

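$$ y_i = \frac{1}{C(x)} \sum_j f(x_i, x_j) \, g(x_j) $$
• Gaussian: $f(x_i, x_j) = \exp(x_i^T x_j)$
• Embedded Gaussian: $f(x_i, x_j) = \exp(\theta(x_i)^T \phi(x_j))$
• Dot Product: $f(x_i, x_j) = \theta(x_i)^T \phi(x_j)$
• Concatenation: $f(x_i, x_j) = \mathrm{ReLU}(w_f^T [\theta(x_i), \phi(x_j)])$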
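The four choices of pairwise function f can be sketched as follows. This assumes the embeddings θ and φ are plain linear maps given as matrices `W_theta` and `W_phi` (lists of rows) and that `w_f` is a weight vector over the concatenated embeddings; all function and parameter names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    # Linear embedding: one output entry per row of W.
    return [dot(row, x) for row in W]

def f_gaussian(xi, xj):
    return math.exp(dot(xi, xj))

def f_embedded_gaussian(xi, xj, W_theta, W_phi):
    return math.exp(dot(matvec(W_theta, xi), matvec(W_phi, xj)))

def f_dot_product(xi, xj, W_theta, W_phi):
    return dot(matvec(W_theta, xi), matvec(W_phi, xj))

def f_concat(xi, xj, W_theta, W_phi, w_f):
    # Concatenate the two embeddings, project to a scalar, then apply ReLU.
    joined = matvec(W_theta, xi) + matvec(W_phi, xj)
    return max(0.0, dot(w_f, joined))

identity = [[1.0, 0.0], [0.0, 1.0]]
xi, xj = [1.0, 2.0], [3.0, -1.0]
```

With identity embeddings, the embedded Gaussian reduces to the plain Gaussian, and the dot-product version is the same similarity without the exponential.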

$$ y_i = \frac{1}{C(x)} \sum_j f(x_i, x_j) \, g(x_j) $$
$$ g(x_j) = W_g x_j $$
For Feature Extraction

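$$ y_i = \frac{1}{\sum_j \exp(x_i^T x_j)} \sum_j \exp(x_i^T x_j) \, W_g x_j $$
[Diagram: Gaussian non-local operation. The input x (HxWx1024) is reshaped to HWx1024 and 1024xHW; their product (HWxHW) passes through a softmax. g is a 1x1 conv (HxWx512), reshaped to HWx512. The softmax output times g gives HWx512, reshaped to HxWx512. Legend: Reshape, 1x1 Conv, Operation.]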

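$$ y_i = \frac{1}{\sum_j \exp(\theta(x_i)^T \phi(x_j))} \sum_j \exp(\theta(x_i)^T \phi(x_j)) \, W_g x_j $$
[Diagram: embedded Gaussian non-local operation. The input x (HxWx1024) goes through 1x1 convs θ, φ, and g (HxWx512 each). θ is reshaped to HWx512 and φ to 512xHW; their product (HWxHW) passes through a softmax. The softmax output times g (HWx512) gives HWx512, reshaped to HxWx512. Legend: Reshape, 1x1 Conv, Operation.]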
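The embedded Gaussian version is a softmax over all positions j, so the whole operation can be written as three matrix products. The sketch below works on an already flattened feature map (HW rows, C channels) and treats the 1x1 convs as C x C' matrices; `embedded_gaussian_nl` and the toy sizes are assumptions for illustration.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    out = []
    for row in S:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        t = sum(e)
        out.append([v / t for v in e])
    return out

def embedded_gaussian_nl(X, W_theta, W_phi, W_g):
    """X: HW x C flattened feature map; W_theta, W_phi, W_g: C x C'."""
    theta = matmul(X, W_theta)                              # HW x C'
    phi = matmul(X, W_phi)                                  # HW x C'
    g = matmul(X, W_g)                                      # HW x C'
    affinity = softmax_rows(matmul(theta, transpose(phi)))  # HW x HW
    return matmul(affinity, g)                              # HW x C'

# Toy example: HW = 4 positions, C = 2 channels, C' = 1.
X = [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0], [7.0, 0.0]]
Y = embedded_gaussian_nl(X, [[0.0], [0.0]], [[1.0], [1.0]], [[1.0], [0.0]])
```

In this toy case θ maps everything to zero, so every affinity row is uniform and each output is the mean of g over all positions.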

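$$ y_i = \frac{1}{N} \sum_j \theta(x_i)^T \phi(x_j) \, W_g x_j $$
[Diagram: dot-product non-local operation. The input x (HxWx1024) goes through 1x1 convs θ, φ, and g (HxWx512 each). θ is reshaped to HWx512 and φ to 512xHW; their product (HWxHW) is scaled by 1/N instead of a softmax. The result times g (HWx512) gives HWx512, reshaped to HxWx512. Legend: Reshape, 1x1 Conv, Operation.]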

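$$ y_i = \frac{1}{N} \sum_j \mathrm{ReLU}(w_f^T [\theta(x_i), \phi(x_j)]) \, W_g x_j $$
[Diagram: concatenation non-local operation. The input x (HxWx1024) goes through 1x1 convs (HxWx512, reshaped to HWx512). θ and φ (HWx512 each) are concatenated per pair (1024 channels), projected by w_f and passed through ReLU to give HWxHW weights, scaled by 1/N. The result times g (HWx512) gives HWx512, reshaped to HxWx512. Legend: Reshape, 1x1 Conv, Operation.]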

[Diagram: non-local block. Input HxWx1024 → NL Operation → HxWx512 → 1x1 Conv → HxWx1024, then added to the input (Residual).]
$$ z_i = W_z y_i + x_i $$
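A minimal sketch of the residual wrapper z_i = W_z y_i + x_i, assuming the non-local outputs Y have already been computed and that the 1x1 conv W_z is a plain C x C' matrix. The name `non_local_block` is illustrative.

```python
def non_local_block(X, Y, W_z):
    """z_i = W_z y_i + x_i for every position i.
    X: HW x C inputs, Y: HW x C' non-local outputs, W_z: C x C' matrix."""
    Z = []
    for x_i, y_i in zip(X, Y):
        # 1x1 conv: project y_i from C' back to C channels.
        projected = [sum(w * y for w, y in zip(row, y_i)) for row in W_z]
        # Residual connection.
        Z.append([p + x for p, x in zip(projected, x_i)])
    return Z

X = [[1.0, 2.0], [3.0, 4.0]]
Y = [[10.0], [20.0]]
identity_out = non_local_block(X, Y, [[0.0], [0.0]])  # W_z = 0
mixed_out = non_local_block(X, Y, [[1.0], [2.0]])
```

With W_z set to zero the block returns its input unchanged, which is why a block of this form can be inserted into a pretrained network without altering its initial behaviour.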

+ Learning?
$$ y_i = \frac{1}{C(x)} \sum_j f(x_i, x_j) \, g(x_j) $$

Recalibrate Features?
• Global Representation
• Global Context
• Long-range Dependencies
• Shorter Paths

Squeeze-and-Excitation Networks, CVPR 2018
Channel-wise Feature Recalibration
- SENet (ILSVRC 2017 Winner)
• 2 FC (Fully Connected) Layers Between Channels
• Excitation Layer Output (Representation) x Input = Output
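The channel-wise recalibration described above can be sketched as follows. This assumes global average pooling for the squeeze step, ReLU and sigmoid activations around the two FC layers, and no biases; the name `se_block` and the weight shapes are illustrative.

```python
import math

def se_block(feature_maps, W1, W2):
    """feature_maps: C channels, each a flat list of H*W activations.
    W1: (C/r) x C and W2: C x (C/r) are the two FC layers (biases omitted)."""
    # Squeeze: one scalar per channel via global average pooling.
    squeezed = [sum(ch) / len(ch) for ch in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid, giving one gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in W1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in W2]
    # Recalibrate: excitation output x input = output.
    return [[gate * v for v in ch] for gate, ch in zip(gates, feature_maps)]

maps = [[1.0, 3.0], [2.0, 2.0]]
halved = se_block(maps, [[1.0, 1.0]], [[0.0], [0.0]])
```

With an all-zero second layer every gate is sigmoid(0) = 0.5, so `halved` is the input scaled by one half. Unlike the non-local operation, the gates here depend only on channel statistics, not on pairwise similarity between positions.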

• Squeeze-and-Excitation Networks (Channel-wise)
𝑥𝑖 / Learned Weights / 𝑥𝑗
• Self-Attention (Spatial)
Embedded 𝑥𝑖 / Similarity Weights / Embedded 𝑥𝑗 / Positional Encoding
• Non-local Neural Networks (Spatial/Temporal)
Embedded or Not 𝑥𝑖 / Similarity Weights / Embedded 𝑥𝑗

Layer | Operation | Repeat | Output Size
Conv1 | 7x7, 64, s=2 | - | 64x112x112
Pool1 | 3x3, s=2 | - | 64x56x56
Res2 | [1x1, 64 / 3x3, 64 / 1x1, 256] + [1x1, 256] | x3 | 256x56x56
Res3 | [1x1, 128, s=2 / 3x3, 128 / 1x1, 512] + [1x1, 512] | x4 | 512x28x28
Res4 | [1x1, 256, s=2 / 3x3, 256 / 1x1, 1024] + [1x1, 1024] | x6 | 1024x14x14
Res5 | [1x1, 512, s=2 / 3x3, 512 / 1x1, 2048] + [1x1, 2048] | x3 | 2048x7x7
Pool2 | 7x7 | - | 2048
FC | - | - | # of Categories

[Figure: ResNet-50, Conv1 to Res2. 3x224x224 → 7x7 conv (stride 2) → 64x112x112 → pool (stride 2) → 64x56x56 → three bottleneck blocks (1x1, 64 / 3x3, 64 / 1x1, 256, plus 1x1, 256 shortcut) → 256x56x56. Legend: Stride=2 Conv, Stride=1 Conv, Stride=2 Pool.]

[Figure: ResNet-50, Res3. 256x56x56 → four bottleneck blocks (1x1, 128 / 3x3, 128 / 1x1, 512, plus 1x1, 512 shortcut) → 512x28x28.]

[Figure: ResNet-50, Res4. 512x28x28 → six bottleneck blocks (1x1, 256 / 3x3, 256 / 1x1, 1024, plus 1x1, 1024 shortcut) → 1024x14x14.]

[Figure: ResNet-50, Res5 and classifier. 1024x14x14 → three bottleneck blocks (1x1, 512 / 3x3, 512 / 1x1, 2048, plus 1x1, 2048 shortcut) → 2048x7x7 → 7x7 pool → 2048x1x1 → FC → 1000.]

Layer | Operation | Repeat | Output Size
Conv1 | 7x7, 64, s=2 | - | 64x16x112x112
Pool1 | 3x3x3, s=2,2,2 | - | 64x8x56x56
Res2 | [1x1, 64 / 3x3, 64 / 1x1, 256] + [1x1, 256] | x3 | 256x8x56x56
Pool_T | 3x1x1, s=2,1,1 | - | 256x4x56x56
Res3 | [1x1, 128, s=2 / 3x3, 128 / 1x1, 512] + [1x1, 512] | x4 | 512x4x28x28
Res4 | [1x1, 256, s=2 / 3x3, 256 / 1x1, 1024] + [1x1, 1024] | x6 | 1024x4x14x14
Res5 | [1x1, 512, s=2 / 3x3, 512 / 1x1, 2048] + [1x1, 2048] | x3 | 2048x4x7x7
Pool2 | 4x7x7 | - | 2048x1
FC | - | - | # of Categories

Layer | Operation | Repeat | Output Size
Conv1 | 5x7x7, 64, s=2 | - | 64x16x112x112
Pool1 | 3x3x3, s=2,2,2 | - | 64x8x56x56
Res2 | [1x1, 64 / 3x3x3, 64 / 1x1, 256] + [1x1, 256] | x3 | 256x8x56x56
Pool_T | 3x1x1, s=2,1,1 | - | 256x4x56x56
Res3 | [1x1, 128, s=2 / 3x3x3, 128 / 1x1, 512] + [1x1, 512] | x4 | 512x4x28x28
Res4 | [1x1, 256, s=2 / 3x3x3, 256 / 1x1, 1024] + [1x1, 1024] | x6 | 1024x4x14x14
Res5 | [1x1, 512, s=2 / 3x3x3, 512 / 1x1, 2048] + [1x1, 2048] | x3 | 2048x4x7x7
Pool2 | 4x7x7 | - | 2048x1
FC | - | - | # of Categories

Layer | Operation | Repeat | Output Size
Conv1 | 5x7x7, 64, s=2 | - | 64x16x112x112
Pool1 | 3x3x3, s=2,2,2 | - | 64x8x56x56
Res2 | [3x1x1, 64 / 3x3, 64 / 1x1, 256] + [1x1, 256] | x3 | 256x8x56x56
Pool_T | 3x1x1, s=2,1,1 | - | 256x4x56x56
Res3 | [3x1x1, 128, s=2 / 3x3, 128 / 1x1, 512] + [1x1, 512] | x4 | 512x4x28x28
Res4 | [3x1x1, 256, s=2 / 3x3, 256 / 1x1, 1024] + [1x1, 1024] | x6 | 1024x4x14x14
Res5 | [3x1x1, 512, s=2 / 3x3, 512 / 1x1, 2048] + [1x1, 2048] | x3 | 2048x4x7x7
Pool2 | 4x7x7 | - | 2048x1
FC | - | - | # of Categories

Pretrained 3x3 → Copy x3, Divide Weights by 3 → 3x3x3
Pretrained 1x1 → Copy x3, Divide Weights by 3 → 3x1x1
• 2D Conv Training → 3D Conv Test
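The copy-and-rescale step can be checked numerically. The sketch below looks at a single output location with a toy 2x2 kernel and three frames: a pretrained 2D kernel is copied along time and divided by the number of copies, so on a clip of identical frames the inflated kernel gives the same response as the 2D one. Function names and sizes are illustrative assumptions.

```python
def conv2d_response(kernel2d, patch):
    # Response of a 2D kernel at one location: sum of elementwise products.
    return sum(k * p
               for krow, prow in zip(kernel2d, patch)
               for k, p in zip(krow, prow))

def inflate(kernel2d, t=3):
    # Copy the 2D kernel t times along the time axis and rescale by 1/t.
    return [[[k / t for k in row] for row in kernel2d] for _ in range(t)]

def conv3d_response(kernel3d, clip):
    # Response of a t x k x k kernel on t frames at one location.
    return sum(conv2d_response(k2d, frame) for k2d, frame in zip(kernel3d, clip))

kernel = [[1.0, 2.0], [3.0, 4.0]]
frame = [[1.0, 0.0], [0.0, 1.0]]
static_clip = [frame, frame, frame]  # a "video" of three identical frames

response_2d = conv2d_response(kernel, frame)
response_3d = conv3d_response(inflate(kernel, t=3), static_clip)
```

The two responses agree up to floating-point rounding, which is the property that lets 2D pretrained weights initialise the 3D network.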

• Kinetics Dataset
• ~246k Videos (Train)
• 20k Videos (Validation)
• 400 Human Action Categories

• Add 1 Non-local Block
• Right before the last residual block of res4
• The Attentional Behavior is Not the Key to the Improvement
• Similarity + Learning >> Similarity (Gaussian)

• Similar Improvement Regardless of the Stage (res2, res3, res4) Where the Block Is Added
• Except res5 (Feature Map Size Is Too Small)

• Long-range Multi-hop Communication
• Shallow 5-block ResNet50 > Deep baseline ResNet101
• Add NL Blocks > Add Residual Blocks

• Add 5 NL Blocks
• Spacetime >> Space = Time >> Baseline

• NL C2D > I3D (3D ConvNet Baseline)
• Smaller Number of FLOPS
