12. Apply these object classifiers to videos, frame by frame?
[Figure: input frames 00001TP_008820, 00001TP_008850, and 00001TP_008880 with their ground truth labels, 2D MRF results, and VGS results.]
13. Markov Random Field (MRF) for modeling Spatiotemporal Priors
[Figure: MRF structure with hidden labels sitting above observed noisy labels, connected by spatial and temporal edges; illustrated are the first-order spatial neighborhood, a higher-order spatial neighborhood, and the temporal neighborhood.]
14. Generic MRF Formulation for classification tasks
E[\{m_\mu : \mu \in G\}] = \sum_{\mu \in G} E_1(I(S[\mu]), m_\mu) + \sum_{\langle\mu,\nu\rangle} E_2(m_\mu, m_\nu)

E_1(I(S[\mu]), m_\mu) = -\log P(m_\mu \mid I(S[\mu]))

E_2(m_\mu, m_\nu) = 1 - \delta(m_\mu, m_\nu)
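As a minimal sketch of this energy (assuming the Potts pairwise term that the delta above encodes, on a 4-connected pixel grid), the NumPy snippet below evaluates E for a given label map; `labels`, `probs`, and `lam` are illustrative names, not the talk's implementation:

```python
import numpy as np

def mrf_energy(labels, probs, lam=1.0):
    """Generic MRF energy on a 4-connected pixel grid.

    labels: (H, W) int array, the current label map m.
    probs:  (H, W, K) array, classifier posteriors P(m_mu | I(S[mu])).
    lam:    weight on the pairwise (Potts) smoothness term.
    """
    h, w = labels.shape
    eps = 1e-12  # guard against log(0)
    # Unary term: E1 = -log P(m_mu | I(S[mu])), summed over all sites.
    e1 = -np.log(
        probs[np.arange(h)[:, None], np.arange(w)[None, :], labels] + eps
    ).sum()
    # Pairwise Potts term: E2 = 1 - delta(m_mu, m_nu) over 4-neighbor pairs,
    # i.e., count the label disagreements along rows and columns.
    e2 = (labels[:, :-1] != labels[:, 1:]).sum() + (labels[:-1, :] != labels[1:, :]).sum()
    return e1 + lam * e2
```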
15. Major technical contributions: MRF for modeling Spatiotemporal Priors

| Name | Application | Description |
| Bilayer MRF | Video label propagation | An additional layer of hidden variables models the motion vs. appearance model weights. |
| Higher Order Proxy Neighborhood (HOPS) | Joint segmentation and classification | Longer-range spatial smoothness at the cost of a traditional first-order neighborhood. |
| Video Graph-Shifts (VGS) | Joint segmentation and classification in videos | Simultaneously estimates the motion priors while performing multi-class semantic labeling. |
17. The inconsistent and time-consuming task of pixel labeling
[Figure: input frames (Seq05VD_f02340, Seq05VD_f02370, Seq05VD_f02400) and their semantic object labels (road, sidewalk, sign). From the Cambridge Video Driving (CamVid) dataset.]
18. Video pixel label propagation
[Figure: starting from a pixel label map (FG/BG) on a subset of labeled pixels, traditional spatial propagation fills in the rest of the frame, while spatio-temporal propagation also carries the labels forward through time.]
20. Bidirectional optical flow
[Figure: bidirectional optical flow at frames 20 and 60, computed with Black & Anandan and with Classic+NL.]
Maybe a different optical flow algorithm?
21. Why optical flow alone fails
[Figure: warping labels between frames t and t+1. With forward flow, a hole occurs wherever no flow vector lands; with reverse flow, multiple incoming flows produce the dragging effect.]
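A small sketch of the failure modes above: forward-warping a label map leaves holes where no flow lands and collisions where several flows land on the same target pixel. The function and flow convention (flow stored as per-pixel (dx, dy)) are assumptions for illustration:

```python
import numpy as np

def forward_warp_labels(labels, flow):
    """Push labels from frame t to t+1 along forward flow.

    labels: (H, W) int label map at frame t.
    flow:   (H, W, 2) forward flow (dx, dy) from t to t+1.
    Returns the warped map, a mask of holes (pixels at t+1 that
    received no incoming flow), and a count of collisions (pixels
    receiving multiple incoming flows -- the dragging effect).
    """
    h, w = labels.shape
    warped = np.full((h, w), -1, dtype=labels.dtype)  # -1 marks holes
    hits = np.zeros((h, w), dtype=int)
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    warped[yt, xt] = labels[ys, xs]     # later writes overwrite earlier ones
    np.add.at(hits, (yt, xt), 1)        # count incoming flows per target
    holes = hits == 0                   # no flow landed here -> hole
    collisions = int((hits > 1).sum())  # multiple incoming flows
    return warped, holes, collisions
```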
22. Train an appearance model on the user-annotated frame?
[Plot: labeling accuracy (%) over frames 1-71 of the sequence for three baselines: X: do-nothing, M: forward-flow, A: patch.]
24. Maybe we should do something like this?
[Figure: per-region choice of propagation source: some regions trust the appearance model ("app."), some trust optical flow ("flow"), and some use both.]
25. Turns out to be an optical flow reliability estimation problem
26. How good are our Motion vs. Appearance (MvA) weights?
[Figure: the Container and Garden sequences at frames 40 and 80: input image, ground-truth label, optical flow only, appearance only, and our method.]
27. Well, there are still problems (1)
How to weigh between motion and appearance?
[Plot: accuracy (0.4-1.0) over frames 1-71 for four weighting schemes: fixed weight for all pixels, naive cross-correlation, occlusion-aware cross-correlation, and bidirectional flow consistency.]
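The last of those schemes, bidirectional flow consistency, can be sketched as follows: a pixel's motion is trusted only if following the forward flow and then the backward flow returns near the starting point. The function name, nearest-neighbor lookup, and the exponential mapping to a [0, 1] weight are illustrative assumptions:

```python
import numpy as np

def flow_consistency_weight(fwd, bwd, tau=1.0):
    """Motion-vs-appearance weight from bidirectional flow consistency.

    fwd: (H, W, 2) optical flow (dx, dy) from frame t to t+1.
    bwd: (H, W, 2) optical flow from frame t+1 back to t.
    Returns a per-pixel weight in [0, 1]: near 1 where the forward and
    backward flows agree (trust motion), near 0 where they disagree
    (occlusion or bad flow -- trust appearance instead).
    """
    h, w, _ = fwd.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Where does each pixel land in frame t+1 (nearest-neighbor lookup)?
    xt = np.clip(np.round(xs + fwd[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + fwd[..., 1]).astype(int), 0, h - 1)
    # Round trip: fwd(p) + bwd(p + fwd(p)) should be ~0 if consistent.
    err = np.linalg.norm(fwd + bwd[yt, xt], axis=-1)
    return np.exp(-err / tau)
```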
28. Well, there are still problems (2)
[Figure: on the bus and soccer sequences: target frame for propagation, ground-truth label, the initial noisy MvA weight map, and the optimized MvA map produced by our bilayer MRF.]
29. Our bilayer MRF for Label Propagation
[Figure: our proposed bilayer MRF for video pixel label propagation. The 1st MRF layer models the hidden true pixel labels behind the observed noisy values; the 2nd layer models the hidden true MvA weights. A label change at a node changes the label layer's energy as well as the MvA layer's energy.]
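To make the role of the MvA layer concrete, here is a minimal sketch of how per-pixel MvA weights fuse the two label sources once the bilayer MRF has optimized them; the names and the linear fusion rule are illustrative assumptions, not the exact inference in the talk:

```python
import numpy as np

def fuse_predictions(p_motion, p_app, w_mva):
    """Fuse motion- and appearance-based label posteriors.

    p_motion: (H, W, K) label posteriors propagated by optical flow.
    p_app:    (H, W, K) posteriors from the appearance model.
    w_mva:    (H, W) per-pixel motion-vs-appearance weight in [0, 1],
              the hidden quantity the second MRF layer optimizes.
    """
    w = w_mva[..., None]                     # broadcast over classes
    fused = w * p_motion + (1.0 - w) * p_app
    return fused.argmax(axis=-1)             # MAP label per pixel
```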
35. What does HSF buy us?
• 100x more data for the appearance model.
• Supervoxel-level correspondences instead of just pixel-level optical flow.
• State-of-the-art pixel label propagation performance.
38. The HSF Process
[Figure: the HSF pipeline on an (x, y, t) video volume: input video -> supervoxel hierarchy -> hierarchical supervoxel fusion -> per-class label consistency maps (e.g., vehicle, flower, tree).]
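As a rough illustration of how a supervoxel segmentation lets one annotated frame label an entire clip (the core idea behind the label consistency maps above), consider this sketch; the majority-vote rule and all names are assumptions, not the exact HSF algorithm:

```python
import numpy as np

def propagate_via_supervoxels(sv, annotated_t, annotated_labels):
    """Spread labels from one annotated frame through supervoxels.

    sv:               (T, H, W) int array of supervoxel ids.
    annotated_t:      index of the user-annotated frame.
    annotated_labels: (H, W) non-negative int labels on that frame.
    Every voxel inherits the majority label its supervoxel receives
    on the annotated frame (-1 if the supervoxel never touches it).
    """
    n_sv = sv.max() + 1
    sv_label = np.full(n_sv, -1, dtype=int)
    frame_sv = sv[annotated_t].ravel()
    frame_lb = annotated_labels.ravel()
    for s in np.unique(frame_sv):
        votes = np.bincount(frame_lb[frame_sv == s])
        sv_label[s] = votes.argmax()      # majority label in this supervoxel
    return sv_label[sv]                   # (T, H, W) propagated label volume
```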
39. Automatic Selection of the Maximum Hierarchy Height

Table 3.2: Automatic hierarchy height selection by computing the supervoxel boundary error on the user-annotated frame. The shaded levels are discarded because too many of their supervoxels violate the user-defined boundaries.

| Seq \ Lv | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| Bus | 4.22% | 6.11% | 8.93% | 9.44% | 10.71% | 18.57% | 22.00% | 27.55% | 35.96% | 47.36% |
| Container | 0.08% | 0.07% | 0.16% | 0.44% | 0.86% | 2.37% | 3.28% | 6.69% | 14.11% | 21.75% |
| Garden | 0.83% | 1.74% | 2.66% | 3.90% | 6.21% | 11.37% | 20.12% | 29.74% | 30.43% | 50.68% |
| Ice | 0.11% | 0.28% | 0.89% | 1.54% | 1.99% | 2.21% | 2.32% | 2.32% | 2.41% | 27.04% |
| Paris | 0.38% | 0.46% | 0.73% | 1.30% | 2.02% | 3.68% | 9.02% | 9.48% | 11.32% | 13.93% |
| Salesman | 0.31% | 0.46% | 0.66% | 1.58% | 4.00% | 7.18% | 10.23% | 20.99% | 24.17% | 25.01% |
| Soccer | 0.29% | 0.49% | 0.61% | 1.31% | 1.57% | 1.70% | 5.43% | 19.12% | 33.89% | 38.57% |
| Stefan | 0.42% | 0.74% | 1.10% | 1.38% | 1.69% | 1.91% | 2.45% | 3.97% | 6.73% | 39.70% |
| Camvid | 1.72% | 3.55% | 6.23% | 7.51% | 11.06% | 18.45% | 25.84% | | | |
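The selection rule Table 3.2 suggests can be sketched in a few lines: keep the highest (coarsest) level whose boundary error on the annotated frame stays under a tolerance. The 5% tolerance and the function name are assumptions for illustration:

```python
def select_max_height(boundary_error_by_level, tol=0.05):
    """Pick the highest usable hierarchy level.

    boundary_error_by_level: dict {level: fraction of annotated-frame
        boundary pixels violated by that level's supervoxels}.
    Levels whose error exceeds tol are discarded (cf. the shaded cells
    in Table 3.2); tol = 5% is a hypothetical threshold.
    """
    usable = [lv for lv, err in boundary_error_by_level.items() if err <= tol]
    # Fall back to the finest level if nothing passes the tolerance.
    return max(usable) if usable else min(boundary_error_by_level)
```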
40. The Self-augmented Appearance Model

Table 3.1: Increase in training set size of the self-augmented training set (built through Hierarchical Supervoxel Fusion) over the original training set, i.e., the increase in the number of pixels available for training the appearance model.

Bus: tree 24x, horse 3x, car 48x, flower 33x, sign 8x, road 18x
Container: bldg 91x, grass 109x, tree 93x, sky 100x, water 90x, road 116x, boat 89x
Garden: bldg 96x, tree 54x, sky 31x, flower 60x
Ice: face 37x, sign 22x, road 89x, body 65x
Paris: tree 113x, face 127x, book 105x, body 44x
Salesman: tree 111x, face 102x, book 84x
Soccer: grass 66x, tree 83x, face 14x, sign 28x, dog 15x, body 62x
Stefan: grass 83x, face 1x, sign 75x, chair 1x, body 83x
Camvid: bldg 6x, tree/grass 25x, sky 1170x, road 176x, pavemt. 76x, concr. 20x, roadmk. 1756x
50. Recursive Computation of the Energy

E_1(\mu^n, m_{\mu^n}) =
\begin{cases}
E_1(I(S[\mu^n]), m_{\mu^n}) & \text{if } n = 0 \\
\sum_{\mu^{n-1} \in C(\mu^n)} E_1(\mu^{n-1}, m_{\mu^{n-1}}) & \text{otherwise}
\end{cases}

E_2(\mu^n, \nu^n, m_{\mu^n}, m_{\nu^n}) =
\begin{cases}
E_2(m_{\mu^n}, m_{\nu^n}) & \text{if } n = 0 \\
\sum_{\substack{\mu^{n-1} \in C(\mu^n),\ \nu^{n-1} \in C(\nu^n) \\ \langle \mu^{n-1}, \nu^{n-1} \rangle}} E_2(\mu^{n-1}, \nu^{n-1}, m_{\mu^{n-1}}, m_{\nu^{n-1}}) & \text{otherwise}
\end{cases}

The overall energy, specified at level 0, is computed at any level by:

E[\{m_{\mu^n} : \mu^n \in G^n\}] = \lambda_1 \sum_{\mu^n \in G^n} E_1(\mu^n, m_{\mu^n}) + \lambda_2 \sum_{\mu^n \in G^n} \omega(\mu^n, m_{\mu^n}) \sum_{\langle \mu^n, \nu^n \rangle} E_2(\mu^n, \nu^n, m_{\mu^n}, m_{\nu^n})

where

\omega(\mu^n, m_{\mu^n}) = \frac{|D^0(\mu^n)|}{\sum_{a \in D^0(\mu^n)} \sum_{\langle a, b \rangle} \delta(A^n(a), A^n(b))}
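The E1 recursion above reads naturally as a tree walk: a node's unary energy under label m is the sum of its children's unary energies under the same label, bottoming out at the pixel-level term. A minimal sketch, with all names assumed:

```python
def e1_recursive(node, label, unary, children):
    """E1 at any hierarchy level by recursing to the pixel level.

    node:     a node id at some level n.
    label:    candidate label m for that node.
    unary:    function(pixel_node, label) -> E1 at level 0.
    children: dict mapping a node to its child nodes (absent/empty
              at level 0).
    """
    kids = children.get(node, [])
    if not kids:                 # n = 0: the pixel-level unary term
        return unary(node, label)
    # n > 0: sum the children's energies under the same label.
    return sum(e1_recursive(c, label, unary, children) for c in kids)
```

In practice these per-label sums would be computed once bottom-up and cached at every node, so evaluating a candidate relabeling is cheap.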
51. The Shift-Gradient is defined as

\Delta E(m_{\mu^n} \to \hat{m}_{\mu^n})
= E[\{\hat{m}_{\mu^n} : \mu^n \in G^n\}] - E[\{m_{\mu^n} : \mu^n \in G^n\}]
= \lambda_1 [E_1(\mu^n, \hat{m}_{\mu^n}) - E_1(\mu^n, m_{\mu^n})]
+ \lambda_2 \left\{ \sum_{\mu^n \in G^n} \omega(\mu^n, \hat{m}_{\mu^n}) \sum_{\langle \mu^n, \nu^n \rangle} E_2(\mu^n, \nu^n, \hat{m}_{\mu^n}, m_{\nu^n}) - \sum_{\mu^n \in G^n} \omega(\mu^n, m_{\mu^n}) \sum_{\langle \mu^n, \nu^n \rangle} E_2(\mu^n, \nu^n, m_{\mu^n}, m_{\nu^n}) \right\}
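The shift-gradient drives a greedy loop: among all candidate shifts (a node adopting a neighbor's label), repeatedly apply the one with the most negative \Delta E until none remains. A sketch under assumed interfaces:

```python
def graph_shifts(nodes, labels, shift_gradient, neighbors):
    """Greedy energy minimization driven by shift-gradients.

    nodes:          iterable of node ids (at any hierarchy level).
    labels:         dict node -> current label, updated in place.
    shift_gradient: function(node, new_label) -> Delta E of that shift.
    neighbors:      dict node -> set of neighboring nodes; their labels
                    are the candidate shift targets.
    """
    while True:
        best = (0.0, None, None)                  # (dE, node, new_label)
        for u in nodes:
            # Candidate labels: those of u's neighbors, excluding u's own.
            for m in {labels[v] for v in neighbors[u]} - {labels[u]}:
                d = shift_gradient(u, m)
                if d < best[0]:
                    best = (d, u, m)
        if best[1] is None:                       # no negative gradient left
            return labels
        labels[best[1]] = best[2]                 # apply the best shift
```

The published algorithm maintains a list of potential (negative-gradient) shifts rather than rescanning every node each iteration; the rescan here just keeps the sketch short.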
52. Visualizing the Graph-Shifts Process and Hierarchy
[Figure: the hierarchy: input image and its nodes at levels 1-6. The energy minimization process: input label map at shift #0, #20, #40, and #60.]
53. Efficiency Improvements of using HOPS
[Figure: input, ground truth, classifier-only, first-order, and HOPS results. The classifier's output probability maps are shared by the first-order and HOPS E1 terms. First-order converges in 4830 shifts; HOPS in 3769 shifts (-22%). Legend: void, sky, water, road, grass, tree(s), mountain, animal/man, building, bridge, vehicle, coastline.]
54. Efficiency Improvements of using HOPS
[Figure: same layout as the previous slide on another image: first-order converges in 2042 shifts; HOPS in 1868 shifts (-8.6%).]
55. Qualitative Results of HOPS on the MSRC-21 dataset
[Figure: rows: image, labels, classifier only, first order, HOPS. Left: examples of HOPS outperforming first-order neighborhood models; right: mislabelings by HOPS. Legend: void, building, grass, tree, cow, horse, sheep, sky, mountain, aeroplane, water, face, car, bicycle, flower, sign, bird, book, chair, road, cat, dog, body, boat.]
56. Qualitative Results of HOPS on the LHI dataset
[Figure: rows: image, labels, classifier only, first order, HOPS. Left: examples of HOPS outperforming first-order neighborhood models; right: mislabelings by HOPS. Legend: void, sky, water, road, grass, tree(s), mountain, animal/man, building, bridge, vehicle, coastline.]
57. Quantitative Results on the MSRC-21 and LHI datasets

Table 4.1: Comparison of overall accuracy rate on the LHI dataset
| | Classifier-Only | First Order | HOPS |
| Overall accuracy | 59.71 | 72.42 | 73.48 |
| Improvement over classifier-only | -- | 12.71 | 13.77 |
| Percentage gained over first-order neighborhood's improvement | -- | -- | 8.34% |

Table 4.2: Comparison of overall accuracy rate on the MSRC dataset
| | Classifier-Only | First Order | HOPS |
| Overall accuracy | 55.87 | 74.73 | 75.04 |
| Improvement over classifier-only | -- | 18.86 | 19.17 |
| Percentage gained over first-order neighborhood's improvement | -- | -- | 1.64% |

The optimum weights for the energy models are estimated (learned) during training.
58. Problems with existing ways of modeling temporal priors
• Static temporal links between frame t-1 and frame t don't model object motion.
• Flow-based links require pre-computing optical flow: overkill, computationally expensive.
[Figure: our video graph-shifts algorithm instead starts from an initial temporal link between frame t-1 and frame t and moves it to an energy-reduced temporal link through shifts.]
68. Motivation
• Similar images often share the same parameter configuration for many computer vision algorithms.
• Utilize this knowledge to develop meta-classifiers (classifiers for classifiers).
• Utilize the local smoothness priors to speed up the parameter space exploration, as well as aid the adaptation process.
70. Optimal Config. Exploration
[Figure: the projection f() maps points x1, x2, x3 in the parameter space to f(x1), f(x2), f(x3) in the objective space, where the Pareto front lives.]
1. Given two points f(x1), f(x2) in the objective space, determine whether the unknown projection function f() is locally linear with our SPEA2-LLP algorithm: interpolate x3 = w1*x1 + w2*x2 and predict f' = w1*f(x1) + w2*f(x2).
2. If Dist(f', f(x3)) is large, f() is non-linear between f(x1) and f(x2): break the interval into smaller intervals and run SPEA2-LLP until convergence.
3. If Dist(f', f(x3)) is small, sample a few more points before concluding that f() is linear between f(x1) and f(x2).
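A single probe from step 1 can be sketched directly from the equations above; the function name, the default midpoint weight, and the distance tolerance eps are assumptions for illustration:

```python
import numpy as np

def linearity_probe(f, x1, f1, x2, f2, w1=0.5, eps=0.1):
    """One SPEA2-LLP-style local-linearity probe.

    f:      the (expensive) parameter-to-objective projection.
    x1, x2: parameter vectors; f1 = f(x1) and f2 = f(x2) are known.
    Interpolates x3 = w1*x1 + w2*x2 and the linear prediction
    f' = w1*f1 + w2*f2, then compares f' against the true f(x3).
    """
    w2 = 1.0 - w1
    x3 = w1 * np.asarray(x1, float) + w2 * np.asarray(x2, float)
    f_pred = w1 * np.asarray(f1, float) + w2 * np.asarray(f2, float)
    dist = np.linalg.norm(f_pred - np.asarray(f(x3), float))
    # Small dist -> sample a few more points before declaring linearity;
    # large dist -> split [x1, x2] into smaller intervals and recurse.
    return dist <= eps, x3, dist
```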
71. Earlier results: binarization
Test image: DIBCO 2009, P04.
PIE result: PIE automatically determines the binarization parameters in a sliding window, applying the previously learned optimal parameter configuration at every location. (PIE was trained on a different, randomly selected set, separate from DIBCO 2011.)
Hand-picked fixed-parameter result: one of the hand-picked fixed-parameter binarization results; it cannot adapt to the changing background intensity.
[Plot: precision-recall of PIE (blue diamond) vs. different fixed parameters (red square).]
72. Earlier results: binarization
Test image: DIBCO 2009, H04. Binarization result comparison (prior to post-processing and noise removal).
PIE result: PIE automatically determines the binarization parameters in a sliding window (trained on a different, randomly selected set, separate from DIBCO 2011).
Hand-picked fixed-parameter result: it cannot adapt to the changing background intensity.
[Plot: precision-recall of PIE (blue diamond) vs. different fixed parameters (red square).]
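The sliding-window setup on these two slides can be sketched as follows: each window gets its own parameter (here a single threshold), where PIE would supply the per-window choice from its learned configurations. The threshold-only parameterization and all names are stand-ins:

```python
import numpy as np

def sliding_window_binarize(img, pick_param, win=64):
    """Binarize with a per-window parameter, as in the PIE setup.

    img:        (H, W) grayscale document image in [0, 255].
    pick_param: function(window) -> threshold for that window; PIE
                would supply this from learned optimal configurations
                (here it is a stand-in).
    win:        window size in pixels.
    """
    out = np.zeros_like(img, dtype=np.uint8)
    h, w = img.shape
    for y in range(0, h, win):
        for x in range(0, w, win):
            patch = img[y:y + win, x:x + win]
            t = pick_param(patch)                   # per-window parameter
            out[y:y + win, x:x + win] = (patch > t) * 255
    return out

# A fixed global threshold (e.g., pick_param=lambda p: 128) cannot adapt
# to a changing background, which is exactly the failure mode shown above.
```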
77. PIE as an Ensemble Combiner
• Random forest with 100 randomized trees, a binary test at each node, learned by maximum information gain on a dictionary of 1024 quantized SIFT feature vectors.
• 4-class subset of Caltech 101, 15 training images per class.

Table 6.1: Results 1. PIE columns report results for 100 / 10,000 initial points.
| Class | PIE per-class precision | PIE overall avg. accuracy | Equal-weights per-class precision | Equal-weights overall avg. accuracy |
| Bass | 70.97 / 76.67 | 80.56 / 82.41 | 58.82 | 74.07 |
| Grand Piano | 88.89 / 94.74 | 80.56 / 82.41 | 76.47 | 74.07 |
| Minaret | 100 / 100 | 79.63 / 82.41 | 96.43 | 74.07 |
| Soccer Ball | 83.33 / 80.77 | 81.48 / 83.33 | 68.97 | 74.07 |
| Average | 85.80 / 88.04 | 80.56 / 82.64 | 75.17 | 74.07 |
| Average PIE improvement (%) | 14.13 / 17.12 | 8.75 / 11.56 | | |
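The equal-weights baseline in Table 6.1 is just a uniform combination of the ensemble members' posteriors; PIE's job is to find better per-member weights. A minimal sketch, with the weighted-average rule and names as illustrative assumptions:

```python
import numpy as np

def combine_ensemble(probas, weights=None):
    """Combine per-classifier posteriors with per-classifier weights.

    probas:  (C, N, K) array -- C classifiers, N samples, K classes.
    weights: (C,) weights, e.g., chosen by PIE on the Pareto front of
             per-class precision vs. overall accuracy; None falls back
             to the equal-weights baseline from Table 6.1.
    """
    c = probas.shape[0]
    w = np.full(c, 1.0 / c) if weights is None else np.asarray(weights, float)
    w = w / w.sum()                            # normalize the weights
    fused = np.tensordot(w, probas, axes=1)    # (N, K) combined posteriors
    return fused.argmax(axis=-1)               # predicted class per sample
```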
78. PIE as an Ensemble Combiner
• 12-class subset of Caltech 101, 15 training images per class.

Table 6.2: Results 2. PIE columns report results for 100 / 10,000 initial points.
| Class | PIE per-class precision | PIE overall avg. accuracy | Equal-weights per-class precision | Equal-weights overall avg. accuracy |
| Faces | 71.33 / 71.67 | 60.82 / 60.60 | 70.71 | 58.83 |
| airplanes | 74.88 / 73.36 | 60.49 / 60.60 | 68.38 | 58.83 |
| anchor | 9.52 / 16.67 | 60.38 / 60.49 | 5.00 | 58.83 |
| ant | 34.78 / 50.00 | 60.26 / 60.15 | 28.57 | 58.83 |
| barrel | 35.71 / 63.64 | 60.71 / 60.60 | 18.19 | 58.83 |
| bass | 31.82 / 23.33 | 60.49 / 60.38 | 16.13 | 58.83 |
| beaver | 20.69 / 23.53 | 60.93 / 60.26 | 18.37 | 58.83 |
| binocular | 58.82 / 61.11 | 60.26 / 60.60 | 47.37 | 58.83 |
| bonsai | 69.23 / 64.29 | 60.26 / 60.60 | 50.00 | 58.83 |
| brain | 70.97 / 69.01 | 60.04 / 60.71 | 59.52 | 58.83 |
| brontosaurus | 100 / 100 | 60.04 / 60.60 | 0.00 | 58.83 |
| car side | 59.42 / 62.40 | 60.49 / 60.71 | 57.35 | 58.83 |
| Average | 53.10 / 56.58 | 60.43 / 60.52 | 36.63 | 58.83 |
| Avg. PIE improvement (%) | 44.95 / 54.47 | 2.72 / 2.88 | | |
79. Conclusion
• Spatiotemporal priors for pixel label propagation in space-time volumes: bilayer MRF and HSF-based propagation.
• HOPS for longer-range spatial modeling; VGS for dynamic temporal modeling.
• PIE for utilizing localness priors to explore and adapt parameter configurations.
• The full potential of spatiotemporal priors is still frequently overlooked.
80. Publications
1. W. Wu, A. Y. C. Chen, L. Zhao, and J. J. Corso. Brain tumor detection and segmentation in a CRF framework with pixel-wise affinity and superpixel-level features. International Journal of Computer Assisted Radiology and Surgery, 2015.
2. S. N. Lim, A. Y. C. Chen, and X. Yang. Parameter Inference Engine (PIE) on the Pareto front. In Proceedings of the International Conference on Machine Learning, AutoML Workshop, 2014.
3. A. Y. C. Chen, S. Whitt, C. Xu, and J. J. Corso. Hierarchical supervoxel fusion for robust pixel label propagation in videos. In submission to ACM Multimedia, 2013.
4. A. Y. C. Chen and J. J. Corso. Temporally consistent multi-class video-object segmentation with the video graph-shifts algorithm. In Proceedings of the IEEE Workshop on Applications of Computer Vision, 2011.
5. D. R. Schlegel, A. Y. C. Chen, C. Xiong, J. A. Delmerico, and J. J. Corso. AirTouch: Interacting with computer systems at a distance. In Proceedings of the IEEE Workshop on Applications of Computer Vision, 2011.
6. A. Y. C. Chen and J. J. Corso. On the effects of normalization in adaptive MRF hierarchies. In Proceedings of the International Symposium CompIMAGE, 2010.
7. A. Y. C. Chen and J. J. Corso. Propagating multi-class pixel labels throughout video frames. In Proceedings of the IEEE Western New York Image Processing Workshop, 2010.
8. A. Y. C. Chen and J. J. Corso. On the effects of normalization in adaptive MRF hierarchies. Computational Modeling of Objects Represented in Images, pages 275-286, 2010.
9. Y. Tao, L. Lu, M. Dewan, A. Y. C. Chen, J. J. Corso, J. Xuan, M. Salganicoff, and A. Krishnan. Multi-level ground glass nodule detection and segmentation in CT lung images. Medical Image Computing and Computer-Assisted Intervention, 2009.
10. A. Y. C. Chen, J. J. Corso, and L. Wang. HOPS: Efficient region labeling using higher order proxy neighborhoods. In Proceedings of the IEEE International Conference on Pattern Recognition, 2008.