Improving Spatiotemporal Stability for Object Detection and Classification

Improving Spatiotemporal
Stability for Object Detection
and Classiﬁcation
Albert Y. C. Chen, Ph.D.
Computer Scientist @ Tandent Vision
2015/03/27

Videos, lots of them.
0
20
40
60
80
2007 2008 2009 2010 2011 2012
Hours of videos uploaded toYoutube every minute

Goal: automatically analyze,
organize, and archive videos.
Typical Approaches:
Classifiers, classifiers, classifiers
•Video nouns, e.g., sky, tree, building, car, etc.
•Video noun structures, e.g., horizontal flat surfaces,
vertical surfaces, non-support surfaces, etc.
•Video verbs, e.g., diving, bench press, punch.

Results are far from perfect
for example, in
Joint Segmentation and Classiﬁcation
(multiple semantic class pixel labeling)

Example annotations
Object segmentation Class segmentation
Difficult
objects
masked
Image

Example annotations
Object segmentation Class segmentationImage

State-of-the-art results from
PascalVOC 2012
Segmentation Challenge

Example segmentations
Image Ground truth
NUS_DET_SPR_GC_SPBONN_O2PCPMC_FGT_SEGM

Example Segmentations
Image Ground truth
NUS_DET_SPR_GC_SPBONN_O2PCPMC_FGT_SEGM

Apply these object classiﬁers
to videos, frame by frame?
Input
frame
Ground
truth
labels
2D MRF
results
00001TP_008820 00001TP_008850
VGS
results
00001TP_008880Name

Markov Random Field (MRF) for
modeling Spatiotemporal Priors
spatial
hidden
labels
observed
noisy
labels
temporal
ﬁrst order spatial
neighborhood
higher order spatial
neighborhood
temporal
neighborhood

Generic MRF Formulation for
classiﬁcation taks
E2 (mµ, m⌫) = 1 (mµ, m⌫)
E [{mµ : µ 2 G}]
=
X
µ2G
E1 (I (S [µ]) , mµ) +
X
hµ,⌫i
E2 (mµ, m⌫)
E1 (I (S [µ]) , mµ) = log P
⇣
mµ I (S [µ])
⌘

Major technical contributions, MRF
for modeling Spatiotemporal Priors
Name Application Description
Bilayer MRF
Video Label
Propagation
An additional layer of hidden
variables to model the motion
v.s. appearance model weights.
Higher Order
Proxy
Neighborhood
Joint segmentation
and classiﬁcation
Longer range spatial
smoothness with traditional 1st
order neighborhood.
Video Graph-
Shifts
Joint segmentation
and classiﬁcation
in videos
Simultaneously estimate the
motion priors while doing
multiple semantic class labeling.

Subproblem 1
Bootstrapping the Classiﬁer
Training process by using
Hierarchical Supervoxels

The inconsistent and time
consuming task of pixel labeling
Seq05VD_f02400Seq05VD_f02370Seq05VD_f02340
inputfram
e
sem
antic
object
label
roadsidewalk
sign
From the CambridgeVideo Driving Dataset

Video pixel label propagation
FG
Traditional Spatial
Propagation
Pixel label map
Label a subset of pixels
BG
Spatio-temporal Propagation
time

Bidirectional optical flow frame 20
Black & Anadan Classic+NL
Bidirectional optical flow frame 60
Black & Anadan Classic+NL
Maybe a different optical flow
algorithm?

Why optical ﬂow alone fails
a hole occurs the dragging effect
Forward Flow Reverse Flow
multiple
incoming
ﬂows
t t+1 t t+1

Train a appearance model on
the user annotated frame?
0
10
20
30
40
50
60
70
80
90
100
1 11 21 31 41 51 61 71
!"#$%&#'(#$)*#&+,-'!../$%.0'1"#$'23#'4#5/#-.#'
X:do-nothing
M:forward-flow
A:patch

Try again?
Motion-only Propagation Appearance-only Propagation50.00
55.00
60.00
65.00
70.00
75.00
80.00
85.00
90.00
95.00
100.00
1 11 21 31 41 51 61 71 81
!"#$%&#'(#$)*#&+,-'!../$%.0'1"#$'23#'4#5/#-.#'
X:do-nothing
M:forward-flow
A:patch

Maybe we should do
something like this?
app.
app.
flow
flow
both
both
both
both
flow
app.

Turns out to be an optical ﬂow
reliability estimation problem

How good is our Motion vs
Appearance (MvA) weights?
40
80
o. ﬂow only
The Container Sequence
input image GT label app. onlyour method
40
80
input image GT label our method o. ﬂow only app. only
The Garden Sequence

Well, there’s still problems-1
0.4
0.5
0.6
0.7
0.8
0.9
1
1 11 21 31 41 51 61 71
How to Weigh between Mot and App?
Fixed weight for all pixel
Naïve cross-correlation
Occlusion-aware cross corr.
Bidirectional flow consistency

Well, there’s still problems-2
Initial Noisy WvA
weight map
Optimized WvA map
with our bilyaer MRF
bussoccer
Target frame
for propagation Ground Truth Label

Our bilayer MRF for Label
Propagation
Observed
noisy values
(Hidden true pixel labels)
(Hidden true WvA weights)
1st layer of MRF
2nd layer of MRF
label change at
causes to change
as well as causing the WvA
layer's energy to change
Our proposed Bilayer MRF for
Video Pixel Label Pixel Label Propagatoin

Results
frame 1 frame 75frame 50frame 25
stefan
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
1 11 21 31 41 51 61 71
Stefan (tenis) Sequence
Appearance uni-model
Appearance multi-model
Do nothing
Bidirectional flow
Bad (fixed) WvA weights
Our method

Results
soccer
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 11 21 31 41 51 61
Soccer Sequence
Appearance multi-model
Do nothing
Bidirectional flow
Our method
Bad WvA weights

Results
bus
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 11 21 31 41 51 61 71
Bus Sequence
Do nothing
Bidirectional flow
Bad (fixed) MvA weight
Our method

Good, but far from perfect
• Overall accuracy still low
• Object Boundaries crossed
• Optical ﬂow reliability estimation still noisy

Hierarchical Supervoxel Fusion
(HSF) for Pixel Labeling
input video:
supervoxel
hierarchy
self-augmented
appearance model:
supervoxel ﬂow:
classiﬁer
HSF-based
Pixel-label
Propagation

What does HSF buy us?
• 100x more data for the appearance model.
• Supervoxel-level correspondences instead
of just pixel-level optical ﬂow.
• State-of-the-art pixel label propagation
performance.

Supervoxel Hierarchy and the
“right scale”

The HSF Process
y
Hierarchical Supervoxel
Fusion
x
t
Label Consistency Maps
Supervoxel
Hierarchy
y
x
t
vehicle ﬂower tree
y
x
t
input video:

Automatic Selection of the
Maximum Hierarchy Height
Soccer
grass tree face sign dog body
66x 83x 14x 28x 15x 62x
Stefan
grass face sign chair body
83x 1x 75x 1x 83x
Camvid
bldg tree/grasssky road pavemt. concr. roadmk.
6x 25x 1170x 176x 76x 20x 1756x
Table 3.1: Increase in training set size of the self-augmented training set (done
through Hierarchical Supervoxel Fusion) over the original training set.
Seq Lv 6 7 8 9 10 11 12 13 14 15
Bus 4.22% 6.11% 8.93% 9.44% 10.71% 18.57% 22.00% 27.55% 35.96% 47.36%
Container 0.08% 0.07% 0.16% 0.44% 0.86% 2.37% 3.28% 6.69% 14.11% 21.75%
Garden 0.83% 1.74% 2.66% 3.90% 6.21% 11.37% 20.12% 29.74% 30.43% 50.68%
Ice 0.11% 0.28% 0.89% 1.54% 1.99% 2.21% 2.32% 2.32% 2.41% 27.04%
Paris 0.38% 0.46% 0.73% 1.30% 2.02% 3.68% 9.02% 9.48% 11.32% 13.93%
Salesman 0.31% 0.46% 0.66% 1.58% 4.00% 7.18% 10.23% 20.99% 24.17% 25.01%
Soccer 0.29% 0.49% 0.61% 1.31% 1.57% 1.70% 5.43% 19.12% 33.89% 38.57%
Stefan 0.42% 0.74% 1.10% 1.38% 1.69% 1.91% 2.45% 3.97% 6.73% 39.70%
Camvid 1.72% 3.55% 6.23% 7.51% 11.06% 18.45% 25.84%
Table 3.2: Automatic Hierarchy Height Selection by computing the Supervoxel
Boundary Error on the user annotated frame. The shaded levels are discarded since
too many of the supervoxels violate the user-deﬁned boundaries.
Supervoxel boundary error on the user annotated frame.

The Self-augmented
Appearance Model
Bus
tree horse car ﬂower sign road
24x 3x 48x 33x 8x 18x
Container
bldg grass tree sky water road boat
91x 109x 93x 100x 90x 116x 89x
Garden
bldg tree sky ﬂower
96x 54x 31x 60x
Ice
face sign road body
37x 22x 89x 65x
Paris
tree face book body
113x 127x 105x 44x
Salesman
tree face book
111x 102x 84x
Soccer
grass tree face sign dog body
66x 83x 14x 28x 15x 62x
Stefan
grass face sign chair body
83x 1x 75x 1x 83x
Camvid
bldg tree/grasssky road pavemt. concr. roadmk.
6x 25x 1170x 176x 76x 20x 1756x
Table 3.1: Increase in training set size of the self-augmented training set (done
through Hierarchical Supervoxel Fusion) over the original training set.
Seq Lv 6 7 8 9 10 11 12 13 14 15
Bus 4.22% 6.11% 8.93% 9.44% 10.71% 18.57% 22.00% 27.55% 35.96% 47.36%
Container 0.08% 0.07% 0.16% 0.44% 0.86% 2.37% 3.28% 6.69% 14.11% 21.75%
Increase in the number of pixels available for
training the appearance model.

0.2$
0.3$
0.4$
0.5$
0.6$
1$ 11$ 21$ 31$ 41$ 51$ 61$ 71$
Bus$
Our$AG0SV$ OR0PA$ OR0SP$ OR0MM$
0.6$
0.65$
0.7$
0.75$
0.8$
0.85$
1$ 11$ 21$ 31$ 41$ 51$ 61$ 71$
Container)
Our$AG1SV$ OR1PA$ OR1SP$
0.4$
0.5$
0.6$
0.7$
0.8$
1$ 11$ 21$ 31$ 41$ 51$ 61$ 71$
Garden'
0.1$
0.3$
0.5$
0.7$
0.9$
1$ 11$ 21$ 31$ 41$ 51$ 61$ 71$
Ice$
The Self-augmented
Appearance Model

Supervoxel flow propagation
performance-1
0.2$
0.4$
0.6$
0.8$
1$ 11$ 21$ 31$ 41$ 51$ 61$ 71$ 81$
Bus$
HD.OF$ SF.BI.OF$ SVXL.flow$
0.2$
0.4$
0.6$
0.8$
1$
1$ 11$ 21$ 31$ 41$ 51$ 61$ 71$ 81$
Garden'
0.4$
0.5$
0.6$
0.7$
0.8$
0.9$
1$
1$ 11$ 21$ 31$ 41$ 51$ 61$ 71$
Ice$
HD/OF$ SF/BI/OF$ SVXL/flow$
0.5$
0.6$
0.7$
0.8$
0.9$
1$
1$ 11$ 21$ 31$ 41$ 51$ 61$ 71$ 81$
Container)

Supervoxel ﬂow propagation
performance-2
0.7$
0.8$
0.9$
1$
1$ 11$ 21$ 31$ 41$ 51$ 61$ 71$ 81$
Paris&
0.5$
0.6$
0.7$
0.8$
0.9$
1$
1$ 11$ 21$ 31$ 41$ 51$ 61$
Soccer&
0.5$
0.6$
0.7$
0.8$
0.9$
1$
1$ 11$ 21$ 31$ 41$ 51$ 61$ 71$ 81$
Salesman(
0.2$
0.4$
0.6$
0.8$
1$
1$ 11$ 21$ 31$ 41$ 51$ 61$ 71$
Stefan'

Finally, putting everything
together, our Hierarchical
Supervoxel Fusion-based Pixel
Label Propagation

Subproblem 2
Random Field Priors for
Improving the Spatiotemporal
Robustness of Classiﬁers

Problems with Traditional First
Order Neighborhood
µ
ν
ν
ν
ν
µ µ
µ µ

Higher-order Proxy Neighbors
µ
ν
ν
ν
ν
E [{mµn : µn
2 Gn
}] = 1
X
µn2Gn
E1 (µn
, mµn )
+ 2
X
µn2Gn
(
(µn
, mµn )
X
hµn,⌫ni

E2 (µn
, ⌫n
, mµn , m⌫n )
+ 0
2
X
⌫n2Gn
(⌫n
, mµn )
X
h⌫n
,⌧n
i
hµn
,⌫n
i
E2 (⌫n
, ⌧n
, mµn , m⌧n )
)

Energy Minimization via the
Graph-Shifts Algorithm
Shift
µ µν ν
P(ν) P(ν) P(µ)P(µ)

Recursive Computation of the Energy
E1 (µn
, mµn ) =
⇢
E1 (I (S [µn
]) , mµn ) if n = 0P
µn 12C(µn) E1 µn 1
, mµn 1 otherwise
E2 (µn
, ⌫n
, mµn , m⌫n ) =
8
><
>:
E2 (mµn , m⌫n ) if n = 0P
µn 1
2C(µn
)
⌫n 1
2C(⌫n
)
hµn 1
,⌫n 1
i
E2 µn 1
, ⌫n 1
, mµn 1 , m⌫n 1 otherwise
The overall energy, speciﬁed for level 0, is computed at
any level by: E [{mµn : µn
2 Gn
}] = 1
X
µn2Gn
E1 (µn
, mµn )
+ 2
X
µn2Gn

(µn
, mµn )
X
hµn,⌫ni
E2 (µn
, ⌫n
, mµn , m⌫n )
where (µn
, mµn ) =
D0
(µn
)
P
a2D0(µn)
P
ha,bi
⇣
An(a), An(b)
⌘

The Shift-Gradient is deﬁned as
E (mµn ! ˆmµn )
= E [{ ˆmµn : µn
2 Gn
}] E [{mµn : µn
2 Gn
}]
= 1 [E1 (µn
, ˆmµn ) E1 (µn
, mµn )]
+ 2
(
X
µn2Gn

(µn
, ˆmµn )
X
hµn,⌫ni
E2 (µn
, ⌫n
, ˆmµn , m⌫n )
X
µn2Gn

(µn
, mµn )
X
hµn,⌫ni
E2 (µn
, ⌫n
, mµn , m⌫n )
)
.

Visualizing the Graph-Shifts
Process and Hierarchy
Input Image lv. 1 lv. 2 lv. 3 lv. 4 lv. 5 lv. 6
The Hierarchy
Input Label shift #0 shift #20 shift #60shift #40
The Energy Minimization Process

Efficiency Improvements of
using HOPS
Input Ground Truth Classifier only First-order HOPS
Probability maps output by the classifier, and share by first-order and HOPS's E1 term:
void sky water road grass tree(s)
mountain animal/man building bridge
4830 shifts 3769 shifts
-22%
vehicle coastline

Efficiency Improvements of
using HOPS
Input Ground Truth Classifier only First-order HOPS
Probability maps output by the classifier, and share by first-order and HOPS's E1 term:
void sky water road grass tree(s)
2042 shifts 1868 shifts
-8.6%
mountain animal/man building bridge vehicle coastline

Qualitative Results of HOPS
on the MSRC-21 dataset
Legend
void building grass tree cow horse sheep sky mountain aeroplane water face
car bicycle flower sign bird book chair road cat dog body boat
Image
Labels
First
order
HOPS
Examples of HOPS outperforming first order neighborhood models Mislabeling by HOPS
Classifier
only

Qualitative Results of HOPS
on the LHI dataset
Examples of HOPS outperforming ﬁrst order neighborhood models Mislabeling by HOPS
void sky water road grass tree(s) mountain animal/man building bridge vehicle coastlineLegend
Image
Labels
First
order
HOPS
Classiﬁer
only

Quantitative Results on the
MSRC-21 and LHI datasets
Table 4.1: Comparison of overall accuracy rate on the LHI dataset
Classifier-Only First Order HOPS
Overall Accuracy 59.71 72.42 73.48
Improvement over classifier-only
overall accuracy
12.71 13.77
Percentage gained over first-order
neighborhood’s improvement
8.34%
Table 4.2: Comparison of overall accuracy rate on the MSRC dataset
overall accuracy
18.86 19.17%
1.64%
Table 4.1: Comparison of overall accuracy rate on the LHI dataset
overall accuracy
12.71 13.77
8.34%
Table 4.2: Comparison of overall accuracy rate on the MSRC dataset
overall accuracy
18.86 19.17%
1.64%
The optimum weights for the energy models are estimated (learned) during the train-
LHI
MSRC-21

Problems with existing ways of
modeling temporal priors
Doesn't model object motion
frame t-1
frame t
requires pre-computing of optical ﬂow
Initial temporal
link
Energy-reduced
temporal link
Shift
Overkill, computationally expensive
our video graph-shifts algorithm
frame t-1
frame t-1
frame t-1
frame t
frame t
frame t

Temporally Consistent Energy
Model
(µ, ⇢) =
(
0 if mµ 6= m⇢
exp( ↵||Xµ X⇢||p) otherwise
.
E[{mµ : µ 2 D}] = 1
X
µ2D
E1(I(S[µ]), mµ)
+ 2
X
hµ,⌫i
E2(mµ, m⌫) + 3
X
µ2D
Et(mµ, m⇢)
Et(mµ, m⇢) = 1 (µ, ⇢),
⇢ = argmin

||Xµ X||p,  2 {[⌘ : h0
, ⌘i} .

Overview of theVideo Graph-
Shifts Process
frame t-1 frame t
layer n layer n
layer n+1 layer n+1
Temporal
Correspondent
Change
Shift
µ

Experiments--The Buffalo
wintry driving dataset

Results sky (others) obstacles road
mjrd5_00003 mjrd5_00004 mjrd5_00005
Input
frame
Ground
truth
labels
Results:
without
temporal
links
Legend
name
Results:
with our
dynamic
temporal
links

Results
on the
Camvid
dataset
Input
frame
Ground
truth
labels
void building tree sky car sign
road pedestrian fence pole sidewalk bicyclist
Legend
00001TP_008820 00001TP_008850 00001TP_008880Name
Results:
without
temporal
links
Results:
with our
dynamic
temporal
links

Results
on the
Camvid
dataset
Input
frame
Ground
truth
labels
void building tree sky car sign
road pedestrian fence pole sidewalk bicyclist
Legend
Name Seq05VD_f01200 Seq05VD_f01230 Seq05VD_f01260
Results:
without
temporal
links
Results:
with our
dynamic
temporal
links

VGS
building 72.2 20.4 0.5 2.0 0.5 0.3 4.0
tree 16.4 79.9 1.4 1.0 0.3 0.9
sky 1.1 5.7 92.0 0.4 0.8
car 0.5 0.1 68.8 29.7 0.2 0.1 0.6
sign 80.3 7.1 12.6
road 2.4 93.0 4.6
pedestrian 9.9 16.7 0.1 18.9 1.1 25.0 0.4 12.9 12.8 2.2
fence 43.4 2.2 23.0 15.6 3.8 5.2 6.2 0.6
column 12.4 37.5 18.4 7.3 1.2 0.6 20.4 1.9 0.3
sidewalk 2.6 59.7 37.5 0.1
bicyclist 7.6 6.7 23.5 20.3 14.2 11.8 6.2 9.7
2D MRF
building 71.6 20.7 0.5 1.9 0.9 0.3 4.2
tree 17.0 78.3 1.5 0.9 0.7 0.1 1.5
sky 1.2 5.6 91.9 0.4 0.9
car 0.3 0.1 62.6 35.9 0.4 0.1 0.6 0.1
sign 76.0 8.2 15.8
road 1.6 93.7 4.7
pedestrian 12.5 23.3 0.1 19.4 2.3 17.7 0.8 6.6 16.3 0.8
fence 47.0 2.1 22.6 17.8 4.3 0.1 6.1
column 13.0 36.4 19.0 6.7 0.1 1.2 0.9 20.8 1.9
sidewalk 1.6 62.0 36.3
bicyclist 9.4 4.6 24.8 20.5 27.7 1.8 6.0 5.3
Results
on the
Camvid
dataset

Subproblem 3
Adapting the Learned
Classiﬁers to work in new
Domains

Motivation
• Similar images often share the same
parameter configuration for many
computer vision algorithms.
• Utilize this knowledge to develop meta-
classifiers (classifiers for classifiers).
• Utilize the local smoothness priors to
speed up the parameter space exploration,
as well as aid the adaptation process.

Objective function projection
Parameter
Space
xa
xb
xc
xd
Objective
Space
f(xa)
f(xb)f(xc)
f(xd)
f(x) = [f1(x),f2(x),…,fj(x)]:
unknown, non-linear,
non-convex function

Optimal Conﬁg. Exploration
Parameter Space
x1
x2
Objective Space
f(x1)
f(x2)
Pareto Front
x3
f(x3)
f()
1. Given two points f(x1), f(x2) in the objective space, determine
whether the unknown projection function f() is locally linear by
performing our SPEA2-LLP algorithm.
Objective Space
f(x1)
f(x2)
Pareto Front
f ’
2. If Dist( f ’, f(x3) ) is large, f() is
non-linear between f(x1), f(x2).
Break into smaller intervals and
do SPEA2-LLP until converge.
f(x3)
Dist(f ’, f(v3))
Objective Space
f(x1)
f(x2)
Pareto Front
f ’
f(x3)
Dist(f ’, f(x3))
3. If Dist( f ’, f(x3) ) is small,
sample a few more points
before concluding that f() is
linear between f(x1), f(x2).
f ’
x3 = w1x1+w2x2
f ’ = w1f(x1)+w2f(x2)i
xi
vi
f(xi)
f(xi)
xi xi
f(xi)

Earlier results-binarization
Using PIE to automatically determine the
binarization param. in a sliding window.
(PIE trained on a different randomly
selected separate from DIBCO2011)
Test Image: DIBCO 2009, P04
PIE result
One of the hand picked fixed parameter
binarization result. It cannot adapt to the
changing background intensity.
Hand picked fixed parameter result
Precision-recall of PIE (blue ◊) vs.
different fixed param. (red □)
Using a sliding window, using the previously learned
optimal parameter conﬁguration for every location.

Earlier results-binarization
Test Image: DIBCO 2009, H04
Using PIE to automatically determine the
binarization param. in a sliding window.
(PIE trained on a different randomly
selected separate from DIBCO2011)
Precision-recall of PIE (blue ◊) vs.
different fixed param. (red □)
Binarization Result Comparison
(prior to post-processing & noise removal)
One of the hand picked fixed parameter
binarization result. It cannot adapt to the
changing background intensity.

Earlier results
Segmentation on BSDS-500
σ=1.2,k=500
min=100
Input Image
Groundtruth
(one of the many) Our resultBad Inference
PFF default param.
σ=0.8, k=300
σ=0.22,k=688
min=167
σ=0.88,k=442
min=100
σ=0.6,k=500
min=600
σ=0.88,k=442
min=100
σ=0.5,k=500
min=800

Additional Results from using
the Parameter Inference Engine
(PIE) on other problems

Segmentation on the
Weizmann Horse Dataset

PIE as an Ensemble Combiner
PIE Equal Weights
Class Per-Class Precision.
(for 100 and 10,000
initial points)
Overall Average Ac-
curacy (for 100 and
10,000 initial points)
Per-Class
Precision
Overall
Average
Accuracy
Bass 70.97/76.67 80.56/82.41 58.82 74.07
Grand Piano 88.89/94.74 80.56/82.41 76.47 74.07
Minaret 100/100 79.63/82.41 96.43 74.07
Soccer Ball 83.33/80.77 81.48/83.33 68.97 74.07
Average 85.80/88.04 80.56/82.64 75.17 74.07
Average PIE Im-
provements (%)
14.13/17.12 8.75/11.56
Table 6.1: Results 1
• Random forest with 100 randomized trees, binary
test at each node, and learned by maximum
information gain on a dictionary of 1024 quantized
SIFT feature vectors.
4 class subset from Caltech 101, 15 training per class

PIE as an Ensemble Combiner
• aa
Average PIE Im-
provements (%)
14.13/17.12 8.75/11.56
Table 6.1: Results 1
PIE Equal Weights
Class Per-Class Precision.
(for 100 and 10,000
initial points)
Overall Average Ac-
curacy (for 100 and
10,000 initial points)
Per-Class
Precision
Overall
Average
Accuracy
Faces 71.33/71,67 60.82/60.60 70.71 58.83
airplanes 74.88/73.36 60.49/60.60 68.38 58.83
anchor 9.52/16.67 60.38/60.49 5.00 58.83
ant 34.78/50.00 60.26/60.15 28.57 58.83
barrel 35.71/63.64 60.71/60.60 18.19 58.83
bass 31.82/23.33 60.49/60.38 16.13 58.83
beaver 20.69/23.53 60.93/60.26 18.37 58.83
binocular 58.82/61.11 60.26/60.60 47.37 58.83
bonsai 69.23/64.29 60.26/60.60 50.00 58.83
brain 70.97/69.01 60.04/60.71 59.52 58.83
brontosaurus 100/100 60.04/60.60 0.00 58.83
car side 59.42/62.40 60.49/60.71 57.35 58.83
Average 53.10/56.58 60.43/60.52 36.63 58.83
Avg. PIE Im-
provements (%)
44.95/54.47 2.72/2.88
Table 6.2: Results 212 class subset from Caltech 101, 15 training per class

Conclusion
• Spatiotemporal priors for pixel label
propagation in space-time volumes: Bilayer
MRF and HSF based propagation.
• HOPS for longer range spatial modeling,
VGS for dynamic temporal modeling.
• PIE for utilizing the localness priors to
explore & adapt parameter conﬁgurations.
• Full potential of spatiotemporal priors still
frequently overlooked.

Publications
1. W.Wu,A.Y. C. Chen, L. Zhao, and J. J. Corso. Brain tumor detection and segmentation in a CRF framework with pixel-wise
afﬁnity and superpixel-level features. International Journal of Computer Assisted Radiology and Surgery, 2015.
2. S. N. Lim,A.Y. C. Chen and X.Yang. Parameter Inference Engine (PIE) on the Pareto Front. In Proceedings of International
Conference of Machine Learning,Auto ML Workshop, 2014.
3. A.Y. C. Chen, S.Whitt, C. Xu, and J. J. Corso. Hierarchical supervoxel fusion for robust pixel label propagation in videos. In
Submission to ACM Multimedia, 2013.
4. A.Y.C. Chen and J.J. Corso.Temporally consistent multi-class video-object segmentation with the video graph-shifts
algorithm. In Proceedings of IEEE Workshop on Applications of ComputerVision, 2011.
5. D.R. Schlegel,A.Y.C. Chen, C. Xiong, J.A. Delmerico, and J.J. Corso. Airtouch: Interacting with computer systems at a
distance. In Proceedings of IEEE Workshop on Applications of ComputerVision, 2011.
6. A.Y.C. Chen and J.J. Corso. On the effects of normalization in adaptive MRF Hierarchies. In Proceedings of International
Symposium CompIMAGE, 2010.
7. A.Y.C. Chen and J.J. Corso. Propagating multi-class pixel labels throughout video frames. In Proceedings of IEEE Western
NewYork Image Processing Workshop, 2010.
8. A.Y. C. Chen and J. J. Corso. On the effects of normalization in adaptive MRF Hierarchies. Computational Modeling of
Objects Represented in Images, pages 275–286, 2010.
9. Y.Tao, L. Lu, M. Dewan,A.Y. C. Chen, J. J. Corso, J. Xuan, M. Salganicoff, and A. Krishnan. Multi-level ground glass nodule
detection and segmentation in ct lung images. Medical Image Computing and Computer-Assisted Intervention, 2009.
10. A.Y.C. Chen, J.J. Corso, and L.Wang. Hops: Efﬁcient region labeling using higher order proxy neighborhoods. In
Proceedings of IEEE International Conference on Pattern Recognition, 2008.

Improving Spatiotemporal Stability for Object Detection and Classification

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Ähnlich wie Improving Spatiotemporal Stability for Object Detection and Classification

Ähnlich wie Improving Spatiotemporal Stability for Object Detection and Classification (20)

Mehr von Albert Y. C. Chen

Mehr von Albert Y. C. Chen (15)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Improving Spatiotemporal Stability for Object Detection and Classification