1. Fast Image Tagging
M. Chen (Amazon.com), A. Zheng (MSR, Redmond), and K. Weinberger (Washington Univ.)
ICML2013
ICML2013 Reading Group, 2013-07-09
Preferred Infrastructure, Inc.
Takashi Abe <tabe@preferred.jp>
7. Basic idea
Learn a (linear) mapping from images to an enriched tag set, while simultaneously completing the annotated tags:
B: tag set → enriched tag set
W: image features → enriched tag set
Training:
[Figure 1. Schematic illustration of FastTag. During training, the two classifiers B and W are forced to predict similar results. At testing time, a simple linear mapping x → Wx predicts the tags.]
The problem is jointly convex and has closed form solutions in each iteration of the optimization.
Co-regularized learning. As we are only provided with an incomplete set of tags, we create an additional auxiliary problem and obtain two sub-tasks: 1) training an image classifier x_i → W x_i that predicts the complete tag set from image features, and 2) training a mapping y_i → B y_i to enrich the existing sparse tag vector y_i by estimating which tags are likely to co-occur with those already in y_i. We train both classifiers simultaneously and force their output to agree by minimizing

    (1/n) Σ_{i=1}^n ||B y_i − W x_i||^2.    (1)
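The co-regularized objective (1) is a few lines of numpy; here is a minimal sketch with toy shapes (all names and dimensions are illustrative, not from the paper):

```python
import numpy as np

# Toy shapes (illustrative): T tags, d image-feature dims, n images.
rng = np.random.default_rng(0)
T, d, n = 5, 8, 100
Y = (rng.random((T, n)) < 0.3).astype(float)  # partial tag vectors y_i as columns
X = rng.standard_normal((d, n))               # image features x_i as columns

def co_regularized_loss(B, W, Y, X):
    """Loss (1): (1/n) * sum_i ||B y_i - W x_i||^2."""
    R = B @ Y - W @ X
    return (R ** 2).sum() / Y.shape[1]

# With B = I and W = 0 the loss reduces to the mean squared norm of the y_i.
loss = co_regularized_loss(np.eye(T), np.zeros((T, d)), Y, X)
```

Forcing the two predictors to agree is what lets the sparse user tags supervise the image classifier W.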
8. Marginalized blank-out regularization (1)
Naively minimizing (1) yields the trivial solution B = 0 = W, so regularization is required.
We want B to map the annotation y_i to the true tag set z_i.
Since z_i is not available, we instead consider a B that reconstructs y_i from a corrupted version ỹ_i obtained by dropping each entry of y_i with probability p.
If we consider generating ỹ_i repeatedly, the expected reconstruction error is:
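The blank-out corruption above is easy to simulate; a sketch (the function names are mine, not from the paper):

```python
import numpy as np

def blankout(y, p, rng):
    """Zero each entry of the tag vector y independently with probability p."""
    return y * (rng.random(y.shape) >= p)

def mc_reconstruction_error(B, y, p, n_samples, rng):
    """Monte Carlo estimate of E ||y - B ytilde||^2 over the corruption."""
    return np.mean([((y - B @ blankout(y, p, rng)) ** 2).sum()
                    for _ in range(n_samples)])
```

With B = I the expected error is p * ||y||^2 (each dropped entry contributes y_t^2), which is a handy sanity check on the estimator.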
Here, B y_i is the enriched tag set for the i-th training image, and each row of W contains the weights of a classifier that tries to predict the corresponding (enriched) tag based on image features.
This function as currently written has a trivial solution at B = 0 = W, suggesting that the current formulation is underconstrained. We next describe additional regularizations on B that guide the solution towards something more useful.

Marginalized blank-out regularization. We take inspiration from the idea of marginalized stacked denoising autoencoders (Chen et al., 2012) and related work in formulating the tag enrichment mapping B: {0, 1}^T → R^T. Our intention is to enrich the incomplete [...]
We can approximate the unknown corrupting distribution with piecewise uniform corruption in the learning step (see section 3.2). If prior knowledge on the original corruption mechanism is available, it can also easily be incorporated into our model.

More formally, for each y, a corrupted version ỹ is created by randomly removing (i.e., setting to zero) each entry in y with some probability p ≥ 0 and therefore, for each user tag vector y and dimension t, p(ỹ_t = 0) = p and p(ỹ_t = y_t) = 1 − p. We train B to optimize

    B = argmin_B (1/n) Σ_{i=1}^n ||y_i − B ỹ_i||^2.
Here, each row of B is an ordinary least squares regressor that predicts the presence of a tag given all existing tags in ỹ. To reduce variance in B, we take repeated samples of ỹ. In the limit (with infinitely many corrupted versions of y), the expected reconstruction error under the corrupting distribution can be expressed as

    r(B) = (1/n) Σ_{i=1}^n E[ ||y_i − B ỹ_i||^2 ]_{p(ỹ_i | y_i)}.    (2)
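The expectation in (2) can also be approximated the brute-force way, by fitting B on finitely many corrupted copies of Y; as the number of copies grows, the sample average approaches the expectation. A sketch (the tiny ridge term and all names are my own assumptions):

```python
import numpy as np

def fit_B_sampled(Y, p, k, rng, ridge=1e-6):
    """Least-squares fit of B on k blank-out-corrupted copies of Y (T x n)."""
    T = Y.shape[0]
    targets = np.tile(Y, k)                              # each y_i repeated k times
    corrupted = np.concatenate(
        [Y * (rng.random(Y.shape) >= p) for _ in range(k)], axis=1)
    # Normal equations for min_B ||targets - B @ corrupted||^2 (+ tiny ridge):
    A = corrupted @ corrupted.T + ridge * np.eye(T)
    return np.linalg.solve(A, corrupted @ targets.T).T   # B = targets corrupted^T A^{-1}
```

The paper instead evaluates the limit of this procedure in closed form, which is what the marginalization in (2) buys.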
9. Marginalized blank-out regularization (2)
Transforming (2):
In other words, there is no need to actually generate ỹ_i (we only need to choose p).
This expected reconstruction error is added to the loss function as a regularizer.
Let us denote as Y ≡ [y_1, · · · , y_n] the matrix containing the partial labels for each image in each column. Define P ≡ Σ_{i=1}^n y_i E[ỹ_i]^T and Q ≡ Σ_{i=1}^n E[ỹ_i ỹ_i^T]; then we can rewrite the loss in (2) as

    r(B) = (1/n) trace(B Q B^T − 2 P B^T + Y Y^T).    (3)

We use Eq. (3) to regularize B. For the uniform “blank-out” noise introduced above, we have the expected value of the corruptions E[ỹ]_{p(ỹ|y)} = (1 − p) y, [...]
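For blank-out noise the moments needed for (3) have closed forms: E[ỹ] = (1 − p) y as stated above, and (since entries are corrupted independently) E[ỹ ỹ^T] has off-diagonal entries (1 − p)^2 y_s y_t and diagonal entries (1 − p) y_t^2. A sketch of r(B) built from these moments (shapes and names are illustrative):

```python
import numpy as np

def marginalized_loss(B, Y, p):
    """r(B) = (1/n) tr(B Q B^T - 2 P B^T + Y Y^T) for blank-out noise.

    Y is T x n with tag vectors y_i as columns.
    """
    n = Y.shape[1]
    P = (1 - p) * (Y @ Y.T)                           # sum_i y_i E[ytilde_i]^T
    Q = (1 - p) ** 2 * (Y @ Y.T)                      # off-diagonal second moments
    Q += p * (1 - p) * np.diag((Y ** 2).sum(axis=1))  # diagonal correction
    return np.trace(B @ Q @ B.T - 2 * P @ B.T + Y @ Y.T) / n
```

At p = 0 this reduces to the plain reconstruction error (1/n)||Y − BY||_F^2, and no corrupted samples ever have to be drawn, matching the slide's point that ỹ never needs to be generated.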
14. Evaluation
leastSquares: FastTag without tag enrichment; the baseline
TagProp: the previous state of the art; O(n^2) training, O(n) testing
FastTag's accuracy is almost the same as TagProp's
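The P/R/F1/N+ protocol in the tables below can be sketched as follows: per-keyword precision and recall are averaged over the vocabulary, and N+ counts keywords recalled at least once (the set-based interface is my simplification):

```python
import numpy as np

def evaluate(pred_tags, true_tags, vocab):
    """pred_tags, true_tags: one set of tags per image; vocab: all keywords."""
    precisions, recalls, n_plus = [], [], 0
    for w in vocab:
        tp = sum(1 for p, t in zip(pred_tags, true_tags) if w in p and w in t)
        n_pred = sum(1 for p in pred_tags if w in p)   # images predicted to have w
        n_true = sum(1 for t in true_tags if w in t)   # images truly tagged w
        precisions.append(tp / n_pred if n_pred else 0.0)
        recalls.append(tp / n_true if n_true else 0.0)
        n_plus += recalls[-1] > 0
    P, R = float(np.mean(precisions)), float(np.mean(recalls))
    F1 = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F1, n_plus
```

Averaging per keyword (rather than per image) is why rare keywords matter so much in these benchmarks, and why N+ is reported alongside F1.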
[Figure 2. Predicted keywords using FastTag for sample images in the ESP game dataset (using all 268 keywords).]
Table 1. Comparison of FastTag and TagProp in terms of P, R, F1 score and N+ on the Corel5K dataset. Previously reported results using other image annotation techniques are also included for reference.

Name                               P   R   F1  N+
leastSquares                       29  32  30  125
CRM (Lavrenko et al., 2003)        16  19  17  107
InfNet (Metzler & Manmatha, 2004)  17  24  20  112
NPDE (Yavlinsky et al., 2005)      18  21  19  114
SML (Carneiro et al., 2007)        23  29  26  137
MBRM (Feng et al., 2004)           24  25  24  122
TGLM (Liu et al., 2009)            25  29  27  131
JEC (Makadia et al., 2008)         27  32  29  139
TagProp (Guillaumin et al., 2009)  33  42  37  160
FastTag                            32  43  37  166
[...] report the number of keywords with non-zero recall value (N+). In all metrics a higher value indicates better performance.

Baselines. We compare against leastSquares, a ridge regression model which uses the partial subset of tags y_1, . . . , y_n as labels to learn W, i.e., FastTag without tag enrichment. We also compare against the TagProp algorithm (Guillaumin et al., 2009), a local kNN method combining different distance metrics through metric learning. It is the current best performer on these benchmark sets. Most existing work does not provide publicly available implementations. As a result, [...]
Table 2. Comparison of FastTag and TagProp in terms of P, R, F1 score and N+ on the ESP game and IAPRTC-12 datasets.

                 ESP game              IAPR
Name             P   R   F1  N+        P   R   F1  N+
leastSquares     35  19  25  215       40  19  26  198
MBRM             18  19  18  209       24  23  23  223
JEC              24  19  21  222       29  19  23  211
TagProp          39  27  32  238       45  34  39  260
FastTag          46  22  30  247       47  26  34  280
4.2. Comparison with related work
Table 1 shows a detailed comparison of FastTag to the leastSquares baseline and eight published results on the Corel5K dataset. We can make three observations: 1. The performance of FastTag aligns with that of TagProp (so far the best algorithm in terms of accuracy on this dataset), and significantly outperforms the other methods; 2. The leastSquares baseline, which corresponds to FastTag without the tag enricher, performs surprisingly well compared to existing approaches, which suggests the advantage of a simple model that can extend to a large number of visual descriptors, as opposed to a complex model that can afford fewer descriptors. One may instead more cheaply glean the benefits of a complex model via non-linear transformation of the features. 3. The duo classifier formulation of FastTag, which adds the tag enricher, [...]
16.
[Figure 2. Predicted keywords using FastTag for sample images in the ESP game dataset (using all 268 keywords). Examples are grouped into High F-1 score, Low F-1 score, and Random panels; predictions include e.g. "bug, green, insect, tree, wood" and "blue, earth, globe, map, world".]