UNICAMP-UFMG at MediaEval 2012: Genre Tagging Task
1. UNICAMP-UFMG at MediaEval 2012: Genre Tagging Task
Jurandy Almeida1, Thiago Salles2, Eder F. Martins2, Otávio A. B. Penatti1, Ricardo da S. Torres1, Marcos A. Gonçalves2, and Jussara M. Almeida2
1 RECOD Lab, Institute of Computing, University of Campinas (UNICAMP)
{jurandy.almeida, penatti, rtorres}@ic.unicamp.br
2 Dept. of Computer Science, Federal University of Minas Gerais (UFMG)
{tsalles, ederfm, mgoncalv, jussara}@dcc.ufmg.br
MediaEval’12 – Pisa, Italy – October 4-5, 2012
2. Genre Tagging Task at MediaEval 2012
Automatically assign tags to Internet videos collected from blip.tv.
The focus is on tags related to the category of the video.
Each video belongs to only one category.
3. Available Resources
Table: Resources made available for the task.
Resources   Textual                      Visual
Used        Title, Tags, Descriptions    Videos
Not Used    Tweets, ASR transcripts      Shots and Keyframes
4. The 26 Genre Tags
Art, Autos and Vehicles, Business, Citizen Journalism, Comedy, Conferences and Other Events, Default Category, Documentary, Educational, Food & Drink, Gaming, Health, Literature, Movies and Television, Music and Entertainment, Personal or Auto-biographical, Politics, Religion, School and Education, Sports, Technology, The Environment, The Mainstream Media, Travel, Videoblogging, Web Development and Sites
5. Distribution of Genres (Development Set)
Figure: Bar chart of the number of videos per genre in the development set (y-axis: # of objects, 0 to 800; x-axis: the 26 genre tags).
6. Proposed Framework
It relies on two learning strategies:
1. A meta-learning scheme (stacked generalization) for combining classifiers trained on metadata information modeled as a bag-of-words.
2. A histogram of motion patterns combined with a KNN classifier for visual information (a minimal sketch follows this list).
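
To make the visual branch concrete, here is a minimal sketch of a KNN classifier over precomputed HMP signatures. The histogram-intersection distance, k = 1, the toy signature size, and the placeholder data are illustrative assumptions, not details given in the slides.

# Sketch of the visual branch: KNN over precomputed HMP signatures.
# Distance choice, k = 1, and the toy data are assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def histogram_distance(a, b):
    # 1 - histogram intersection; assumes L1-normalized histograms.
    return 1.0 - np.minimum(a, b).sum()

rng = np.random.default_rng(0)
X_train = rng.random((100, 64))                # placeholder signatures
X_train /= X_train.sum(axis=1, keepdims=True)  # (real ones have 4,888 bins)
y_train = rng.integers(0, 26, size=100)        # one of the 26 genre tags

knn = KNeighborsClassifier(n_neighbors=1, metric=histogram_distance)
knn.fit(X_train, y_train)
print(knn.predict(X_train[:3]))                # predicted genre tags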
7. Visual Approach (HMP)
Keyframes: not used.
Applies an algorithm for comparing video sequences:
“Comparison of Video Sequences with Histograms of Motion Patterns”. J. Almeida, N. J. Leite, and R. da S. Torres. IEEE International Conference on Image Processing (ICIP), 2011.
It relies on three steps (a simplified sketch follows the figure on the next slide):
1. partial decoding;
2. feature extraction;
3. signature generation.
8. Visual Approach (HMP)
Figure: Overview of the HMP approach: (1) partial decoding extracts the DC coefficients of macroblocks from the I-frames; (2) feature extraction derives motion patterns from the time series of macroblocks (previous, current, next); (3) signature generation accumulates the motion patterns into a histogram.
[Almeida et al., Comparison of Video Sequences with Histograms of Motion Patterns. ICIP 2011]
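
Below is a heavily simplified sketch of the three steps, assuming already-decoded grayscale frames. The real method works in the compressed domain (partial decoding of macroblock DC coefficients), approximated here by block-averaging, and the 8-neighbour sign pattern is an illustrative stand-in for the actual motion patterns.

import numpy as np

def hmp_signature(frames, block=16):
    # Step 1 (approximation): reduce each frame to a grid of block means,
    # standing in for the DC coefficients obtained by partial decoding.
    grids = [f.reshape(f.shape[0] // block, block,
                       f.shape[1] // block, block).mean(axis=(1, 3))
             for f in frames]
    hist = np.zeros(2 ** 8)  # one bin per 8-bit pattern (toy size)
    for prev, curr in zip(grids, grids[1:]):
        diff = curr - prev   # temporal change per block
        # Step 2: a binary pattern per interior block from the signs of
        # its 8 neighbours' temporal differences (an illustrative choice).
        for i in range(1, diff.shape[0] - 1):
            for j in range(1, diff.shape[1] - 1):
                neigh = np.delete(diff[i - 1:i + 2, j - 1:j + 2].ravel(), 4)
                code = int((neigh > 0) @ (2 ** np.arange(8)))
                hist[code] += 1  # Step 3: accumulate into the signature
    return hist / max(hist.sum(), 1)

# Toy video: 10 random 64x64 grayscale frames.
frames = np.random.default_rng(0).random((10, 64, 64))
print(hmp_signature(list(frames)).shape)  # -> (256,)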
9. Textual Approach (BoW)
Title, tags, and description used as textual features.
Only metadata in English.
Porter stemming algorithm to remove affixes.
Stop words removed.
Bag-of-words using TF×IDF weights (a minimal sketch follows).
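
A minimal sketch of the textual pipeline: stop-word removal, Porter stemming, and a TF×IDF-weighted bag-of-words, using NLTK's Porter stemmer and scikit-learn's TfidfVectorizer. Concatenating title, tags, and description into one string per video is an assumption.

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

stemmer = PorterStemmer()

def tokenize(doc):
    # Lowercase, drop stop words, then strip affixes via Porter stemming.
    return [stemmer.stem(t) for t in doc.lower().split()
            if t not in ENGLISH_STOP_WORDS]

docs = [  # placeholder title + tags + description strings
    "cooking show italian food and drink recipes",
    "weekly commentary on politics and the mainstream media",
]
vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=False)
X = vectorizer.fit_transform(docs)  # videos x stemmed-terms TF-IDF matrix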
10. The Classifiers
KNN for visual features.
Stacked generalization for textual features (a sketch follows the list below).
Level 0
◮ K-Nearest Neighbors (KNN): uses the k nearest training examples to assign a class to a test example.
◮ Naïve Bayes (NB): a probabilistic learning method that infers a model for each class and assigns to a test example the class associated with the most probable model that would have generated it.
◮ Random Forests (RF): a variation of bagging of decision trees, in which an ensemble of de-correlated decision trees is learned.
Level 1
◮ Support Vector Machine (SVM): a classifier that finds an optimal separating hyperplane between the positive and negative training documents, maximizing the distance (margin) to the closest points from either class.
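
A minimal sketch of the two-level stacked generalization above, using scikit-learn's StackingClassifier: KNN, Naïve Bayes, and Random Forests at level 0, a linear SVM at level 1. The hyperparameters and toy data are illustrative assumptions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

level0 = [
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("nb", MultinomialNB()),  # suits non-negative TF-IDF features
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
]
stack = StackingClassifier(estimators=level0, final_estimator=LinearSVC(), cv=5)

rng = np.random.default_rng(0)
X = rng.random((60, 20))          # placeholder TF-IDF matrix
y = rng.integers(0, 3, size=60)   # placeholder genre labels
stack.fit(X, y)
print(stack.predict(X[:5]))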
12. Experimental Protocol
5-fold cross-validation (a minimal sketch follows this list).
Development set size: 5,288 videos (∼33%)
Test set size: 10,161 videos (∼67%)
Vocabulary of 54,796 terms
Mean of 31.03 terms per video
Total number of visual features: 4,888
Mean of 3,703.59 visual features per video
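
For completeness, a minimal sketch of 5-fold cross-validation on the development set; the classifier, default accuracy scoring, and toy data are illustrative assumptions.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X_dev = rng.random((200, 50))          # placeholder features (dev set)
y_dev = rng.integers(0, 5, size=200)   # placeholder genre labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(MultinomialNB(), X_dev, y_dev, cv=cv)
print(scores.mean())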
13. Experimental Results
Table: MAP for each submitted run.
Experiment   Input                                MAP
Run 1        audio/visual only, no ASR            0.1238
Run 4        everything allowed, no uploader ID   0.2112
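
For reference, a minimal sketch of how MAP could be computed from per-tag ranked lists of videos; this is an illustrative implementation, not the official task scorer.

import numpy as np

def average_precision(relevant, ranked):
    # AP of one ranked list, given the set of relevant video ids.
    hits, precision_sum = 0, 0.0
    for rank, vid in enumerate(ranked, start=1):
        if vid in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(len(relevant), 1)

# MAP = mean of the per-tag APs (here two toy tags).
aps = [
    average_precision({"v1", "v3"}, ["v1", "v2", "v3", "v4"]),
    average_precision({"v2"}, ["v1", "v2", "v3", "v4"]),
]
print(np.mean(aps))  # -> ((1 + 2/3)/2 + 1/2) / 2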
14. Experimental Results
Figure: AP per class achieved in each run (run1 vs. run4; y-axis: Average Precision, 0 to 1; x-axis: the 26 genre tags).
15. Experimental Results
Visual features
perform better in the most imbalanced classes, such as “food and drink” (AP = 0.372), “school and education” (AP = 0.509), and “autos and vehicles” (AP = 0.692).
Textual features
perform better in the most frequent classes in the development set, such as “default category” (AP = 0.860) and “business” (AP = 0.557).
16. Conclusion
Remarks
We used a different learning strategy for each data modality: video similarity for processing visual content and an ensemble of classifiers for processing textual content.
Findings
The obtained results show that the proposed framework is promising: combining textual and visual features can contribute to better results.
Future work
Investigate learning strategies for combining features from different modalities, and consider other information sources, such as ASR transcripts, to include more features semantically related to each category.
17. Acknowledgements
Organizers of the Genre Tagging Task and MediaEval 2012
Brazilian funding agencies: FAPEMIG, FAPESP, CAPES, and CNPq