Overview of the MediaEval 2012 Tagging Task

2012 Tagging Task Overview
Christoph Kofler Sebastian Schmiedeke
Delft University of Technology Technical University of Berlin

Isabelle Ferrané
University of Toulouse (IRIT)

MediaEval Workshop – 4-5 October 2012 - Pisa, Italy 1

Motivations

• MediaEval
– evaluate new algorithms for multimedia access and retrieval
– emphasize the 'multi' in multimedia
– focus on human and social aspects of multimedia tasks.

• Tagging Task
– focus on semi-professional video on the Internet
– use features derived from various media or information sources
speech, audio, visual content or associated metadata
social information
– focus on tags related to the genre of the video


History
• Tagging task at MediaEval
– 2010: The Tagging Task (Wild Wild Web Version): Prediction of user tags
2010
 too many tags; great diversity
– 2011: The Genre Tagging Task: Genres from blip.tv labels (26 labels)
2011
– 2012: The Tagging Task: Same labels but much more dev/test data
2012

• MediaEval joined by
– 2009 & 2010: Internal Quaero campaigns (Video Genre Classification)
2010
 too few participants
– 2011 & 2012: Tagging task as an External Quaero evaluation
2012


Datasets

• A set of Videos (ME12TT)
– Created by the PetaMedia Network of Excellence
– downloaded from blip.tv
– episodes of shows mentioned in Twitter messages
– licensed under Creative Commons
– 14,838 episodes from 2,249 shows ~ 3,260 hours of data
– extension of the MediaEval Wild Wild Web dataset (2010)

• Split into Development and Test sets
– 2011 : 247 for development / 1,727 for test (1974 videos)
– 2012 : 5,288 for development / 9,550 for test (7.5 times more)


Genres
• 26 Genre labels from Blip.tv
... same as in 2011 : 25 genres + 1 default_catagory
1000 art 1001 autos_and_vehicles
1002 business 1003 citizen_journalism
1004 comedy 1005 conferences_and_other_events
1006 default_category 1007 documentary
1008 educational 1009 food_and_drink
1010 gaming 1011 health
1012 literature 1013 movies_and_television
1014 music_and_entertainment 1015 personal_or_auto-biographical
1016 politics 1017 religion
1018 school_and_education 1019 sports
1020 technology 1021 the_environment
1022 the_mainstream_media 1023 travel
1024 videoblogging 1025 web_development_and_sites


Information available (1/2)

• From different sources
– title, description, user tags, uploader ID, duration

– tweets mentioning the shows

– automatic processings:
• automatic speech recognition (ASR)
– English transcription
!
– some other languages Ne w

• shot boundaries and 1 keyframe per shot

Information available (2/2)

• Focus on speech data
– LIUM transcripts (5,084 files from dev / 6,879 files from test)
• English
• One-best hypotheses (NIST CTM format)
• Word-lattices (SLF HTK format) 4-Gram topology !
New
• Confusion network (ATT FSM-like format)

– LIMSI-VOCAPIA transcripts (5,237 files from dev. /7,215 files from test)
• English, French, Spanish, Dutch
• Language identification strategy (language confidence score)
if Score > 0.8 then transcription based on the detected language
else best score between English and the detected language


Task Goal
• same as in 2011
– searching and browsing the Internet for video
– using genre as a searching or organizational criterion
 Videos may not be accurately or adequately tagged
 Automatically assign genre labels using features derived from:
speech, audio, visual content, associated textual or social information

• what’s new in 2012
– provide a huge amount of data ( 7.5 times more )
– enable information retrieval as well as classification approaches with
more balanced datasets (1/3 dev; 2/3 test);
– each genre was “equally” distributed between both sets


Genre distribution over datasets

Part 1:
Genre 1000 to 1012

Part 2:
Genres 1013 to 1025


Evaluation Protocol
• Task: Predict the genre label for each video of the test set
• Submissions: up to 5 runs, represent different approaches
RUN 1 Audio and/or visual informations (including information about shots and keyframes)
RUN 2 ASR transcripts
RUN 3 All data except metadata
!
RUN 4 All data except the uploader ID (ID used in the 2011 campaign) New
RUN 5 All data

• Groundtruth : genre label associated to each video
• Metric : Mean Average Precision (MAP)
 enable to evaluate the ranked retrieval results
regarding a set of queries Q

Participants
• 2012 : 20 registered, 6 submissions (10 in 2011)
– 5 veterans, 1 new participant,
– 3 organiser-connected teams, 5 countries
System Participant Supporting
Project
KIT Karlsruhe Institute of Technology Germany Quaero

!
UNICAMP-UFMG University of Campinas
New FAPEMIG, FAPESP,
CAPES, & CNPq
Federal University of Minas Gerais Brazil

ARF University Politechnica of Bucharest Romania EXCEL POSDRU
Johannes Kepler University &
Research Institute of Artificial Intelligence Austria
Polytech Annecy-Chambery France

TUB Technical University of Berlin Germany EU FP7 VideoSense

TUD-MM Delft University of Technology The Netherlands

TUD Delft University of Technology The Netherlands


Features
• Features used: in 2011 only - in 2011 & 2012 - in 2012 only
ASR Only transcription Or with translation of Stop word Semantic
in english non-english filtering similarity
ASR Limsi, transcription & Stemming ; TF-IDF ; Top
ASR Lium Bags of Words terms
(BoW); LDA

Audio MFCC, LPC,LPS, ZCR,
ZCR Spectrogram as an image Rhythm, timbre, onset strengh,
Energy loudness, …

Visual On image: On video; On Keyframes: Bags of Visual
Color,texture Self-similarity matrix; Words
Content SIFT / rgbSIFT / SURF / HoG
Face detection Shot boundaries;
Shot length ;
Transition between
shots ; Motion features

Metadata Title, Tags Filename, Show ID, BoW
Description, Uploader ID

Others Video from YouTube Web pages from Google, Synonyms, Social data
blip.tv Wikipedia hyponyms, domain from
Video distribution terms, BoW from Delicious
over genres (dev) Wordnet

Methods (1/2)

• Machine Learning approach (ML)
– parameter extraction from audio, visual, textual data, early fusion
– feature:
• transformation (PCA),
• selection (Mutual information, Term Frequency)
• dimension reduction (LDA)
– classification methods, supervised or unsupervised
• K-NN, SVM, Naive Bayes, DBN, GMM, NN
• K-Means (clustering)
• CRF (Conditional Random Fields), Decision tree (Random Forest)
– training step, cross-validation approach and stacking
– fusion of classifier results / late fusion / majority voting


Methods (2/2)

• Information retrieval approach (IR)
– text preprocessing & text Indexing
– query and ranking list, query expansion and re-ranking methods
– fusion of ranked lists from different modalities (RRF)
– selection of the category with the highest ranking score

• Evolution since 2011
– 2011 : 2 distinct communities : ML or IR approach
– 2012 : mainly ML approach or mixed one


Resources
• Which one were used?
System AUDIO ASR VISUAL METADATA SOCIAL OTHER
KIT
UNICAMP-UFMG
ARF
TUB
TUD-MM
TUD

• Evolution since 2011
– 2011 : use of external data mainly from the web and social data
1 participant especially interested in social aspect
– 2012 : no external data, no social data

Main results (1/2)

• Each participant’s best result
SYSTEM BEST MAP APPROACH FEATURES METHOD
RUN
KIT Run3 0.3499 ML Color, Texture, rgbSIFT SVM
Run6* 0.3581 + video distrib. over genre

UNICAMP- Run 4 0.2112 ML BoW Stacking
UFMG
ARF Run5 0.3793 ML TF-IDF mtd SVM Linear
ASR Limsi

TUB Run4 0.5225 ML BoW mtd MI - Naive Bayes

TUD-MM Run4 0.3675 ML & IR TF on Visual word SVM Linear +
ASR & mtd Reciprocal Rank
Fusion

TUD Run2 0.25 ML ASR Lium DBN
one-best

Baseline results : All videos into the default category MAP = 0.0063 / Videos randomly classified MAP = 0.002


Main results (2/2)

• Official run comparison (MAP)
SYSTEM Run1 Run2 Run3 Run4 Run5 Other Runs
Audio/Visual ASR Exc. Exc. Upld. ID All
MTD

KIT 0.3008 (visual1) 0.3461
0.2329 (visual2) 0.1448
0.3499 (fusion) 0.3581
UNICAMP 0.1238 0.2112
UFMG
ARF 0.1941(visual & audio) 0.2174 0.2204 0.3793
0.1892 (audio)
TUB 0.2301 0.1035 0.2259 0.5225 0.3304

TUD-MM 0.0061 0.3127 0.2279 0.3675 0.2157 0.0577
0.0047
TUD 0.23 / 0.25
0.10 / 0.09
17
Baseline results : All videosMediaEval Workshop – 4-5 October 2012 - Pisa, randomly classified MAP = 0.002
into the default category MAP = 0.0063 /Videos Italy

Lessons learned or open questions (1/3)
• About data
– 2011 : small dataset for developement ~ 247 videos
– Difficulties to train models, external resources required

– 2012 : huge amount of data for development ~ 5,288 videos
– Enough to train models in machine learning approach

 Impact on:
- the type of methods used (ML against IR)
- the need/use of external data

 No use of social data this year:
- is it a question of community ?
- can be disappointing regarding the MediaEval motivations



• About results
– 2011 : Best system MAP 0.5626 using
– Audio, ASR, visual, metadata including the uploader ID
– external data from Google and Youtube

– 2012 : Best system (non-organiser connected) : MAP 0.3793
– TF-IDF on ASR, metadata including the uploader ID, no visual data
Best system (organizer-connected): MAP 0.5225
– BoW on metadata without the uploader ID, no visual data

 results difficult to compare regarding the great diversity of features,
of methods and of systems combining both
 monomedia (visual only ; ASR only) or multimedia contributions
 a failure analysis should help to understand « what impacts what? »



• About metric
– MAP as the official metric
– some participants provided other types of results in terms of correct
classification rate or F-score or detail AP results per genre
 would « analysing the confusion between genres » be of interest?

• About genre labels
– Labels provided by blip.tv are covering two aspects
• topics: Autos_and_vehicules, Health, Religion, ...
• Real genre : Comedy, Documentary, Videoblogging,...
 would « making a distinction between form and content » be of
interest?


Conclusion

• What to do next time?
– Has everything been said?
– Should we leave the task unchanged?
– If not, we have to define another orientation
– Should we focus on another aspect of the content?
• Interaction,
• Mood,
• User intention regarding the query
– Define what needs to be changed:
• Data
• Goals and use cases
• Metric
• ...  a lot of points need to be considered

Tagging Task Overview
More details about Metadata


Tagging Task Overview

• Motivations & History
• Datasets, Genres, Metadata & Examples
• Task Goal & Evaluation protocol
• Participants, Features & Methods
• Resources & Main results
• Conclusion


Tagging Task Overview : Examples

• Example of Medadata from
Video: Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv
Tag: 1012 literature Genre label from blip.tv

<video>
<title> <![CDATA[One Minute Rumpole and the Angel Of Death]]> </title>
<description> <![CDATA["Rumpole and the Angel of Death," by John
Mortimer, …]>
</description>
<explicit> false </explicit>
<duration> 66 </duration>
<url> http://blip.tv/file/1271048 </url>
<license>
<type> Creative Commons Attribution-NonCommercial 2.0 </type>
<id> 4 </id>
</license>
…


• Example of Medadata from
…
<tags> <string> oneminutecritic </string>
<string> fvrl </string>
<string> vancouver </string> Tags given by the uploader
<string> library </string>
<string> books </string>
</tags>
<uploader> <uid> 112708 </uid> ID of the uploader
<login> crashsolo </login>
</uploader>
<file> <filename> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv
</filename>
<link> http://blip.tv/file/get/… </link>
<size> 3745110 </size>
</file>
<comments />
</video>


• Video data (420,000 shots and keyframes)

<?xml version="1.0" encoding="utf-8" ?>
<Segmentation>
<CreationID> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659 </CreationID>
<InitialFrameID> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659/394.jpg
</InitialFrameID>
<Segments type="SHOT"> Shot
<Segment start="T00:00:00:000F1000" end="T00:00:56:357F1000">
<Index> 0 </Index> boundaries
<KeyFrameID time="T00:00:28:142F1000">
CrashsoloOneMinuteRumpoleAndTheAngelOfDeath659/394.jpg
</KeyFrameID>
</Segment>
... One keyframe per shot
</Segments>
</Segmentation>

Tagging Task Overview : Metadata

• Social data (8,856 unique twitter users)

Code: 1271048
Post: http://blip.tv/file/1271048
http://twitter.com/crashsolo
Posted 'One Minute Rumpole and the Angel Of Death' to blip.tv:
http://blip.tv/file/1271048

Upload a file on blip.tv
Level 0 & Post a tweet (level 0)
User

User’s contacts Level 1 Level 2
Contacts’own contacts


Overview of the MediaEval 2012 Tagging Task

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (10)

Andere mochten auch

Andere mochten auch (18)

Ähnlich wie Overview of the MediaEval 2012 Tagging Task

Ähnlich wie Overview of the MediaEval 2012 Tagging Task (20)

Mehr von MediaEval2012

Mehr von MediaEval2012 (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Overview of the MediaEval 2012 Tagging Task

Hinweis der Redaktion