1. 2012 Tagging Task Overview
Christoph Kofler Sebastian Schmiedeke
Delft University of Technology Technical University of Berlin
Isabelle Ferrané
University of Toulouse (IRIT)
MediaEval Workshop – 4-5 October 2012 - Pisa, Italy 1
2. Motivations
• MediaEval
– evaluate new algorithms for multimedia access and retrieval
– emphasize the 'multi' in multimedia
– focus on human and social aspects of multimedia tasks.
• Tagging Task
– focus on semi-professional video on the Internet
– use features derived from various media or information sources
speech, audio, visual content or associated metadata
social information
– focus on tags related to the genre of the video
3. History
• Tagging task at MediaEval
– 2010: The Tagging Task (Wild Wild Web version): prediction of user tags
too many tags; great diversity
– 2011: The Genre Tagging Task: genres from blip.tv labels (26 labels)
– 2012: The Tagging Task: same labels but much more dev/test data
• MediaEval joined by Quaero
– 2009 & 2010: internal Quaero campaigns (Video Genre Classification)
too few participants
– 2011 & 2012: Tagging task as an external Quaero evaluation
4. Datasets
• A set of Videos (ME12TT)
– Created by the PetaMedia Network of Excellence
– downloaded from blip.tv
– episodes of shows mentioned in Twitter messages
– licensed under Creative Commons
– 14,838 episodes from 2,249 shows ~ 3,260 hours of data
– extension of the MediaEval Wild Wild Web dataset (2010)
• Split into Development and Test sets
– 2011: 247 for development / 1,727 for test (1,974 videos)
– 2012: 5,288 for development / 9,550 for test (7.5 times more)
5. Genres
• 26 Genre labels from Blip.tv
... same as in 2011: 25 genres + 1 default_category
1000 art 1001 autos_and_vehicles
1002 business 1003 citizen_journalism
1004 comedy 1005 conferences_and_other_events
1006 default_category 1007 documentary
1008 educational 1009 food_and_drink
1010 gaming 1011 health
1012 literature 1013 movies_and_television
1014 music_and_entertainment 1015 personal_or_auto-biographical
1016 politics 1017 religion
1018 school_and_education 1019 sports
1020 technology 1021 the_environment
1022 the_mainstream_media 1023 travel
1024 videoblogging 1025 web_development_and_sites
6. Information available (1/2)
• From different sources
– title, description, user tags, uploader ID, duration
– tweets mentioning the shows
– automatic processings:
• automatic speech recognition (ASR)
– English transcription
– some other languages (new in 2012)
• shot boundaries and 1 keyframe per shot
7. Information available (2/2)
• Focus on speech data
– LIUM transcripts (5,084 files from dev / 6,879 files from test)
• English
• One-best hypotheses (NIST CTM format)
• Word lattices (HTK SLF format), 4-gram topology (new in 2012)
• Confusion networks (AT&T FSM-like format)
– LIMSI-VOCAPIA transcripts (5,237 files from dev / 7,215 files from test)
• English, French, Spanish, Dutch
• Language identification strategy (language confidence score)
if score > 0.8, the transcription is based on the detected language;
otherwise, the best score between English and the detected language decides
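This decision rule can be sketched in a few lines (a hypothetical helper: the 0.8 threshold comes from the slide, the score table and function name are assumptions):

```python
def pick_transcription_language(detected_lang, scores, threshold=0.8):
    """Choose the transcription language following the strategy above:
    trust the detected language when its confidence is high enough,
    otherwise fall back to the better-scoring of English and the
    detected language. `scores` maps language codes to confidences in [0, 1].
    """
    if scores[detected_lang] > threshold:
        return detected_lang
    # Below the threshold: keep whichever of English and the
    # detected language has the better confidence score.
    return max(("en", detected_lang), key=lambda lang: scores.get(lang, 0.0))

# French detected with high confidence -> French transcription
print(pick_transcription_language("fr", {"fr": 0.93, "en": 0.40}))  # fr
# Dutch detected with low confidence -> English wins on score
print(pick_transcription_language("nl", {"nl": 0.55, "en": 0.70}))  # en
```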
8. Task Goal
• same as in 2011
– searching and browsing the Internet for video
– using genre as a searching or organizational criterion
Videos may not be accurately or adequately tagged
Automatically assign genre labels using features derived from:
speech, audio, visual content, associated textual or social information
• what’s new in 2012
– a much larger amount of data (7.5 times more)
– enabling information retrieval as well as classification approaches, with
more balanced datasets (1/3 dev; 2/3 test)
– each genre “equally” distributed between both sets
9. Genre distribution over datasets
Part 1: genres 1000 to 1012
Part 2: genres 1013 to 1025
10. Evaluation Protocol
• Task: Predict the genre label for each video of the test set
• Submissions: up to 5 runs, representing different approaches
RUN 1 Audio and/or visual information (including information about shots and keyframes)
RUN 2 ASR transcripts
RUN 3 All data except metadata
RUN 4 All data except the uploader ID (new in 2012; the ID was used in the 2011 campaign)
RUN 5 All data
• Groundtruth: the genre label associated with each video
• Metric: Mean Average Precision (MAP)
enables evaluation of the ranked retrieval results
with respect to a set of queries Q
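As a reminder of how the metric works, a minimal MAP computation (pure Python; each query is a genre, and a ranked list of videos is scored against the set of relevant ones — the toy data is illustrative only):

```python
def average_precision(ranked_ids, relevant_ids):
    """AP: mean of precision@k over the ranks k where a relevant item appears,
    divided by the total number of relevant items."""
    hits, precisions = 0, []
    for k, vid in enumerate(ranked_ids, start=1):
        if vid in relevant_ids:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """MAP over a set of queries Q; `runs` is a list of (ranking, relevant-set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example: two genre queries over ranked video lists.
runs = [
    (["v1", "v2", "v3", "v4"], {"v1", "v3"}),  # AP = (1/1 + 2/3) / 2
    (["v2", "v1", "v4", "v3"], {"v1"}),        # AP = 1/2
]
print(round(mean_average_precision(runs), 4))  # 0.6667
```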
11. Participants
• 2012 : 20 registered, 6 submissions (10 in 2011)
– 5 veterans, 1 new participant,
– 3 organiser-connected teams, 5 countries
System | Participant | Supporting project
KIT | Karlsruhe Institute of Technology (Germany) | Quaero
UNICAMP-UFMG (new) | University of Campinas & Federal University of Minas Gerais (Brazil) | FAPEMIG, FAPESP, CAPES & CNPq
ARF | University Politehnica of Bucharest (Romania); Johannes Kepler University & Research Institute of Artificial Intelligence (Austria); Polytech Annecy-Chambery (France) | EXCEL POSDRU
TUB | Technical University of Berlin (Germany) | EU FP7 VideoSense
TUD-MM | Delft University of Technology (The Netherlands) |
TUD | Delft University of Technology (The Netherlands) |
12. Features
• Features used: in 2011 only - in 2011 & 2012 - in 2012 only
ASR Only transcription Or with translation of Stop word Semantic
in english non-english filtering similarity
ASR Limsi, transcription & Stemming ; TF-IDF ; Top
ASR Lium Bags of Words terms
(BoW); LDA
Audio MFCC, LPC,LPS, ZCR,
ZCR Spectrogram as an image Rhythm, timbre, onset strengh,
Energy loudness, …
Visual On image: On video; On Keyframes: Bags of Visual
Color,texture Self-similarity matrix; Words
Content SIFT / rgbSIFT / SURF / HoG
Face detection Shot boundaries;
Shot length ;
Transition between
shots ; Motion features
Metadata Title, Tags Filename, Show ID, BoW
Description, Uploader ID
Others Video from YouTube Web pages from Google, Synonyms, Social data
blip.tv Wikipedia hyponyms, domain from
Video distribution terms, BoW from Delicious
over genres (dev) Wordnet
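Several of the textual features listed above boil down to TF-IDF weights over Bags of Words; a minimal sketch (pure Python, with a toy corpus standing in for ASR transcripts or metadata):

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF over tokenised documents: term frequency times
    log(N / document frequency) -- the raw form of the weighting
    named in the feature table; real systems vary the normalisation."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each document counts a term once
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Toy "transcripts": a word shared by every document gets weight 0.
docs = [["cooking", "show", "food"],
        ["game", "show", "review"],
        ["food", "and", "drink", "show"]]
w = tf_idf(docs)
print(w[0]["show"])  # 0.0 (appears in all 3 documents)
```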
13. Methods (1/2)
• Machine Learning approach (ML)
– parameter extraction from audio, visual, textual data, early fusion
– feature:
• transformation (PCA),
• selection (Mutual information, Term Frequency)
• dimension reduction (LDA)
– classification methods, supervised or unsupervised
• K-NN, SVM, Naive Bayes, DBN, GMM, NN
• K-Means (clustering)
• CRF (Conditional Random Fields), Decision tree (Random Forest)
– training step, cross-validation approach and stacking
– fusion of classifier results / late fusion / majority voting
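The late-fusion / majority-voting step above can be sketched as follows (the per-classifier predictions are hypothetical; with `Counter`, ties go to the label seen first):

```python
from collections import Counter

def majority_vote(predictions):
    """Late fusion by majority voting: each classifier casts one genre
    label per video; the most frequent label wins per video."""
    fused = []
    for votes in zip(*predictions):  # one tuple of votes per video
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused

# Three hypothetical single-modality classifiers over 3 videos.
audio  = ["comedy", "sports", "gaming"]
visual = ["comedy", "music_and_entertainment", "gaming"]
asr    = ["documentary", "sports", "literature"]
print(majority_vote([audio, visual, asr]))
# ['comedy', 'sports', 'gaming']
```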
14. Methods (2/2)
• Information Retrieval approach (IR)
– text preprocessing & text indexing
– query and ranked list, query expansion and re-ranking methods
– fusion of ranked lists from different modalities (Reciprocal Rank Fusion, RRF)
– selection of the category with the highest ranking score
• Evolution since 2011
– 2011: 2 distinct communities: ML or IR approach
– 2012: mainly ML approaches, or mixed ones
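Reciprocal Rank Fusion, mentioned above, can be sketched as follows (standard RRF with the usual k = 60 constant; the per-modality rankings are hypothetical). Literature ends up first here because it appears near the top of both lists:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """RRF: each list contributes 1 / (k + rank) for every item it ranks;
    items are re-ordered by the summed score. k = 60 is the constant
    commonly used in the IR literature."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-modality rankings of candidate genres for one video.
asr_rank    = ["literature", "educational", "comedy"]
visual_rank = ["comedy", "literature", "movies_and_television"]
print(reciprocal_rank_fusion([asr_rank, visual_rank]))
```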
15. Resources
• Which ones were used?
[Matrix of resources used per system: rows KIT, UNICAMP-UFMG, ARF, TUB, TUD-MM, TUD; columns AUDIO, ASR, VISUAL, METADATA, SOCIAL, OTHER]
• Evolution since 2011
– 2011: use of external data, mainly from the web, and of social data
1 participant especially interested in the social aspect
– 2012: no external data, no social data
16. Main results (1/2)
• Each participant’s best result
SYSTEM | BEST RUN | MAP | APPROACH | FEATURES | METHOD
KIT | Run 3 | 0.3499 | ML | color, texture, rgbSIFT | SVM
KIT | Run 6* | 0.3581 | ML | same, + video distribution over genres | SVM
UNICAMP-UFMG | Run 4 | 0.2112 | ML | BoW | stacking
ARF | Run 5 | 0.3793 | ML | TF-IDF on metadata & ASR LIMSI | linear SVM
TUB | Run 4 | 0.5225 | ML | BoW on metadata | MI + Naive Bayes
TUD-MM | Run 4 | 0.3675 | ML & IR | TF on visual words, ASR & metadata | linear SVM + Reciprocal Rank Fusion
TUD | Run 2 | 0.25 | ML | ASR LIUM one-best | DBN
Baseline results: all videos in the default category, MAP = 0.0063 / videos randomly classified, MAP = 0.002
17. Main results (2/2)
• Official run comparison (MAP)
SYSTEM: Run 1 (audio/visual) / Run 2 (ASR) / Run 3 (excl. metadata) / Run 4 (excl. uploader ID) / Run 5 (all) / other runs
KIT: 0.3008 (visual1), 0.2329 (visual2), 0.3499 (fusion) / other runs: 0.3461, 0.1448, 0.3581
UNICAMP-UFMG: 0.1238 / 0.2112
ARF: 0.1941 (visual & audio), 0.1892 (audio) / 0.2174 / 0.2204 / 0.3793
TUB: 0.2301 / 0.1035 / 0.2259 / 0.5225 / 0.3304
TUD-MM: 0.0061, 0.0047 / 0.3127 / 0.2279 / 0.3675 / 0.2157 / 0.0577
TUD: 0.23, 0.25 / 0.10, 0.09
18. Lessons learned or open questions (1/3)
• About data
– 2011: small dataset for development (~247 videos)
– difficult to train models; external resources required
– 2012: huge amount of development data (~5,288 videos)
– enough to train models in a machine learning approach
Impact on:
- the type of methods used (ML versus IR)
- the need for / use of external data
No use of social data this year:
- is it a question of community?
- it can be disappointing regarding the MediaEval motivations
19. Lessons learned or open questions (2/3)
• About results
– 2011: best system, MAP 0.5626, using
– audio, ASR, visual, metadata including the uploader ID
– external data from Google and YouTube
– 2012: best system (non-organiser-connected): MAP 0.3793
– TF-IDF on ASR, metadata including the uploader ID, no visual data
best system (organiser-connected): MAP 0.5225
– BoW on metadata without the uploader ID, no visual data
Results are difficult to compare given the great diversity of features,
of methods, and of systems combining both,
with monomedia (visual only; ASR only) or multimedia contributions
A failure analysis should help to understand « what impacts what? »
20. Lessons learned or open questions (3/3)
• About metric
– MAP as the official metric
– some participants provided other types of results, in terms of correct
classification rate, F-score, or detailed AP results per genre
would « analysing the confusion between genres » be of interest?
• About genre labels
– labels provided by blip.tv cover two aspects
• topics: autos_and_vehicles, health, religion, ...
• real genres: comedy, documentary, videoblogging, ...
would « making a distinction between form and content » be of
interest?
21. Conclusion
• What to do next time?
– Has everything been said?
– Should we leave the task unchanged?
– If not, we have to define another orientation
– Should we focus on another aspect of the content?
• Interaction,
• Mood,
• User intention regarding the query
– Define what needs to be changed:
• Data
• Goals and use cases
• Metric
• ... a lot of points need to be considered
22. Tagging Task Overview
More details about Metadata
24. Tagging Task Overview : Examples
• Example of metadata from blip.tv
Video: Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv
Tag: 1012 literature Genre label from blip.tv
<video>
<title> <![CDATA[One Minute Rumpole and the Angel Of Death]]> </title>
<description> <![CDATA["Rumpole and the Angel of Death," by John
Mortimer, …]]>
</description>
<explicit> false </explicit>
<duration> 66 </duration>
<url> http://blip.tv/file/1271048 </url>
<license>
<type> Creative Commons Attribution-NonCommercial 2.0 </type>
<id> 4 </id>
</license>
…
25. Tagging Task Overview : Examples
• Example of metadata from blip.tv (continued)
…
<tags> <string> oneminutecritic </string>
<string> fvrl </string>
<string> vancouver </string> Tags given by the uploader
<string> library </string>
<string> books </string>
</tags>
<uploader> <uid> 112708 </uid> ID of the uploader
<login> crashsolo </login>
</uploader>
<file> <filename> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv
</filename>
<link> http://blip.tv/file/get/… </link>
<size> 3745110 </size>
</file>
<comments />
</video>
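The fields shown above are easy to pull out with the standard library; a minimal sketch, using a trimmed stand-in for the blip.tv metadata file (the real files contain more elements, and the helper name is an assumption):

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the metadata example above; CDATA sections
# are handled transparently by ElementTree (they are just text).
SAMPLE = """<video>
  <title><![CDATA[One Minute Rumpole and the Angel Of Death]]></title>
  <duration>66</duration>
  <tags>
    <string>oneminutecritic</string>
    <string>library</string>
    <string>books</string>
  </tags>
  <uploader>
    <uid>112708</uid>
    <login>crashsolo</login>
  </uploader>
</video>"""

def parse_metadata(xml_text):
    """Extract the fields most often fed into textual (BoW) features."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("title"),
        "duration": int(root.findtext("duration")),
        "tags": [s.text for s in root.findall("tags/string")],
        "uploader_id": root.findtext("uploader/uid"),
    }

meta = parse_metadata(SAMPLE)
print(meta["title"])  # One Minute Rumpole and the Angel Of Death
print(meta["tags"])   # ['oneminutecritic', 'library', 'books']
```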
26. Tagging Task Overview : Examples
• Video data (420,000 shots and keyframes)
Video: Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv
Tag: 1012 literature Genre label from blip.tv
<?xml version="1.0" encoding="utf-8" ?>
<Segmentation>
<CreationID> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659 </CreationID>
<InitialFrameID> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659/394.jpg
</InitialFrameID>
<Segments type="SHOT">
<Segment start="T00:00:00:000F1000" end="T00:00:56:357F1000"> Shot boundaries
<Index> 0 </Index>
<KeyFrameID time="T00:00:28:142F1000">
CrashsoloOneMinuteRumpoleAndTheAngelOfDeath659/394.jpg
</KeyFrameID>
</Segment>
... One keyframe per shot
</Segments>
</Segmentation>
27. Tagging Task Overview : Metadata
• Social data (8,856 unique Twitter users)
Video: Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv
Tag: 1012 literature Genre label from blip.tv
Code: 1271048
Post: http://blip.tv/file/1271048
http://twitter.com/crashsolo
Posted 'One Minute Rumpole and the Angel Of Death' to blip.tv:
http://blip.tv/file/1271048
Level 0: the user uploads a file on blip.tv and posts a tweet
Level 1: the user’s contacts
Level 2: the contacts’ own contacts
Editor’s notes
In this overview, I won’t detail again the main motivations of MediaEval; I will jump directly to what motivated the Tagging task. The idea was first to focus on multimedia content available on the Internet through semi-professional videos provided by blip.tv, and to take into account the « multi » aspect of multimedia: the use of features derived from various media was encouraged. This means speech, audio, visual content and associated metadata. The social aspect was introduced through the use of social data, and genre was considered as a means to access and retrieve such multimedia content.
A little bit of history first, before giving details about submissions and results. In 2010, the tagging task at MediaEval aimed at predicting the tags freely given by users uploading a video, but the great diversity and amount of tags made the task difficult. In 2011 the idea was to reduce the number of tags by focusing on the genre labels provided by blip.tv, and a subset of the data collected for the Wild Wild Web task was used. In 2012, we kept the same goal but the amount of data was increased. At the same time, within the Quaero project, a video genre classification task was open to participants in the project. The evaluation of this task was done internally in 2009 and 2010, but the small number of participants did not make the task challenging. So for two years, MediaEval has been joined by Quaero, and the Tagging task evaluated at MediaEval became an external evaluation of the Quaero multimedia content recognition task.
The tagging task is based on a corpus created by the PetaMedia Network of Excellence. Videos were downloaded from blip.tv and selected according to two main criteria: the mention of the corresponding show on Twitter, and a Creative Commons license. More than fourteen thousand episodes were selected, representing more than three thousand hours of data. The corpus used this year is an extension of the dataset proposed for the Wild Wild Web task. All the videos were split into two subsets: one for development and one for test. In 2012 the amount of data was 7.5 times bigger: more than five thousand videos for the development set and almost ten thousand for the test set.
As last year, these videos are organised around 25 genres. The number of videos per genre not being balanced, a default category was created, including genres with fewer than 10 episodes. Some genres are more related to a topic, like sports or literature; some others to a ‘real genre’, like comedy, documentary or videoblogging.
Each video comes along with different sources of information about its content. An XML file contains metadata from blip.tv: the title, the video description, the tags given by the user who uploaded the video, the uploader ID, and so on. Information about different levels of tweets mentioning the episode is also available. Automatic processing results are provided, giving information on the speech and visual content. For speech, as last year, LIMSI-VOCAPIA, a French research and development group, provided speech recognition transcriptions in English and other languages. This year the speech research group of a French university (LIUM) also provided transcriptions for English content. The visual content processing was provided by the Technical University of Berlin. Let’s show some examples.
Depending on the provider, the speech transcription is available in different forms. LIUM provided the one-best transcription hypotheses, as well as word lattices and confusion networks. LIMSI-VOCAPIA provided transcriptions in different languages according to a language identification strategy based on a language confidence score: above a threshold, the detected language is used; below it, the best score between English and the detected language decides.
The goal of the Tagging task this year is the same as in 2011. Searching and browsing the Internet for videos and using genre as a search or organisational criterion could be a good idea, but the main drawback is that most of the available videos may not be accurately or adequately tagged. So the idea of the Tagging task is to automatically assign a genre label using sets of features derived from different media. That is what we did in 2011. So what is new in 2012? We provide a huge amount of data, to enable information retrieval as well as classification or machine learning approaches, with a more balanced dataset: one third of the videos for development, the other two thirds for test, each genre being balanced between the dev and test sets.
The evaluation protocol to be followed by the participants was to use these data, separately or in a combined way, to predict the genre label for each video of the test set. In order to compare the results as well as the impact of the different media, 5 official runs were defined. Each participant had to submit up to 5 of these runs; extra runs were also allowed. The first run was based on the use of audio and/or visual content. The second one had to focus on speech through ASR results. The last three runs enable studying the impact of metadata: no metadata, metadata without the uploader ID (which boosted the results last year), and all the metadata available. The groundtruth is based on the genre label associated with the video on blip.tv. The official metric is Mean Average Precision, in order to evaluate the ranked retrieval results with respect to a set of queries concerning the genre.
This year we had 6 participants, fewer than last year, when we had 10 submissions. Five of these participants were here last year, and we had a new one from Brazil. As they will present themselves just after, I am not going to detail this. Some systems are shown in grey because a task organisation member contributed to that work.
Emphasise the difference between the datasets.
I will talk about what motivates this task and say a few words about its history. I will briefly describe the datasets, their content, and the genres taken into account in this task. Then I will present the goal of the task and the evaluation protocol followed. I will try to give a synthetic view of the various features and methods used by the participants. I will give an overview of the resources actually used and present the main results obtained. Before each participant gives you more details about their experiments, I will give some elements of conclusion.
In the metadata from blip.tv we can find the genre label, the title, and the description.
Also a set of tags freely given by the uploader, the uploader ID, and other information.
The visual information gives a shot segmentation of the video and, for each shot, its boundaries and one keyframe.
The social data from Twitter are based on the uploader’s tweet announcing the video post, and on contacts and contacts’ contacts relaying the tweet.