2. Problem Statement
The large number of devices that can take pictures and videos leads to an ever-increasing amount of uploaded multimedia content:
● 300 hours of video uploaded to YouTube every minute
● 3.25 billion hours of YouTube videos watched every month
● Cisco forecasts that by 2020 it would take a person over 5 million years to watch the video crossing global IP networks each month
3. Difference between video topic and description
● Recipe: relevant to the video
● Subscriber channels: not related to the video subject
4. Same video type - different categories
Both are cooking-related videos, yet they appear in different categories
6. Findability Problem
Problem: content becomes less and less findable
How can we fix this?
Annotating the videos could improve findability
7. User annotation
Two problems:
● Small number of tags (9 per video on average)
● May contain irrelevant tags, added to gain more views
Alternative : automatically annotate the videos
8. Solution : Automatic annotation
● Process video streams (Google Video Intelligence, Clarifai API)
● Process video subtitles (Alchemy API, Google Natural Language)
● Tools processing the same type of data likely yield different results - combine them
● Disadvantage: video tools can only provide content information, while text tools can only provide context information
● Best approach: combine tools which process different video dimensions
10. Related work
● Concept detection in video relies mostly on low level image attributes, e.g. color histograms (Lin et al., Chang et al.)
● Concept detection in subtitles has been used to assign categories to videos (Katsiouli et al.) or as a basis for finding other relevant entities (Garcia et al.)
● Crowdsourcing concepts: encourage users to play games and draw outlines of objects in the video (Di Salvo et al., Kavasidis et al.)
(Figure: color histograms for similar images)
11. Research Question
Main Research Question: How can we identify key topics in a video through processing of the video stream and its textual description?
Two sub-research questions:
● How can we determine if certain concepts are more relevant than others? (RQ1)
● How can we best align the concepts from the input sources (video stream and transcript)? (RQ2)
12. Dataset
● YouTube videos
● Of various types and lengths
● Aimed to select videos which do not fit in more than one category
● In total 519 videos
13. Tools
● 1 subtitle processing tool - Google Natural Language [1]
○ Outputs detected concepts in the order of appearance
● 2 video processing tools - Clarifai [2] and Google Video Intelligence (GVI) [3]
○ We chose these tools because the alternatives break the video into keyframes and perform concept detection on images rather than on video
Feature | Clarifai | GVI
Output format | JSON | JSON
Tags per second | yes | no
Tags ordered alphabetically | no | yes
Occurrences of same tag grouped together | no | yes
Confidence score for tag | yes | yes
'Video relevant' label for tags | no | yes
[1] https://cloud.google.com/natural-language/
[2] https://www.clarifai.com/
[3] https://cloud.google.com/video-intelligence/
14. GVI Sample Output
● Tags with a single occurrence
● Multiple occurrences of the same tag are grouped together
● Some tags are labelled as relevant at video level
15. Clarifai Sample Output - list of vectors
● A vector of seconds
● A vector of concepts for each second
● A vector of probabilities for each concept
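As a rough illustration of this shape, here is a minimal sketch that flattens such per-second vectors into (second, concept, probability) triples. The field names ("seconds", "concepts", "probabilities") and the sample data are assumptions for illustration, not the actual Clarifai response schema:

```python
from typing import Iterator, Tuple

def flatten_clarifai(output: dict) -> Iterator[Tuple[int, str, float]]:
    # Walk the parallel vectors and yield one triple per detected concept.
    for i, second in enumerate(output["seconds"]):
        for concept, prob in zip(output["concepts"][i], output["probabilities"][i]):
            yield second, concept, prob

# Hypothetical two-second excerpt of output
sample = {
    "seconds": [0, 1],
    "concepts": [["classroom", "text"], ["classroom"]],
    "probabilities": [[0.91, 0.88], [0.87]],
}
for sec, tag, p in flatten_clarifai(sample):
    print(sec, tag, p)
```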
16. After running the tools on the dataset...
● Clarifai - highest number of tags
● Subtitles - lowest number of tags
● Large number of tags unique to each tool - thus the overlap between tools is low
18. Research Methodology
Tag processing (for each tool separately)
Step 1: calculate the number of occurrences and the longest time interval of each tag
In Figure 6:
Black
● Number of occurrences = 1
● Longest time interval = 5 (last - first)
Classroom
● Number of occurrences = 3
● Longest time interval = 5 (last - first)
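A minimal sketch of Step 1, assuming each tool's output has already been reduced to a mapping from tag to the sorted list of seconds at which it was detected (the sample data is hypothetical):

```python
def occurrences_and_interval(seconds: list[int]) -> tuple[int, int]:
    # Occurrences: how many detections there are;
    # longest time interval: last detection minus first detection.
    return len(seconds), seconds[-1] - seconds[0]

detections = {"black": [2, 7], "classroom": [1, 3, 6]}
for tag, secs in detections.items():
    occ, interval = occurrences_and_interval(secs)
    print(tag, occ, interval)
```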
19. Research Methodology
Step 2: transform the confidence scale so that the tag with the highest confidence score ends up with confidence = 1 (the highest confidence score becomes the divisor)
a. Recalculate the confidence of all other tags accordingly
In the example to the right: 'text' has the highest confidence score - use it as the divisor
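A minimal sketch of Step 2's rescaling; the sample scores are made up for illustration:

```python
def normalize_confidences(confidences: dict[str, float]) -> dict[str, float]:
    divisor = max(confidences.values())  # highest score becomes the divisor
    return {tag: score / divisor for tag, score in confidences.items()}

print(normalize_confidences({"text": 0.8, "classroom": 0.6}))
# {'text': 1.0, 'classroom': 0.75}
```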
20. Research Methodology
Step 3: calculate a relevance score for each tag
● Relevance = sum of the tag's confidence scores / video length in seconds
Step 4: combine the tags from the three different outputs
● Use a simple average
● If a tool detected the tag, use its relevance score; if not, use 0
(Figure: combining tags from the three tools)
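A minimal sketch of Steps 3 and 4 combined, under assumed inputs (per-tool confidence scores per tag and the video length in seconds); the sample values are illustrative only:

```python
def relevance(confidences: list[float], video_length: int) -> float:
    # Step 3: sum of confidence scores divided by video length in seconds.
    return sum(confidences) / video_length

def combine(per_tool: dict[str, dict[str, float]]) -> dict[str, float]:
    # Step 4: simple average over the tools; a tag missing from a tool's
    # output contributes 0 to the average.
    tags = {t for scores in per_tool.values() for t in scores}
    return {t: sum(scores.get(t, 0.0) for scores in per_tool.values()) / len(per_tool)
            for t in tags}

length = 60  # hypothetical 60-second video
per_tool = {
    "clarifai":  {"classroom": relevance([0.9, 0.8, 0.7], length)},
    "gvi":       {"classroom": relevance([0.95], length), "text": relevance([1.0], length)},
    "subtitles": {},
}
print(combine(per_tool))
```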
21. Evaluation Goals
We have identified 4 evaluation goals
● Confirm our computations (EV1)
● Check for bias towards one of the tools (EV2)
● Check for any correlation between bias and video characteristics (EV3)
● Check if the automatic tools may have missed something (EV4)
Evaluate using crowdsourcing
22. Strategy for Selecting Videos to Evaluate
Choose a sample of videos which have high overlap between the 3 tools.
Because shorter videos were concluded to be more suitable for crowdsourcing (workers tend to lose focus during longer videos), we decided to show 10-second video segments.
From the sampled videos, pick the 10-second segments in which highly relevant tags (as determined after combining the tools' outputs) occur.
In total, 2169 segments to be evaluated, from 213 videos.
23. Selecting Tags to display
For each video segment, compose a list of most relevant, maybe relevant and not so relevant tags from the tags for the overall video.
At most 10 tags in each category.
Three variables help assign tags to categories:
1. The maximum relevance score for the segment (MaxConf)
2. The tag's relevance score (Rel)
3. A relevance threshold (Thresh = 0.2 if MaxConf > 0.2, Thresh = 0.02 if MaxConf <= 0.2)
Assign tags to categories as follows (see the sketch below):
1. If MaxConf - Thresh < Rel < MaxConf and the category holds fewer than 10 tags, put the tag in the current category
2. Repeat until the rule no longer holds or the category has 10 tags
3. Slide the window down: MaxConf = MaxConf - Thresh
4. Repeat until the categories are full or no tags remain
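A minimal sketch of this assignment, under one reading of the rules above; the boundary handling (whether Rel = MaxConf is included) and the sample scores are assumptions:

```python
def assign_categories(tags: dict[str, float], max_conf: float) -> list[list[str]]:
    thresh = 0.2 if max_conf > 0.2 else 0.02
    categories: list[list[str]] = [[], [], []]  # most / maybe / not so relevant
    remaining = dict(tags)
    for category in categories:
        lower = max_conf - thresh
        # Assign every remaining tag inside the current relevance window,
        # up to 10 tags per category.
        for tag, rel in sorted(remaining.items(), key=lambda kv: -kv[1]):
            if lower < rel <= max_conf and len(category) < 10:
                category.append(tag)
                del remaining[tag]
        max_conf = lower  # slide the window down for the next category
    return categories

print(assign_categories({"classroom": 0.30, "text": 0.18, "desk": 0.05}, 0.35))
# [['classroom', 'text'], ['desk'], []]
```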
24. Crowdsourcing Task
● Ask workers to watch 10 seconds of video
● Workers can then select tags related to the video from the list
● Workers can add any other tags they think are relevant to the segment
● Each task is evaluated by 15 workers
● Each worker gets 2 cents per completed task
● Workers cannot submit their results without watching all 10 seconds
26. Evaluation Results - EV1
At segment level, an average of 41.74% of highly relevant tags (as evaluated by the crowdsourcing workers) were correctly detected by the algorithm.
Maybe relevant tags show the smallest overlap of all.
Additional subtitle tags (not detected by any tool other than subtitles) have the highest overlap - BUT we counted every tag chosen by at least one worker in the same category (so their relevance may be low).
27. Evaluation Results - EV1
At video level, an average of 46.19% of the tags which the workers evaluated as highly relevant were also detected as highly relevant by the algorithm.
As at segment level, medium relevance tags have the lowest overlap.
Low relevance tags have slightly higher overlap than high relevance tags; for very short videos, however, the overlap is higher for highly relevant tags.
28. Evaluation Results - EV2
● Clarifai - mainly low relevance tags
● Most high relevance tags were detected by both visual processing tools
● Bias towards choosing tags detected by more than one tool
29. Evaluation Results - EV3
By assigning numerical values to the time distribution and the detecting tool, we calculated correlations with the corresponding Excel function (CORREL); a sketch follows below.
Assignment: {100, 200, 300, 400} corresponds to {under 3 min, 3-5 min, 5-10 min, 10-15 min}; {10, 20, 30, 40, 50} corresponds to {clarifai + gvi, clarifai + gvi + sub, clarifai, gvi, sub}.
High correlation between time distribution and processing tool for the cooking category.
No or only weak correlation for the other 4 categories (nature shows a correlation, but a very slight one).
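A minimal sketch of this coding and correlation step, using Python's statistics.correlation (3.10+), which computes the same Pearson r as Excel's CORREL; the observations are made up for illustration:

```python
from statistics import correlation

TIME_CODE = {"under 3 min": 100, "3-5 min": 200, "5-10 min": 300, "10-15 min": 400}
TOOL_CODE = {"clarifai + gvi": 10, "clarifai + gvi + sub": 20,
             "clarifai": 30, "gvi": 40, "sub": 50}

# Hypothetical observations: (duration bucket, tool(s) that detected the tag)
observations = [("under 3 min", "clarifai"), ("3-5 min", "gvi"),
                ("5-10 min", "sub"), ("10-15 min", "sub")]
x = [TIME_CODE[t] for t, _ in observations]
y = [TOOL_CODE[tool] for _, tool in observations]
print(correlation(x, y))  # Pearson correlation coefficient
```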
30. Evaluation Results - EV3
Using the same assignment as on the previous slide for the tag detection tools, we assigned {1, 2, 3, 4, 5} as aliases of {cooking, culture, nature, travel, other}.
For all time distributions, the correlation factor is negative.
No apparent correlation between categories and detection tools in any time distribution.
31. Evaluation Results - EV4
● Workers found only about 20% additional tags on top of our lists
● Most of them have low relevance
32. Evaluation Result - RQ1
● We identified a bias towards choosing tags detected by more than one tool
● These tags should rank higher in the list
● Better alignment strategy: instead of a simple average, use a weighted average (see the sketch below)
● Assign a higher weight to tags detected by more than one tool
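One possible weighted average, sketched below; the concrete weighting scheme (multiplying by the number of detecting tools) is an assumption for illustration, not the scheme fixed by this work:

```python
def weighted_average(relevances: dict[str, float]) -> float:
    # Count how many tools detected the tag at all.
    detecting = sum(1 for rel in relevances.values() if rel > 0)
    simple_avg = sum(relevances.values()) / len(relevances)
    return simple_avg * detecting  # boost tags detected by several tools

print(weighted_average({"clarifai": 0.4, "gvi": 0.5, "subtitles": 0.0}))
# ~0.6, versus 0.3 for the simple average
```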
33. Evaluation Result - RQ2
● The current alignment detects 46.19% of highly relevant tags for the sampled videos (comparing the highly relevant tags detected by our algorithm with the highly relevant tags chosen by the crowdsourcing workers)
● A percentage of tags detected as medium relevance were promoted to high relevance after crowdsourcing
● Find a better relevance threshold
34. Evaluation result - RQ2
We examined the workers' choice behaviour for each category (the other three categories are on the next slide) to see whether combining tools yields more accurate results.
● For each category, tags detected by GVI + Clarifai are chosen more often than tags from either Clarifai or GVI separately
● Adding subtitles does not make much of a difference (the highest overlap score for highly relevant tags occurs for tags detected by GVI + Clarifai)
● Subtitles have the smallest number of chosen tags (remember that the subtitle tags included here are not detected by any other tool)
35. Combining visual tools - better than using them individually
Combining visual tags with subtitles - better than using just subtitles
Linear increase in tags: as relevance decreases, the number of tags increases
36. Conclusion and Future Work
● Our alignment strategy correctly detects around 46% of relevant tags for the sampled videos
● We wanted to find out whether combining tools yields better results:
○ Tags from GVI are chosen more often than Clarifai tags
○ Most tags for the sampled videos come from GVI + Clarifai - these are more relevant
○ Adding subtitles to visual tags is better than using just subtitle tags
● Differences between video categories are small - they can be treated as one single dataset
● Related work deals mostly with one source of information, whereas we deal with information from 3 different sources
○ It is also mostly concerned with aligning tags to parts of the video, whereas we tried to find tags relevant to the whole video
● Our algorithm can be improved:
○ Include crowdsourcing to identify a better threshold, not just for confirmation
○ Use a weighted average as part of the alignment
38. References
Lin, C. Y. et al.: 'VideoAL: A Novel End-to-End MPEG-7 Video Automatic Labeling System' (2003)
Chang, S. F., Ellis, D., Jiang, W., Lee, K., Yanagawa, A., Loui, A. C., Luo, J.: 'Large-Scale Multimodal Semantic Concept Detection for Consumer Video' (2007)
Katsiouli, P., Tsetsos, V., Hadjiefthymiades, S.: 'Semantic Video Classification Based on Subtitles and Domain Terminologies' (2007)
Garcia, J. L. R., De Vocht, L., Troncy, R., Mannens, E., Van de Walle, R.: 'Describing and Contextualizing Events in TV News Shows' (2014)
Di Salvo, R., Giordano, D., Kavasidis, I.: 'A Crowdsourcing Approach to Support Video Annotation' (2014)
Kavasidis, I., Palazzo, S., Di Salvo, R., Giordano, D., Spampinato, C.: 'An Innovative Web-based Collaborative Platform for Video Annotation' (2013)